Unrestricted API usage can lead to runaway costs and fragmented client-side
throttling logic. This commit introduces a built-in quota mechanism at the
server level, enabling operators to centrally enforce per-client and anonymous
rate limits—without needing external proxies or client changes.
This helps contain compute costs, enforces fair usage, and simplifies deployment
and monitoring of Llama Stack services. Quotas are fully opt-in and have no
effect unless explicitly configured.
Currently, SQLite is the only supported KV store. If quotas are
configured but authentication is disabled, authenticated limits will
gracefully fall back to anonymous limits.
Highlights:
- Adds `QuotaMiddleware` to enforce request quotas:
- Uses bearer token as client ID if present; otherwise falls back to IP address
- Tracks requests in KV store with per-key TTL expiration
- Returns HTTP 429 if a client exceeds their quota
- Extends `ServerConfig` with a `quota` section:
- `kvstore`: configuration for the backend (currently only SQLite)
- `anonymous_max_requests`: per-period cap for unauthenticated clients
- `authenticated_max_requests`: per-period cap for authenticated clients
- `period`: duration of the quota window (currently only `day` is supported)
- Adds full test coverage with FastAPI `TestClient` and custom middleware injection
Behavior changes:
- Quotas are disabled by default unless explicitly configured
- Anonymous users get a conservative default quota; authenticated clients can be given more generous limits
To enable per-client request quotas in `run.yaml`, add:
```yaml
server:
port: 8321
auth:
provider_type: custom
config:
endpoint: https://auth.example.com/validate
quota:
kvstore:
type: sqlite
db_path: ./quotas.db
anonymous_max_requests: 100
authenticated_max_requests: 1000
period: day
```
Signed-off-by: Wen Liang <wenliang@redhat.com>
# What does this PR do?
This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:
- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage
The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints
## Test Plan
Setup a Kube cluster using Kind or Minikube.
Run a server with:
```
server:
port: 8321
auth:
provider_type: kubernetes
config:
api_server_url: http://url
ca_cert_path: path/to/cert (optional)
```
Run:
```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```
Or replace "my-user" with your service account.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Another doc enhancement for
https://github.com/meta-llama/llama-stack/issues/1818
Summary of changes:
- `docs/source/distributions/configuration.md`
- Updated dropdown title to include a more user-friendly description.
- `docs/_static/css/my_theme.css`
- Added styling for `<h3>` elements to set a normal font weight.
- `docs/source/distributions/starting_llama_stack_server.md`
- Changed section headers from bold text to proper markdown headers
(e.g., `##`).
- Improved descriptions for starting Llama Stack server using different
methods (library, container, conda, Kubernetes).
- Enhanced clarity and structure by converting instructions into
markdown headers and improved formatting.
- `docs/source/getting_started/index.md`
- Major restructuring of the "Quick Start" guide:
- Added new introductory section for Llama Stack and its capabilities.
- Reorganized steps into clearer subsections with proper markdown
headers.
- Replaced dropdowns with tabbed content for OS-specific instructions.
- Added detailed steps for setting up and running the Llama Stack server
and client.
- Introduced new sections for running basic inference and building
agents.
- Enhanced readability and visual structure with emojis, admonitions,
and examples.
- `docs/source/providers/index.md`
- Updated the list of LLM inference providers to include "Ollama."
- Expanded the list of vector databases to include "SQLite-Vec."
Let me know if you need further details!
## Test Plan
Renders locally, included screenshot.
# Documentation
For https://github.com/meta-llama/llama-stack/issues/1818
<img width="1332" alt="Screenshot 2025-04-09 at 11 07 12 AM"
src="https://github.com/user-attachments/assets/c106efb9-076c-4059-a4e0-a30fa738585b"
/>
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
The goal of this PR is to make the pages easier to navigate by surfacing
the child pages on the navbar, updating some of the copy, moving some of
the files around.
Some changes:
1. Clarifying Titles
2. Restructuring "Distributions" more formally in its own page to be
consistent with Providers and adding some clarity to the child pages to
surface them and make them easier to navigate
3. Updated sphinx config to not collapse navigation by default
4. Updated copyright year to be calculated dynamically
5. Moved `docs/source/distributions/index.md` ->
`docs/source/distributions/starting_llama_stack_server.md`
Another for https://github.com/meta-llama/llama-stack/issues/1815
## Test Plan
Tested locally and pages build (screen shots for example).
## Documentation
### Before:

### After:

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Create a script for running all client-sdk tests on Async Library
client, with the option to generate report
## Test Plan
```
python llama_stack/scripts/run_client_sdk_tests.py --templates together fireworks --report
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.