mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-12-27 19:12:00 +00:00
Unrestricted API usage can lead to runaway costs and fragmented client-side
throttling logic. This commit introduces a built-in quota mechanism at the
server level, enabling operators to centrally enforce per-client and anonymous
rate limits—without needing external proxies or client changes.
This helps contain compute costs, enforces fair usage, and simplifies deployment
and monitoring of Llama Stack services. Quotas are fully opt-in and have no
effect unless explicitly configured.
Currently, SQLite is the only supported KV store. If quotas are
configured but authentication is disabled, authenticated limits will
gracefully fall back to anonymous limits.
Highlights:
- Adds `QuotaMiddleware` to enforce request quotas:
- Uses bearer token as client ID if present; otherwise falls back to IP address
- Tracks requests in KV store with per-key TTL expiration
- Returns HTTP 429 if a client exceeds their quota
- Extends `ServerConfig` with a `quota` section:
- `kvstore`: configuration for the backend (currently only SQLite)
- `anonymous_max_requests`: per-period cap for unauthenticated clients
- `authenticated_max_requests`: per-period cap for authenticated clients
- `period`: duration of the quota window (currently only `day` is supported)
- Adds full test coverage with FastAPI `TestClient` and custom middleware injection
Behavior changes:
- Quotas are disabled by default unless explicitly configured
- Anonymous users get a conservative default quota; authenticated clients can be given more generous limits
To enable per-client request quotas in `run.yaml`, add:
```yaml
server:
port: 8321
auth:
provider_type: custom
config:
endpoint: https://auth.example.com/validate
quota:
kvstore:
type: sqlite
db_path: ./quotas.db
anonymous_max_requests: 100
authenticated_max_requests: 1000
period: day
```
Signed-off-by: Wen Liang <wenliang@redhat.com>
353 lines
12 KiB
Markdown
353 lines
12 KiB
Markdown
# Configuring a "Stack"
|
|
|
|
The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:
|
|
|
|
```{dropdown} 👋 Click here for a Sample Configuration File
|
|
|
|
```yaml
|
|
version: 2
|
|
conda_env: ollama
|
|
apis:
|
|
- agents
|
|
- inference
|
|
- vector_io
|
|
- safety
|
|
- telemetry
|
|
providers:
|
|
inference:
|
|
- provider_id: ollama
|
|
provider_type: remote::ollama
|
|
config:
|
|
url: ${env.OLLAMA_URL:http://localhost:11434}
|
|
vector_io:
|
|
- provider_id: faiss
|
|
provider_type: inline::faiss
|
|
config:
|
|
kvstore:
|
|
type: sqlite
|
|
namespace: null
|
|
db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/faiss_store.db
|
|
safety:
|
|
- provider_id: llama-guard
|
|
provider_type: inline::llama-guard
|
|
config: {}
|
|
agents:
|
|
- provider_id: meta-reference
|
|
provider_type: inline::meta-reference
|
|
config:
|
|
persistence_store:
|
|
type: sqlite
|
|
namespace: null
|
|
db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/agents_store.db
|
|
telemetry:
|
|
- provider_id: meta-reference
|
|
provider_type: inline::meta-reference
|
|
config: {}
|
|
metadata_store:
|
|
namespace: null
|
|
type: sqlite
|
|
db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/registry.db
|
|
models:
|
|
- metadata: {}
|
|
model_id: ${env.INFERENCE_MODEL}
|
|
provider_id: ollama
|
|
provider_model_id: null
|
|
shields: []
|
|
server:
|
|
port: 8321
|
|
auth:
|
|
provider_type: "kubernetes"
|
|
config:
|
|
api_server_url: "https://kubernetes.default.svc"
|
|
ca_cert_path: "/path/to/ca.crt"
|
|
```
|
|
|
|
Let's break this down into the different sections. The first section specifies the set of APIs that the stack server will serve:
|
|
```yaml
|
|
apis:
|
|
- agents
|
|
- inference
|
|
- memory
|
|
- safety
|
|
- telemetry
|
|
```
|
|
|
|
## Providers
|
|
Next up is the most critical part: the set of providers that the stack will use to serve the above APIs. Consider the `inference` API:
|
|
```yaml
|
|
providers:
|
|
inference:
|
|
# provider_id is a string you can choose freely
|
|
- provider_id: ollama
|
|
# provider_type is a string that specifies the type of provider.
|
|
# in this case, the provider for inference is ollama and it is run remotely (outside of the distribution)
|
|
provider_type: remote::ollama
|
|
# config is a dictionary that contains the configuration for the provider.
|
|
# in this case, the configuration is the url of the ollama server
|
|
config:
|
|
url: ${env.OLLAMA_URL:http://localhost:11434}
|
|
```
|
|
A few things to note:
|
|
- A _provider instance_ is identified with an (id, type, configuration) triplet.
|
|
- The id is a string you can choose freely.
|
|
- You can instantiate any number of provider instances of the same type.
|
|
- The configuration dictionary is provider-specific.
|
|
- Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
|
|
|
|
## Resources
|
|
|
|
Finally, let's look at the `models` section:
|
|
|
|
```yaml
|
|
models:
|
|
- metadata: {}
|
|
model_id: ${env.INFERENCE_MODEL}
|
|
provider_id: ollama
|
|
provider_model_id: null
|
|
```
|
|
A Model is an instance of a "Resource" (see [Concepts](../concepts/index)) and is associated with a specific inference provider (in this case, the provider with identifier `ollama`). This is an instance of a "pre-registered" model. While we always encourage the clients to always register models before using them, some Stack servers may come up a list of "already known and available" models.
|
|
|
|
What's with the `provider_model_id` field? This is an identifier for the model inside the provider's model catalog. Contrast it with `model_id` which is the identifier for the same model for Llama Stack's purposes. For example, you may want to name "llama3.2:vision-11b" as "image_captioning_model" when you use it in your Stack interactions. When omitted, the server will set `provider_model_id` to be the same as `model_id`.
|
|
|
|
## Server Configuration
|
|
|
|
The `server` section configures the HTTP server that serves the Llama Stack APIs:
|
|
|
|
```yaml
|
|
server:
|
|
port: 8321 # Port to listen on (default: 8321)
|
|
tls_certfile: "/path/to/cert.pem" # Optional: Path to TLS certificate for HTTPS
|
|
tls_keyfile: "/path/to/key.pem" # Optional: Path to TLS key for HTTPS
|
|
auth: # Optional: Authentication configuration
|
|
provider_type: "kubernetes" # Type of auth provider
|
|
config: # Provider-specific configuration
|
|
api_server_url: "https://kubernetes.default.svc"
|
|
ca_cert_path: "/path/to/ca.crt" # Optional: Path to CA certificate
|
|
```
|
|
|
|
### Authentication Configuration
|
|
|
|
The `auth` section configures authentication for the server. When configured, all API requests must include a valid Bearer token in the Authorization header:
|
|
|
|
```
|
|
Authorization: Bearer <token>
|
|
```
|
|
|
|
The server supports multiple authentication providers:
|
|
|
|
#### Kubernetes Provider
|
|
|
|
The Kubernetes cluster must be configured to use a service account for authentication.
|
|
|
|
```bash
|
|
kubectl create namespace llama-stack
|
|
kubectl create serviceaccount llama-stack-auth -n llama-stack
|
|
kubectl create rolebinding llama-stack-auth-rolebinding --clusterrole=admin --serviceaccount=llama-stack:llama-stack-auth -n llama-stack
|
|
kubectl create token llama-stack-auth -n llama-stack > llama-stack-auth-token
|
|
```
|
|
|
|
Validates tokens against the Kubernetes API server:
|
|
```yaml
|
|
server:
|
|
auth:
|
|
provider_type: "kubernetes"
|
|
config:
|
|
api_server_url: "https://kubernetes.default.svc" # URL of the Kubernetes API server
|
|
ca_cert_path: "/path/to/ca.crt" # Optional: Path to CA certificate
|
|
```
|
|
|
|
The provider extracts user information from the JWT token:
|
|
- Username from the `sub` claim becomes a role
|
|
- Kubernetes groups become teams
|
|
|
|
You can easily validate a request by running:
|
|
|
|
```bash
|
|
curl -s -L -H "Authorization: Bearer $(cat llama-stack-auth-token)" http://127.0.0.1:8321/v1/providers
|
|
```
|
|
|
|
#### Custom Provider
|
|
Validates tokens against a custom authentication endpoint:
|
|
```yaml
|
|
server:
|
|
auth:
|
|
provider_type: "custom"
|
|
config:
|
|
endpoint: "https://auth.example.com/validate" # URL of the auth endpoint
|
|
```
|
|
|
|
The custom endpoint receives a POST request with:
|
|
```json
|
|
{
|
|
"api_key": "<token>",
|
|
"request": {
|
|
"path": "/api/v1/endpoint",
|
|
"headers": {
|
|
"content-type": "application/json",
|
|
"user-agent": "curl/7.64.1"
|
|
},
|
|
"params": {
|
|
"key": ["value"]
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
And must respond with:
|
|
```json
|
|
{
|
|
"access_attributes": {
|
|
"roles": ["admin", "user"],
|
|
"teams": ["ml-team", "nlp-team"],
|
|
"projects": ["llama-3", "project-x"],
|
|
"namespaces": ["research"]
|
|
},
|
|
"message": "Authentication successful"
|
|
}
|
|
```
|
|
|
|
If no access attributes are returned, the token is used as a namespace.
|
|
|
|
### Quota Configuration
|
|
|
|
The `quota` section allows you to enable server-side request throttling for both
|
|
authenticated and anonymous clients. This is useful for preventing abuse, enforcing
|
|
fairness across tenants, and controlling infrastructure costs without requiring
|
|
client-side rate limiting or external proxies.
|
|
|
|
Quotas are disabled by default. When enabled, each client is tracked using either:
|
|
|
|
* Their authenticated `client_id` (derived from the Bearer token), or
|
|
* Their IP address (fallback for anonymous requests)
|
|
|
|
Quota state is stored in a SQLite-backed key-value store, and rate limits are applied
|
|
within a configurable time window (currently only `day` is supported).
|
|
|
|
#### Example
|
|
|
|
```yaml
|
|
server:
|
|
quota:
|
|
kvstore:
|
|
type: sqlite
|
|
db_path: ./quotas.db
|
|
anonymous_max_requests: 100
|
|
authenticated_max_requests: 1000
|
|
period: day
|
|
```
|
|
|
|
#### Configuration Options
|
|
|
|
| Field | Description |
|
|
| ---------------------------- | -------------------------------------------------------------------------- |
|
|
| `kvstore` | Required. Backend storage config for tracking request counts. |
|
|
| `kvstore.type` | Must be `"sqlite"` for now. Other backends may be supported in the future. |
|
|
| `kvstore.db_path` | File path to the SQLite database. |
|
|
| `anonymous_max_requests` | Max requests per period for unauthenticated clients. |
|
|
| `authenticated_max_requests` | Max requests per period for authenticated clients. |
|
|
| `period` | Time window for quota enforcement. Only `"day"` is supported. |
|
|
|
|
> Note: if `authenticated_max_requests` is set but no authentication provider is
|
|
configured, the server will fall back to applying `anonymous_max_requests` to all
|
|
clients.
|
|
|
|
#### Example with Authentication Enabled
|
|
|
|
```yaml
|
|
server:
|
|
port: 8321
|
|
auth:
|
|
provider_type: custom
|
|
config:
|
|
endpoint: https://auth.example.com/validate
|
|
quota:
|
|
kvstore:
|
|
type: sqlite
|
|
db_path: ./quotas.db
|
|
anonymous_max_requests: 100
|
|
authenticated_max_requests: 1000
|
|
period: day
|
|
```
|
|
|
|
If a client exceeds their limit, the server responds with:
|
|
|
|
```http
|
|
HTTP/1.1 429 Too Many Requests
|
|
Content-Type: application/json
|
|
|
|
{
|
|
"error": {
|
|
"message": "Quota exceeded"
|
|
}
|
|
}
|
|
```
|
|
|
|
## Extending to handle Safety
|
|
|
|
Configuring Safety can be a little involved so it is instructive to go through an example.
|
|
|
|
The Safety API works with the associated Resource called a `Shield`. Providers can support various kinds of Shields. Good examples include the [Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/) system-safety models, or [Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/).
|
|
|
|
To configure a Bedrock Shield, you would need to add:
|
|
- A Safety API provider instance with type `remote::bedrock`
|
|
- A Shield resource served by this provider.
|
|
|
|
```yaml
|
|
...
|
|
providers:
|
|
safety:
|
|
- provider_id: bedrock
|
|
provider_type: remote::bedrock
|
|
config:
|
|
aws_access_key_id: ${env.AWS_ACCESS_KEY_ID}
|
|
aws_secret_access_key: ${env.AWS_SECRET_ACCESS_KEY}
|
|
...
|
|
shields:
|
|
- provider_id: bedrock
|
|
params:
|
|
guardrailVersion: ${env.GUARDRAIL_VERSION}
|
|
provider_shield_id: ${env.GUARDRAIL_ID}
|
|
...
|
|
```
|
|
|
|
The situation is more involved if the Shield needs _Inference_ of an associated model. This is the case with Llama Guard. In that case, you would need to add:
|
|
- A Safety API provider instance with type `inline::llama-guard`
|
|
- An Inference API provider instance for serving the model.
|
|
- A Model resource associated with this provider.
|
|
- A Shield resource served by the Safety provider.
|
|
|
|
The yaml configuration for this setup, assuming you were using vLLM as your inference server, would look like:
|
|
```yaml
|
|
...
|
|
providers:
|
|
safety:
|
|
- provider_id: llama-guard
|
|
provider_type: inline::llama-guard
|
|
config: {}
|
|
inference:
|
|
# this vLLM server serves the "normal" inference model (e.g., llama3.2:3b)
|
|
- provider_id: vllm-0
|
|
provider_type: remote::vllm
|
|
config:
|
|
url: ${env.VLLM_URL:http://localhost:8000}
|
|
# this vLLM server serves the llama-guard model (e.g., llama-guard:3b)
|
|
- provider_id: vllm-1
|
|
provider_type: remote::vllm
|
|
config:
|
|
url: ${env.SAFETY_VLLM_URL:http://localhost:8001}
|
|
...
|
|
models:
|
|
- metadata: {}
|
|
model_id: ${env.INFERENCE_MODEL}
|
|
provider_id: vllm-0
|
|
provider_model_id: null
|
|
- metadata: {}
|
|
model_id: ${env.SAFETY_MODEL}
|
|
provider_id: vllm-1
|
|
provider_model_id: null
|
|
shields:
|
|
- provider_id: llama-guard
|
|
shield_id: ${env.SAFETY_MODEL} # Llama Guard shields are identified by the corresponding LlamaGuard model
|
|
provider_shield_id: null
|
|
...
|
|
```
|