# Configuring a "Stack"

The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:

````{dropdown} 👋 Click here for a Sample Configuration File

```yaml
version: 2
conda_env: ollama
apis:
- agents
- inference
- vector_io
- safety
- telemetry
providers:
  inference:
  - provider_id: ollama
    provider_type: remote::ollama
    config:
      url: ${env.OLLAMA_URL:http://localhost:11434}
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/faiss_store.db
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/agents_store.db
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config: {}
metadata_store:
  namespace: null
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/registry.db
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  provider_model_id: null
shields: []
server:
  port: 8321
  auth:
    provider_type: "kubernetes"
    config:
      api_server_url: "https://kubernetes.default.svc"
      ca_cert_path: "/path/to/ca.crt"
```
````
Let's break this down into the different sections. The first section specifies the set of APIs that the stack server will serve:

```yaml
apis:
- agents
- inference
- vector_io
- safety
- telemetry
```
## Providers

Next up is the most critical part: the set of providers that the stack will use to serve the above APIs. Consider the `inference` API:

```yaml
providers:
  inference:
  # provider_id is a string you can choose freely
  - provider_id: ollama
    # provider_type is a string that specifies the type of provider.
    # in this case, the provider for inference is ollama and it is run remotely (outside of the distribution)
    provider_type: remote::ollama
    # config is a dictionary that contains the configuration for the provider.
    # in this case, the configuration is the url of the ollama server
    config:
      url: ${env.OLLAMA_URL:http://localhost:11434}
```

A few things to note:
- A _provider instance_ is identified with an (id, type, configuration) triplet.
- The id is a string you can choose freely.
- You can instantiate any number of provider instances of the same type.
- The configuration dictionary is provider-specific.
- Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
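To make the expansion rule concrete, here is a minimal sketch of how `${env.VAR:default}` substitution can behave. This is an illustration of the documented syntax only, not the actual Llama Stack implementation:

```python
import os
import re

# Matches ${env.VAR} or ${env.VAR:default} (simplified; the real parser may differ).
_ENV_PATTERN = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")

def expand_env_vars(value: str) -> str:
    """Replace ${env.VAR:default} references with the environment value or the default."""
    def _sub(match: re.Match) -> str:
        var, default = match.group(1), match.group(2)
        resolved = os.environ.get(var, default)
        if resolved is None:
            raise ValueError(f"Environment variable {var} is not set and has no default")
        return resolved
    return _ENV_PATTERN.sub(_sub, value)
```

With `OLLAMA_URL` unset, `expand_env_vars("${env.OLLAMA_URL:http://localhost:11434}")` falls back to the default; setting the variable (or passing `--env OLLAMA_URL=...`) overrides it.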
## Resources

Finally, let's look at the `models` section:

```yaml
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  provider_model_id: null
```

A Model is an instance of a "Resource" (see [Concepts](../concepts/index)) and is associated with a specific inference provider (in this case, the provider with identifier `ollama`). This is an instance of a "pre-registered" model. While we always encourage clients to register models before using them, some Stack servers may come up with a list of "already known and available" models.

What's with the `provider_model_id` field? This is an identifier for the model inside the provider's model catalog. Contrast it with `model_id`, which is the identifier for the same model for Llama Stack's purposes. For example, you may want to name "llama3.2:vision-11b" as "image_captioning_model" when you use it in your Stack interactions. When omitted, the server sets `provider_model_id` to be the same as `model_id`.
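For instance, pre-registering the aliased model from the paragraph above would look like this (an illustrative fragment built from the fields shown earlier):

```yaml
models:
- metadata: {}
  model_id: image_captioning_model        # the identifier your clients use
  provider_id: ollama
  provider_model_id: llama3.2:vision-11b  # the identifier in the provider's catalog
```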
## Server Configuration

The `server` section configures the HTTP server that serves the Llama Stack APIs:

```yaml
server:
  port: 8321  # Port to listen on (default: 8321)
  tls_certfile: "/path/to/cert.pem"  # Optional: Path to TLS certificate for HTTPS
  tls_keyfile: "/path/to/key.pem"   # Optional: Path to TLS key for HTTPS
  auth:  # Optional: Authentication configuration
    provider_type: "kubernetes"  # Type of auth provider
    config:  # Provider-specific configuration
      api_server_url: "https://kubernetes.default.svc"
      ca_cert_path: "/path/to/ca.crt"  # Optional: Path to CA certificate
```

### Authentication Configuration

The `auth` section configures authentication for the server. When configured, all API requests must include a valid Bearer token in the Authorization header:

```
Authorization: Bearer <token>
```
The server supports multiple authentication providers:

#### Kubernetes Provider

The Kubernetes cluster must be configured to use a service account for authentication:

```bash
kubectl create namespace llama-stack
kubectl create serviceaccount llama-stack-auth -n llama-stack
kubectl create rolebinding llama-stack-auth-rolebinding --clusterrole=admin --serviceaccount=llama-stack:llama-stack-auth -n llama-stack
kubectl create token llama-stack-auth -n llama-stack > llama-stack-auth-token
```

This provider validates tokens against the Kubernetes API server:

```yaml
server:
  auth:
    provider_type: "kubernetes"
    config:
      api_server_url: "https://kubernetes.default.svc"  # URL of the Kubernetes API server
      ca_cert_path: "/path/to/ca.crt"  # Optional: Path to CA certificate
```

The provider extracts user information from the JWT token:

- Username from the `sub` claim becomes a role
- Kubernetes groups become teams

You can easily validate a request by running:

```bash
curl -s -L -H "Authorization: Bearer $(cat llama-stack-auth-token)" http://127.0.0.1:8321/v1/providers
```
#### Custom Provider

Validates tokens against a custom authentication endpoint:

```yaml
server:
  auth:
    provider_type: "custom"
    config:
      endpoint: "https://auth.example.com/validate"  # URL of the auth endpoint
```

The custom endpoint receives a POST request with:

```json
{
  "api_key": "<token>",
  "request": {
    "path": "/api/v1/endpoint",
    "headers": {
      "content-type": "application/json",
      "user-agent": "curl/7.64.1"
    },
    "params": {
      "key": ["value"]
    }
  }
}
```

And must respond with:

```json
{
  "access_attributes": {
    "roles": ["admin", "user"],
    "teams": ["ml-team", "nlp-team"],
    "projects": ["llama-3", "project-x"],
    "namespaces": ["research"]
  },
  "message": "Authentication successful"
}
```

If no access attributes are returned, the token is used as a namespace.
### Quota Configuration

The `quota` section allows you to enable server-side request throttling for both authenticated and anonymous clients. This is useful for preventing abuse, enforcing fairness across tenants, and controlling infrastructure costs without requiring client-side rate limiting or external proxies.

Quotas are disabled by default. When enabled, each client is tracked using either:

* Their authenticated `client_id` (derived from the Bearer token), or
* Their IP address (fallback for anonymous requests)

Quota state is stored in a SQLite-backed key-value store, and rate limits are applied within a configurable time window (currently only `day` is supported).

#### Example

```yaml
server:
  quota:
    kvstore:
      type: sqlite
      db_path: ./quotas.db
    anonymous_max_requests: 100
    authenticated_max_requests: 1000
    period: day
```

#### Configuration Options

| Field                        | Description                                                                |
| ---------------------------- | -------------------------------------------------------------------------- |
| `kvstore`                    | Required. Backend storage config for tracking request counts.              |
| `kvstore.type`               | Must be `"sqlite"` for now. Other backends may be supported in the future. |
| `kvstore.db_path`            | File path to the SQLite database.                                          |
| `anonymous_max_requests`     | Max requests per period for unauthenticated clients.                       |
| `authenticated_max_requests` | Max requests per period for authenticated clients.                         |
| `period`                     | Time window for quota enforcement. Only `"day"` is supported.              |

> Note: if `authenticated_max_requests` is set but no authentication provider is
> configured, the server will fall back to applying `anonymous_max_requests` to all
> clients.

#### Example with Authentication Enabled

```yaml
server:
  port: 8321
  auth:
    provider_type: custom
    config:
      endpoint: https://auth.example.com/validate
  quota:
    kvstore:
      type: sqlite
      db_path: ./quotas.db
    anonymous_max_requests: 100
    authenticated_max_requests: 1000
    period: day
```

If a client exceeds their limit, the server responds with:

```http
HTTP/1.1 429 Too Many Requests
Content-Type: application/json

{
  "error": {
    "message": "Quota exceeded"
  }
}
```
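Under the hood, enforcement amounts to a per-client counter keyed by the current period. The following is a simplified sketch of that bookkeeping using the documented SQLite backend; the table name and key format are illustrative assumptions, not the server's actual schema:

```python
import sqlite3
from datetime import date

class QuotaCounter:
    """Per-client, per-day request counter backed by SQLite (illustrative schema)."""

    def __init__(self, db_path: str):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotas (key TEXT PRIMARY KEY, count INTEGER NOT NULL)"
        )

    def allow(self, client_id: str, max_requests: int) -> bool:
        """Count one request for client_id in today's bucket; False once over the limit."""
        key = f"{client_id}:{date.today().isoformat()}"  # one bucket per client per day
        self.conn.execute("INSERT OR IGNORE INTO quotas (key, count) VALUES (?, 0)", (key,))
        self.conn.execute("UPDATE quotas SET count = count + 1 WHERE key = ?", (key,))
        (count,) = self.conn.execute(
            "SELECT count FROM quotas WHERE key = ?", (key,)
        ).fetchone()
        self.conn.commit()
        return count <= max_requests
```

When `allow()` returns `False`, the middleware answers with the 429 response shown above. For anonymous clients, `client_id` would be the requester's IP address instead of the Bearer-token identity.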
## Extending to handle Safety

Configuring Safety can be a little involved so it is instructive to go through an example.

The Safety API works with the associated Resource called a `Shield`. Providers can support various kinds of Shields. Good examples include the [Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/) system-safety models, or [Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/).

To configure a Bedrock Shield, you would need to add:
- A Safety API provider instance with type `remote::bedrock`
- A Shield resource served by this provider.

```yaml
...
providers:
  safety:
  - provider_id: bedrock
    provider_type: remote::bedrock
    config:
      aws_access_key_id: ${env.AWS_ACCESS_KEY_ID}
      aws_secret_access_key: ${env.AWS_SECRET_ACCESS_KEY}
...
shields:
- provider_id: bedrock
  params:
    guardrailVersion: ${env.GUARDRAIL_VERSION}
  provider_shield_id: ${env.GUARDRAIL_ID}
...
```

The situation is more involved if the Shield needs _Inference_ of an associated model. This is the case with Llama Guard. In that case, you would need to add:
- A Safety API provider instance with type `inline::llama-guard`
- An Inference API provider instance for serving the model.
- A Model resource associated with this provider.
- A Shield resource served by the Safety provider.

The yaml configuration for this setup, assuming you were using vLLM as your inference server, would look like:

```yaml
...
providers:
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  inference:
  # this vLLM server serves the "normal" inference model (e.g., llama3.2:3b)
  - provider_id: vllm-0
    provider_type: remote::vllm
    config:
      url: ${env.VLLM_URL:http://localhost:8000}
  # this vLLM server serves the llama-guard model (e.g., llama-guard:3b)
  - provider_id: vllm-1
    provider_type: remote::vllm
    config:
      url: ${env.SAFETY_VLLM_URL:http://localhost:8001}
...
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: vllm-0
  provider_model_id: null
- metadata: {}
  model_id: ${env.SAFETY_MODEL}
  provider_id: vllm-1
  provider_model_id: null
shields:
- provider_id: llama-guard
  shield_id: ${env.SAFETY_MODEL}  # Llama Guard shields are identified by the corresponding Llama Guard model
  provider_shield_id: null
...
```