llama-stack/docs/source/distributions/configuration.md
Sébastien Han 79851d93aa
feat: Add Kubernetes authentication (#1778)
# What does this PR do?

This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:

- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage

The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints

## Test Plan

Set up a Kubernetes cluster using Kind or Minikube.

Run a server with:

```
server:
  port: 8321
  auth:
    provider_type: kubernetes
    config:
      api_server_url: http://url
      ca_cert_path: path/to/cert  # optional
```

Run:

```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```

Or replace "my-user" with your service account.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-28 22:24:58 +02:00


Configuring a "Stack"

The Llama Stack runtime configuration is specified as a YAML file. Here is a simplified version of an example configuration file for the Ollama distribution:


```yaml
version: 2
conda_env: ollama
apis:
- agents
- inference
- vector_io
- safety
- telemetry
providers:
  inference:
  - provider_id: ollama
    provider_type: remote::ollama
    config:
      url: ${env.OLLAMA_URL:http://localhost:11434}
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/faiss_store.db
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/agents_store.db
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config: {}
metadata_store:
  namespace: null
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/registry.db
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  provider_model_id: null
shields: []
server:
  port: 8321
  auth:
    provider_type: "kubernetes"
    config:
      api_server_url: "https://kubernetes.default.svc"
      ca_cert_path: "/path/to/ca.crt"
```

Let's break this down into the different sections. The first section specifies the set of APIs that the stack server will serve:

apis:
- agents
- inference
- vector_io
- safety
- telemetry

Providers

Next up is the most critical part: the set of providers that the stack will use to serve the above APIs. Consider the inference API:

providers:
  inference:
  # provider_id is a string you can choose freely
  - provider_id: ollama
    # provider_type is a string that specifies the type of provider.
    # in this case, the provider for inference is ollama and it is run remotely (outside of the distribution)
    provider_type: remote::ollama
    # config is a dictionary that contains the configuration for the provider.
    # in this case, the configuration is the url of the ollama server
    config:
      url: ${env.OLLAMA_URL:http://localhost:11434}

A few things to note:

  • A provider instance is identified with an (id, type, configuration) triplet.
  • The id is a string you can choose freely.
  • You can instantiate any number of provider instances of the same type.
  • The configuration dictionary is provider-specific.
  • Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via llama stack run), you can specify --env OLLAMA_URL=http://my-server:11434 to override the default value.
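
To make the `${env.NAME:default}` syntax concrete, here is a minimal Python sketch of how such placeholders could be expanded. It is an illustration only, not the Llama Stack implementation: the function name `expand_env`, the regular expression, and the choice to raise `KeyError` for a missing variable without a default are all assumptions.

```python
import os
import re

# Matches ${env.NAME} and ${env.NAME:default}. The default may itself contain
# colons (for example a URL), so it runs up to the closing brace.
_ENV_PATTERN = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def expand_env(value, env=None):
    """Replace ${env.NAME:default} placeholders in one string (hypothetical helper)."""
    env = os.environ if env is None else env

    def _substitute(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: an unset variable with no default is treated as an error.
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _ENV_PATTERN.sub(_substitute, value)
```

For example, `expand_env("${env.OLLAMA_URL:http://localhost:11434}", {})` falls back to the default URL, while passing `{"OLLAMA_URL": "http://my-server:11434"}` as the environment overrides it.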

Resources

Finally, let's look at the models section:

models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: ollama
  provider_model_id: null

A Model is an instance of a "Resource" (see Concepts) and is associated with a specific inference provider (in this case, the provider with identifier ollama). This is an instance of a "pre-registered" model. While we encourage clients to always register models before using them, some Stack servers may come up with a list of "already known and available" models.

What's with the provider_model_id field? This is an identifier for the model inside the provider's model catalog. Contrast it with model_id which is the identifier for the same model for Llama Stack's purposes. For example, you may want to name "llama3.2:vision-11b" as "image_captioning_model" when you use it in your Stack interactions. When omitted, the server will set provider_model_id to be the same as model_id.
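
The defaulting rule for provider_model_id can be sketched in a few lines of Python. The `ModelEntry` class and its method name are hypothetical and only mirror the fields shown in the models section; the real server types may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelEntry:
    """Hypothetical mirror of one entry in the `models` section."""

    model_id: str
    provider_id: str
    provider_model_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    def resolved_provider_model_id(self) -> str:
        # When provider_model_id is omitted (null), fall back to model_id.
        if self.provider_model_id is not None:
            return self.provider_model_id
        return self.model_id
```

With this sketch, an entry whose model_id is "image_captioning_model" and whose provider_model_id is "llama3.2:vision-11b" resolves to the provider's name, while an entry with provider_model_id left as null resolves to its own model_id.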

Server Configuration

The server section configures the HTTP server that serves the Llama Stack APIs:

server:
  port: 8321  # Port to listen on (default: 8321)
  tls_certfile: "/path/to/cert.pem"  # Optional: Path to TLS certificate for HTTPS
  tls_keyfile: "/path/to/key.pem"    # Optional: Path to TLS key for HTTPS
  auth:                              # Optional: Authentication configuration
    provider_type: "kubernetes"      # Type of auth provider
    config:                          # Provider-specific configuration
      api_server_url: "https://kubernetes.default.svc"
      ca_cert_path: "/path/to/ca.crt" # Optional: Path to CA certificate

Authentication Configuration

The auth section configures authentication for the server. When configured, all API requests must include a valid Bearer token in the Authorization header:

Authorization: Bearer <token>
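
From a Python client, the header can be attached with the standard library alone. This is a minimal sketch: the helper name `build_authenticated_request` is made up for illustration, and the URL in the usage note is simply the local endpoint used elsewhere on this page.

```python
import urllib.request


def build_authenticated_request(url: str, token: str) -> urllib.request.Request:
    """Build a request that carries the token as a Bearer credential."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

Passing the result to `urllib.request.urlopen`, for example with `build_authenticated_request("http://127.0.0.1:8321/v1/providers", token)`, sends the request to a running server.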

The server supports multiple authentication providers:

Kubernetes Provider

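The Kubernetes cluster must be configured to use a service account for authentication. For example, the following commands create a namespace, a service account, a role binding, and a token: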

kubectl create namespace llama-stack
kubectl create serviceaccount llama-stack-auth -n llama-stack
kubectl create rolebinding llama-stack-auth-rolebinding --clusterrole=admin --serviceaccount=llama-stack:llama-stack-auth -n llama-stack
kubectl create token llama-stack-auth -n llama-stack > llama-stack-auth-token

The Kubernetes provider validates tokens against the Kubernetes API server:

server:
  auth:
    provider_type: "kubernetes"
    config:
      api_server_url: "https://kubernetes.default.svc"  # URL of the Kubernetes API server
      ca_cert_path: "/path/to/ca.crt"                   # Optional: Path to CA certificate

The provider extracts user information from the JWT token:

  • Username from the sub claim becomes a role
  • Kubernetes groups become teams
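
The mapping above can be illustrated with a short Python sketch. It is not the provider's actual code: the function names are hypothetical, the `groups` claim name is an assumption, and the payload is decoded without verifying the signature, which is only acceptable for inspecting a token that the API server has already validated.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore the padding first.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def claims_to_access_attributes(claims: dict) -> dict:
    """Map claims to access attributes: sub becomes a role, groups become teams."""
    attributes = {"roles": [claims["sub"]]}
    groups = claims.get("groups", [])  # assumed claim name
    if groups:
        attributes["teams"] = list(groups)
    return attributes
```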

You can easily validate a request by running:

curl -s -L -H "Authorization: Bearer $(cat llama-stack-auth-token)" http://127.0.0.1:8321/v1/providers

Custom Provider

The custom provider validates tokens against a custom authentication endpoint:

server:
  auth:
    provider_type: "custom"
    config:
      endpoint: "https://auth.example.com/validate"  # URL of the auth endpoint

The custom endpoint receives a POST request with:

{
  "api_key": "<token>",
  "request": {
    "path": "/api/v1/endpoint",
    "headers": {
      "content-type": "application/json",
      "user-agent": "curl/7.64.1"
    },
    "params": {
      "key": ["value"]
    }
  }
}

And must respond with:

{
  "access_attributes": {
    "roles": ["admin", "user"],
    "teams": ["ml-team", "nlp-team"],
    "projects": ["llama-3", "project-x"],
    "namespaces": ["research"]
  },
  "message": "Authentication successful"
}

If no access attributes are returned, the token is used as a namespace.
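
The exchange with the custom endpoint can be sketched in Python as two small helpers: one assembles the request body shown above, the other applies the namespace fallback. Both function names are hypothetical, and the exact shape of the fallback attributes (a single-element `namespaces` list) is an assumption based on the sentence above.

```python
from typing import Dict, List


def build_validation_payload(
    token: str,
    path: str,
    headers: Dict[str, str],
    params: Dict[str, List[str]],
) -> dict:
    """Assemble the JSON body that is POSTed to the custom auth endpoint."""
    return {
        "api_key": token,
        "request": {"path": path, "headers": headers, "params": params},
    }


def resolve_access_attributes(token: str, response: dict) -> dict:
    """Use the endpoint's access attributes, or fall back to the token as a namespace."""
    attributes = response.get("access_attributes")
    if attributes:
        return attributes
    # Assumed fallback shape: the token becomes the only namespace.
    return {"namespaces": [token]}
```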

Extending to handle Safety

Configuring Safety can be a little involved so it is instructive to go through an example.

The Safety API works with the associated Resource called a Shield. Providers can support various kinds of Shields. Good examples include the Llama Guard system-safety models, or Bedrock Guardrails.

To configure a Bedrock Shield, you would need to add:

  • A Safety API provider instance with type remote::bedrock
  • A Shield resource served by this provider.

...
providers:
  safety:
  - provider_id: bedrock
    provider_type: remote::bedrock
    config:
      aws_access_key_id: ${env.AWS_ACCESS_KEY_ID}
      aws_secret_access_key: ${env.AWS_SECRET_ACCESS_KEY}
...
shields:
- provider_id: bedrock
  params:
    guardrailVersion: ${env.GUARDRAIL_VERSION}
  provider_shield_id: ${env.GUARDRAIL_ID}
...

The situation is more involved if the Shield needs inference from an associated model. This is the case with Llama Guard. In that case, you would need to add:

  • A Safety API provider instance with type inline::llama-guard
  • An Inference API provider instance for serving the model.
  • A Model resource associated with this provider.
  • A Shield resource served by the Safety provider.

The YAML configuration for this setup, assuming you were using vLLM as your inference server, would look like:

...
providers:
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  inference:
  # this vLLM server serves the "normal" inference model (e.g., llama3.2:3b)
  - provider_id: vllm-0
    provider_type: remote::vllm
    config:
      url: ${env.VLLM_URL:http://localhost:8000}
  # this vLLM server serves the llama-guard model (e.g., llama-guard:3b)
  - provider_id: vllm-1
    provider_type: remote::vllm
    config:
      url: ${env.SAFETY_VLLM_URL:http://localhost:8001}
...
models:
- metadata: {}
  model_id: ${env.INFERENCE_MODEL}
  provider_id: vllm-0
  provider_model_id: null
- metadata: {}
  model_id: ${env.SAFETY_MODEL}
  provider_id: vllm-1
  provider_model_id: null
shields:
- provider_id: llama-guard
  shield_id: ${env.SAFETY_MODEL}   # Llama Guard shields are identified by the corresponding LlamaGuard model
  provider_shield_id: null
...
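
Because this setup spans several sections, it is easy to leave a dangling reference. The following Python sketch checks a parsed configuration (a plain dict, for example as loaded by a YAML parser) for the relationships described above. The function `check_safety_config` is hypothetical and not part of Llama Stack; it assumes that every Llama Guard shield_id must match a registered model_id.

```python
from typing import List


def check_safety_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    # Map every declared provider_id to its provider_type, across all APIs.
    provider_types = {
        p["provider_id"]: p["provider_type"]
        for plist in config.get("providers", {}).values()
        for p in plist
    }
    model_ids = {m["model_id"] for m in config.get("models", [])}

    for model in config.get("models", []):
        if model["provider_id"] not in provider_types:
            problems.append(f"model {model['model_id']}: unknown provider {model['provider_id']}")

    for shield in config.get("shields", []):
        provider_id = shield["provider_id"]
        if provider_id not in provider_types:
            problems.append(f"shield: unknown provider {provider_id}")
            continue
        # Assumption: Llama Guard shields are identified by a registered model.
        if provider_types[provider_id] == "inline::llama-guard":
            if shield.get("shield_id") not in model_ids:
                problems.append(f"shield {shield.get('shield_id')}: no matching model")
    return problems
```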