Merge branch 'main' into elasticsearch-integration

Enrico Zimuel 2025-11-10 15:31:16 +01:00 committed by GitHub
commit 3d031227e6


@ -10,7 +10,7 @@ import TabItem from '@theme/TabItem';
# Kubernetes Deployment Guide # Kubernetes Deployment Guide
Deploy Llama Stack and vLLM servers in a Kubernetes cluster instead of running them locally. This guide covers both local development with Kind and production deployment on AWS EKS. Deploy Llama Stack and vLLM servers in a Kubernetes cluster instead of running them locally. This guide covers deployment using the Kubernetes operator to manage the Llama Stack server with Kind. The vLLM inference server is deployed manually.
## Prerequisites ## Prerequisites
@ -110,115 +110,176 @@ spec:
EOF EOF
``` ```
### Step 3: Configure Llama Stack ### Step 3: Install Kubernetes Operator
Update your run configuration: Install the Llama Stack Kubernetes operator to manage Llama Stack deployments:
```yaml
providers:
inference:
- provider_id: vllm
provider_type: remote::vllm
config:
url: http://vllm-server.default.svc.cluster.local:8000/v1
max_tokens: 4096
api_token: fake
```
Build container image:
```bash ```bash
tmp_dir=$(mktemp -d) && cat >$tmp_dir/Containerfile.llama-stack-run-k8s <<EOF # Install from the latest main branch
FROM distribution-myenv:dev kubectl apply -f https://raw.githubusercontent.com/llamastack/llama-stack-k8s-operator/main/release/operator.yaml
RUN apt-get update && apt-get install -y git
RUN git clone https://github.com/meta-llama/llama-stack.git /app/llama-stack-source # Or install a specific version (e.g., v0.4.0)
ADD ./vllm-llama-stack-run-k8s.yaml /app/config.yaml # kubectl apply -f https://raw.githubusercontent.com/llamastack/llama-stack-k8s-operator/v0.4.0/release/operator.yaml
EOF
podman build -f $tmp_dir/Containerfile.llama-stack-run-k8s -t llama-stack-run-k8s $tmp_dir
``` ```
### Step 4: Deploy Llama Stack Server Verify the operator is running:
```bash
kubectl get pods -n llama-stack-operator-system
```
For more information about the operator, see the [llama-stack-k8s-operator repository](https://github.com/llamastack/llama-stack-k8s-operator).
### Step 4: Deploy Llama Stack Server using Operator
Create a `LlamaStackDistribution` custom resource to deploy the Llama Stack server. The operator will automatically create the necessary Deployment, Service, and other resources.
You can optionally override the default `run.yaml` using `spec.server.userConfig` with a ConfigMap (see [userConfig spec](https://github.com/llamastack/llama-stack-k8s-operator/blob/main/docs/api-overview.md#userconfigspec)).
```yaml ```yaml
cat <<EOF | kubectl apply -f - cat <<EOF | kubectl apply -f -
apiVersion: v1 apiVersion: llamastack.io/v1alpha1
kind: PersistentVolumeClaim kind: LlamaStackDistribution
metadata: metadata:
name: llama-pvc name: llamastack-vllm
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: llama-stack-server
spec: spec:
replicas: 1 replicas: 1
selector: server:
matchLabels: distribution:
app.kubernetes.io/name: llama-stack name: starter
template: containerSpec:
metadata: port: 8321
labels: env:
app.kubernetes.io/name: llama-stack - name: VLLM_URL
spec: value: "http://vllm-server.default.svc.cluster.local:8000/v1"
containers: - name: VLLM_MAX_TOKENS
- name: llama-stack value: "4096"
image: localhost/llama-stack-run-k8s:latest - name: VLLM_API_TOKEN
imagePullPolicy: IfNotPresent value: "fake"
command: ["llama", "stack", "run", "/app/config.yaml"] # Optional: override run.yaml from a ConfigMap using userConfig
ports: userConfig:
- containerPort: 5000 configMap:
volumeMounts: name: llama-stack-config
- name: llama-storage storage:
mountPath: /root/.llama size: "20Gi"
volumes: mountPath: "/home/lls/.lls"
- name: llama-storage
persistentVolumeClaim:
claimName: llama-pvc
---
apiVersion: v1
kind: Service
metadata:
name: llama-stack-service
spec:
selector:
app.kubernetes.io/name: llama-stack
ports:
- protocol: TCP
port: 5000
targetPort: 5000
type: ClusterIP
EOF EOF
``` ```
**Configuration Options:**
- `replicas`: Number of Llama Stack server instances to run
- `server.distribution.name`: The distribution to use (e.g., `starter` for the starter distribution). See the [list of supported distributions](https://github.com/llamastack/llama-stack-k8s-operator/blob/main/distributions.json) in the operator repository.
- `server.distribution.image`: (Optional) Custom container image for non-supported distributions. Use this field when deploying a distribution that is not in the supported list. If specified, this takes precedence over `name`.
- `server.containerSpec.port`: Port on which the Llama Stack server listens (default: 8321)
- `server.containerSpec.env`: Environment variables used to configure providers (for example, `VLLM_URL`, `VLLM_MAX_TOKENS`, and `VLLM_API_TOKEN` in the manifest above)
- `server.userConfig`: (Optional) Override the default `run.yaml` using a ConfigMap. See [userConfig spec](https://github.com/llamastack/llama-stack-k8s-operator/blob/main/docs/api-overview.md#userconfigspec).
- `server.storage.size`: Size of the persistent volume for model and data storage
- `server.storage.mountPath`: Where to mount the storage in the container
**Note:** For a complete list of supported distributions, see [distributions.json](https://github.com/llamastack/llama-stack-k8s-operator/blob/main/distributions.json) in the operator repository. To use a custom or non-supported distribution, set the `server.distribution.image` field with your container image instead of `server.distribution.name`.
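The optional `server.userConfig` setting refers to a ConfigMap named `llama-stack-config`. As a minimal sketch, the hypothetical helper below renders such a ConfigMap. The function name `render_run_config_map` and the `run.yaml` data key are assumptions, not confirmed by this guide; check the userConfig spec for the key the operator expects. The provider values are taken from the run configuration that this change removes, and the embedded `run.yaml` is only a fragment, not a complete configuration.

```shell
# Hypothetical helper: print a ConfigMap manifest that carries a run.yaml override.
# Assumption: the operator reads the configuration from the "run.yaml" data key.
render_run_config_map() {
  cm_name=${1:-llama-stack-config}
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: ${cm_name}
data:
  run.yaml: |
    # Fragment only: a real run.yaml needs the distribution's full configuration.
    providers:
      inference:
      - provider_id: vllm
        provider_type: remote::vllm
        config:
          url: http://vllm-server.default.svc.cluster.local:8000/v1
          max_tokens: 4096
          api_token: fake
EOF
}

# Usage (requires a cluster, so it is left commented out):
# render_run_config_map | kubectl apply -f -
```

Rendering to standard output first makes it possible to review the manifest before piping it to `kubectl apply`.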
The operator automatically creates:
- A Deployment for the Llama Stack server
- A Service to access the server
- A PersistentVolumeClaim for storage
- All necessary RBAC resources
Check the status of your deployment:
```bash
kubectl get llamastackdistribution
kubectl describe llamastackdistribution llamastack-vllm
```
### Step 5: Test Deployment ### Step 5: Test Deployment
Wait for the Llama Stack server pod to be ready:
```bash ```bash
# Port forward and test # Check the status of the LlamaStackDistribution
kubectl port-forward service/llama-stack-service 5000:5000 kubectl get llamastackdistribution llamastack-vllm
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
# Check the pods created by the operator
kubectl get pods -l app.kubernetes.io/name=llama-stack
# Wait for the pod to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=llama-stack --timeout=300s
```
Get the service name created by the operator (it typically follows the pattern `<llamastackdistribution-name>-service`):
```bash
# List services to find the service name
kubectl get services | grep llamastack
# Port forward (replace llamastack-vllm-service with the actual service name if it differs)
kubectl port-forward service/llamastack-vllm-service 8321:8321
```
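The naming pattern mentioned above can be captured in a small helper, sketched here under the assumption that the operator really does name the Service `<llamastackdistribution-name>-service`. Both function names are hypothetical; if `kubectl get services` shows a different name, use that name directly.

```shell
# Hypothetical helper: derive the Service name from the LlamaStackDistribution name.
# Assumption: the operator follows the "<name>-service" pattern.
llamastack_service_name() {
  printf '%s-service\n' "$1"
}

# Hypothetical helper: port-forward to the derived Service (default port 8321).
llamastack_port_forward() {
  dist_name=$1
  port=${2:-8321}
  kubectl port-forward "service/$(llamastack_service_name "$dist_name")" "${port}:${port}"
}

# Usage (requires a cluster, so it is left commented out):
# llamastack_port_forward llamastack-vllm
```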
In another terminal, test the deployment:
```bash
llama-stack-client --endpoint http://localhost:8321 inference chat-completion --message "hello, what model are you?"
``` ```
## Troubleshooting ## Troubleshooting
**Check pod status:** ### vLLM Server Issues
**Check vLLM pod status:**
```bash ```bash
kubectl get pods -l app.kubernetes.io/name=vllm kubectl get pods -l app.kubernetes.io/name=vllm
kubectl logs -l app.kubernetes.io/name=vllm kubectl logs -l app.kubernetes.io/name=vllm
``` ```
**Test service connectivity:** **Test vLLM service connectivity:**
```bash ```bash
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://vllm-server:8000/v1/models kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://vllm-server:8000/v1/models
``` ```
### Llama Stack Server Issues
**Check LlamaStackDistribution status:**
```bash
# Get detailed status
kubectl describe llamastackdistribution llamastack-vllm
# Check for events
kubectl get events --sort-by='.lastTimestamp' | grep llamastack-vllm
```
**Check operator-managed pods:**
```bash
# List all pods managed by the operator
kubectl get pods -l app.kubernetes.io/name=llama-stack
# Check logs for all pods matching the label selector
kubectl logs -l app.kubernetes.io/name=llama-stack
```
**Check operator status:**
```bash
# Verify the operator is running
kubectl get pods -n llama-stack-operator-system
# Check operator logs if issues persist
kubectl logs -n llama-stack-operator-system -l control-plane=controller-manager
```
**Verify service connectivity:**
```bash
# Get the service endpoint
kubectl get svc llamastack-vllm-service
# Test connectivity from within the cluster
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- curl http://llamastack-vllm-service:8321/health
```
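Connectivity checks can fail while the pod is still starting. The sketch below is a generic retry wrapper, not part of the operator or of `kubectl`; the name `retry_until` is hypothetical. It runs a command up to a fixed number of times and stops at the first success.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt fails.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only when another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage with an active port-forward (left commented out because it needs a cluster):
# retry_until 30 2 curl -fsS http://localhost:8321/health
```

The `/health` path in the usage line is copied from the connectivity check above.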
## Related Resources ## Related Resources
- **[Deployment Overview](/docs/deploying/)** - Overview of deployment options - **[Deployment Overview](/docs/deploying/)** - Overview of deployment options
- **[Distributions](/docs/distributions)** - Understanding Llama Stack distributions - **[Distributions](/docs/distributions)** - Understanding Llama Stack distributions
- **[Configuration](/docs/distributions/configuration)** - Detailed configuration options - **[Configuration](/docs/distributions/configuration)** - Detailed configuration options
- **[LlamaStack Operator](https://github.com/llamastack/llama-stack-k8s-operator)** - Overview of the Llama Stack Kubernetes operator
- **[LlamaStackDistribution](https://github.com/llamastack/llama-stack-k8s-operator/blob/main/docs/api-overview.md)** - API specification of the `LlamaStackDistribution` custom resource