add readme

Kai Wu 2025-08-03 14:35:45 -07:00
parent dcc47c2008
commit 62c758932d
3 changed files with 209 additions and 5 deletions

@@ -0,0 +1,206 @@
# Llama Stack Kubernetes Deployment Guide
This guide explains how to deploy Llama Stack on Kubernetes using the files in this directory.
## Prerequisites
Before you begin, ensure you have:
- A Kubernetes cluster up and running
- `kubectl` installed and configured to access your cluster
- `envsubst` command available (part of the `gettext` package)
- Hugging Face API token (required for downloading models)
- NVIDIA NGC API key (required for NIM models)
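A quick sanity check for the tooling prerequisites (a minimal sketch; adjust for your shell):
```bash
# Verify the required tools and credentials before deploying
kubectl version --client
command -v envsubst >/dev/null || echo "envsubst missing: install the gettext package"
[ -n "$HF_TOKEN" ] || echo "HF_TOKEN is not set"
[ -n "$NGC_API_KEY" ] || echo "NGC_API_KEY is not set"
```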
For cluster setup:
1. Install the NVIDIA device plugin so the cluster can schedule GPUs:
```bash
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
```
2. Install Prometheus and Grafana for GPU monitoring by following [this guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html).
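After the device plugin is running, GPU capacity should be visible on the nodes (a quick check; output varies by cluster):
```bash
# The allocatable GPU count should be non-zero on GPU nodes
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```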
## Environment Setup
The deployment requires several environment variables to be set:
```bash
# Required environment variables
export HF_TOKEN=your_hugging_face_token # Required for vLLM to download models
export NGC_API_KEY=your_ngc_api_key # Required for NIM to download models
# Optional environment variables with defaults
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct # Default inference model
export CODE_MODEL=bigcode/starcoder2-7b # Default code model
export OLLAMA_MODEL=llama-guard3:1b # Default safety model
export USE_EBS=false # Use EBS storage (true/false)
export TAVILY_SEARCH_API_KEY=your_tavily_api_key # Optional for search functionality
```
## Deployment Steps
1. **Clone the repository** (if you haven't already):
```bash
git clone https://github.com/meta-llama/llama-stack.git
cd llama-stack
git checkout k8s_demo
cd docs/source/distributions/k8s
```
2. **Deploy the stack**:
```bash
export NGC_API_KEY=your_ngc_api_key
export HF_TOKEN=your_hugging_face_token
./apply.sh
```
The deployment process:
1. Creates Kubernetes secrets for authentication
2. Deploys all components:
- vLLM server (inference)
- Ollama safety service
- Llama NIM (code model)
- PostgreSQL database
- Chroma vector database
- Llama Stack server
- UI service
- Ingress configuration
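For reference, the secret-creation step is roughly equivalent to the commands below; the secret names here are illustrative, and the authoritative ones are defined in `apply.sh` and the templates:
```bash
# Illustrative names; check apply.sh for the names the templates expect
kubectl create secret generic hf-token-secret --from-literal=token="$HF_TOKEN"
kubectl create secret generic ngc-api-key-secret --from-literal=key="$NGC_API_KEY"
```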
## Storage Options
The deployment supports two storage options:
1. **EBS Storage** (persistent):
- Set `USE_EBS=true` for persistent storage
- Data will persist across pod restarts
- Requires EBS CSI driver in your cluster
2. **emptyDir Storage** (non-persistent):
- Default option (`USE_EBS=false`)
- Data will be lost when pods restart
- Useful for testing or when EBS is not available
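For example, to deploy with persistent EBS-backed storage:
```bash
# Requires the EBS CSI driver to be installed in the cluster
export USE_EBS=true
./apply.sh
```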
## Accessing the Services
After deployment, you can access the services:
1. **Check available service endpoints**:
```bash
kubectl get svc
kubectl get svc -n prometheus
```
2. **Port forward to access locally**:
- To access the UI at http://localhost:8322, do:
```bash
# The UI service name may differ in your cluster; check `kubectl get svc`
kubectl port-forward svc/llama-stack-ui-service 8322:8322
```
- To use the llama-stack endpoint at http://localhost:8321, do:
```bash
kubectl port-forward svc/llama-stack-service 8321:8321
```
- To check the grafana endpoint at http://localhost:31509, do:
```bash
kubectl port-forward svc/kube-prometheus-stack-1754164871-grafana 31509:80 -n prometheus
```
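With the llama-stack port-forward active, a quick smoke test (assuming the server exposes the standard `/v1/models` route):
```bash
curl -s http://localhost:8321/v1/models
```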
## Configuration
### Model Configuration
You can customize the models used by changing the environment variables in `apply.sh`:
```bash
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct # Change to your preferred model
export CODE_MODEL=bigcode/starcoder2-7b # Change to your preferred code model
export OLLAMA_MODEL=llama-guard3:1b # Change to your preferred safety model
```
### Stack Configuration
The stack configuration is defined in `stack_run_config.yaml`. This file configures:
- API providers
- Models
- Database connections
- Tool integrations
If you need to modify this configuration, edit the file before running `apply.sh`.
## Monitoring
The deployment includes Prometheus monitoring capabilities:
```bash
# Install Prometheus monitoring
./install-prometheus.sh
```
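To verify the monitoring stack is up (the `prometheus` namespace matches the one used above):
```bash
kubectl get pods -n prometheus
```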
## Cleanup
To remove all deployed resources:
```bash
./delete.sh
```
This will:
1. Delete all deployments, services, and configmaps
2. Remove persistent volume claims
3. Delete secrets
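To confirm nothing is left behind:
```bash
kubectl get deployments,svc,pvc,secrets
```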
## Troubleshooting
### Common Issues
1. **Secret creation fails**:
- Ensure your HF_TOKEN and NGC_API_KEY are correctly set
- Check for any existing secrets that might conflict
2. **Pods stuck in pending state**:
- Check if your cluster has enough resources
- For GPU-based deployments, ensure GPU nodes are available (see the diagnostic commands after this list)
3. **Models fail to download**:
- Verify your HF_TOKEN and NGC_API_KEY are valid
- Check pod logs for specific error messages:
```bash
kubectl logs -f deployment/vllm-server
kubectl logs -f deployment/llm-nim-code
```
4. **Services not accessible**:
- Verify all pods are running:
```bash
kubectl get pods
```
- Check service endpoints:
```bash
kubectl get endpoints
```
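For pods stuck in `Pending` (issue 2 above), the scheduler's events usually name the missing resource:
```bash
# Replace <pod-name> with a pending pod from `kubectl get pods`
kubectl describe pod <pod-name>
```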
### Viewing Logs
```bash
# View logs for specific components
kubectl logs -f deployment/llama-stack-server
kubectl logs -f deployment/vllm-server
kubectl logs -f deployment/llama-stack-ui
```
## Advanced Configuration
### Custom Resource Limits
You can modify the resource limits in the YAML template files before deployment:
- `vllm-k8s.yaml.template`: vLLM server resources
- `stack-k8s.yaml.template`: Llama Stack server resources
- `llama-nim.yaml.template`: NIM server resources
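A typical `resources` stanza in these templates looks like the sketch below (illustrative values only; GPU requests and limits must match, and memory should be tuned to your model):
```yaml
resources:
  requests:
    memory: "16Gi"
    nvidia.com/gpu: 1
  limits:
    memory: "32Gi"
    nvidia.com/gpu: 1
```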
## Additional Resources
- [Llama Stack Documentation](https://github.com/meta-llama/llama-stack)
- [vLLM Documentation](https://docs.vllm.ai/)
- [Kubernetes Documentation](https://kubernetes.io/docs/)

@@ -6,11 +6,7 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
-# Check if NGC_API_KEY is provided as argument
-if [ -n "$1" ]; then
-    export NGC_API_KEY=$1
-    echo "Using NGC API key provided as argument."
-fi
+# This script is used to apply the Kubernetes resources for the Llama Stack.
export POSTGRES_USER=llamastack
export POSTGRES_DB=llamastack

@@ -269,6 +269,8 @@ def tool_chat_page():
 if action and isinstance(action, dict):
     tool_name = action.get("tool_name")
     tool_params = action.get("tool_params")
+    if tool_name.endswith("_search"):
+        tool_name = "web_search"
     with st.expander(f'🛠 Action: Using tool "{tool_name}"', expanded=False):
         st.json(tool_params)