diff --git a/docs/source/distributions/k8s/README.md b/docs/source/distributions/k8s/README.md
new file mode 100644
index 000000000..84f43c3f5
--- /dev/null
+++ b/docs/source/distributions/k8s/README.md
@@ -0,0 +1,206 @@
+# Llama Stack Kubernetes Deployment Guide
+
+This guide explains how to deploy Llama Stack on Kubernetes using the files in this directory.
+
+## Prerequisites
+
+Before you begin, ensure you have:
+
+- A Kubernetes cluster up and running
+- `kubectl` installed and configured to access your cluster
+- `envsubst` command available (part of the `gettext` package)
+- A Hugging Face API token (required for downloading models)
+- An NVIDIA NGC API key (required for NIM models)
+
+For the cluster setup:
+
+1. Install the NVIDIA device plugin so that GPUs are exposed to the Kubernetes scheduler:
+   ```bash
+   kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
+   ```
+2. Install Prometheus and Grafana for GPU monitoring by following [this guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html).
+
+## Environment Setup
+
+The deployment requires several environment variables to be set:
+
+```bash
+# Required environment variables
+export HF_TOKEN=your_hugging_face_token   # Required for vLLM to download models
+export NGC_API_KEY=your_ngc_api_key       # Required for NIM to download models
+
+# Optional environment variables with defaults
+export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct  # Default inference model
+export CODE_MODEL=bigcode/starcoder2-7b                  # Default code model
+export OLLAMA_MODEL=llama-guard3:1b                      # Default safety model
+export USE_EBS=false                                     # Use EBS storage (true/false)
+export TAVILY_SEARCH_API_KEY=your_tavily_api_key         # Optional, for search functionality
+```
+
+## Deployment Steps
+
+1. **Clone the repository** (if you haven't already):
+   ```bash
+   git clone https://github.com/meta-llama/llama-stack.git
+   cd llama-stack
+   git checkout k8s_demo
+   cd docs/source/distributions/k8s
+   ```
+
+2. **Deploy the stack**:
+   ```bash
+   export NGC_API_KEY=your_ngc_api_key
+   export HF_TOKEN=your_hugging_face_token
+   ./apply.sh
+   ```
+
+The deployment process:
+1. Creates Kubernetes secrets for authentication
+2. Deploys all components:
+   - vLLM server (inference)
+   - Ollama safety service
+   - Llama NIM (code model)
+   - PostgreSQL database
+   - Chroma vector database
+   - Llama Stack server
+   - UI service
+   - Ingress configuration
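+
+Model downloads can take several minutes after `./apply.sh` returns. As a minimal progress check (assuming the deployment names referenced later in this guide, such as `vllm-server` and `llama-stack-server`), you can watch the rollout with:
+
+```bash
+# Watch pods come up; image pulls and model downloads take a while
+kubectl get pods -w
+
+# Block until the main deployments report ready
+kubectl rollout status deployment/vllm-server
+kubectl rollout status deployment/llama-stack-server
+```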
+
+## Storage Options
+
+The deployment supports two storage options:
+
+1. **EBS Storage** (persistent):
+   - Set `USE_EBS=true` for persistent storage
+   - Data will persist across pod restarts
+   - Requires the EBS CSI driver in your cluster
+
+2. **emptyDir Storage** (non-persistent):
+   - Default option (`USE_EBS=false`)
+   - Data will be lost when pods restart
+   - Useful for testing or when EBS is not available
+
+## Accessing the Services
+
+After deployment, you can access the services as follows:
+
+1. **Check the available service endpoints**:
+   ```bash
+   kubectl get svc
+   kubectl get svc -n prometheus
+   ```
+
+2. **Port forward to access locally**:
+   - To access the UI at http://localhost:8322, forward the UI service (check `kubectl get svc` for its exact name):
+     ```bash
+     kubectl port-forward svc/<ui-service-name> 8322:8322
+     ```
+   - To use the Llama Stack endpoint at http://localhost:8321:
+     ```bash
+     kubectl port-forward svc/llama-stack-service 8321:8321
+     ```
+   - To check the Grafana endpoint at http://localhost:31509 (the service name includes a Helm release timestamp and will differ in your cluster):
+     ```bash
+     kubectl port-forward svc/kube-prometheus-stack-1754164871-grafana 31509:80 -n prometheus
+     ```
+
+## Configuration
+
+### Model Configuration
+
+You can customize the models used by changing the environment variables in `apply.sh`:
+
+```bash
+export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct  # Change to your preferred model
+export CODE_MODEL=bigcode/starcoder2-7b                  # Change to your preferred code model
+export OLLAMA_MODEL=llama-guard3:1b                      # Change to your preferred safety model
+```
+
+### Stack Configuration
+
+The stack configuration is defined in `stack_run_config.yaml`. This file configures:
+- API providers
+- Models
+- Database connections
+- Tool integrations
+
+If you need to modify this configuration, edit the file before running `apply.sh`.
+
+## Monitoring
+
+The deployment includes Prometheus monitoring capabilities:
+
+```bash
+# Install Prometheus monitoring
+./install-prometheus.sh
+```
+
+## Cleanup
+
+To remove all deployed resources:
+
+```bash
+./delete.sh
+```
+
+This will:
+1. Delete all deployments, services, and configmaps
+2. Remove persistent volume claims
+3. Delete secrets
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Secret creation fails**:
+   - Ensure your `HF_TOKEN` and `NGC_API_KEY` are correctly set
+   - Check for any existing secrets that might conflict
+
+2. **Pods stuck in a pending state**:
+   - Check whether your cluster has enough resources
+   - For GPU-based deployments, ensure GPU nodes are available
+
+3. **Models fail to download**:
+   - Verify that your `HF_TOKEN` and `NGC_API_KEY` are valid
+   - Check pod logs for specific error messages:
+     ```bash
+     kubectl logs -f deployment/vllm-server
+     kubectl logs -f deployment/llm-nim-code
+     ```
+
+4. **Services not accessible**:
+   - Verify all pods are running:
+     ```bash
+     kubectl get pods
+     ```
+   - Check service endpoints:
+     ```bash
+     kubectl get endpoints
+     ```
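+
+If a pod stays in `Pending` or `CrashLoopBackOff`, a few generic `kubectl` checks (not specific to this deployment; `<pod-name>` is a placeholder) usually narrow the cause down:
+
+```bash
+# Confirm the HF/NGC secrets were actually created
+kubectl get secrets
+
+# Inspect scheduling and image-pull errors for a stuck pod
+kubectl describe pod <pod-name>
+
+# Recent cluster events, most recent last
+kubectl get events --sort-by=.lastTimestamp | tail -20
+```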
+
+### Viewing Logs
+
+```bash
+# View logs for specific components
+kubectl logs -f deployment/llama-stack-server
+kubectl logs -f deployment/vllm-server
+kubectl logs -f deployment/llama-stack-ui
+```
+
+## Advanced Configuration
+
+### Custom Resource Limits
+
+You can modify the resource limits in the YAML template files before deployment:
+
+- `vllm-k8s.yaml.template`: vLLM server resources
+- `stack-k8s.yaml.template`: Llama Stack server resources
+- `llama-nim.yaml.template`: NIM server resources
+
+## Additional Resources
+
+- [Llama Stack Documentation](https://github.com/meta-llama/llama-stack)
+- [vLLM Documentation](https://docs.vllm.ai/)
+- [Kubernetes Documentation](https://kubernetes.io/docs/)
diff --git a/docs/source/distributions/k8s/apply.sh b/docs/source/distributions/k8s/apply.sh
index 4d193b496..d541c3d4b 100755
--- a/docs/source/distributions/k8s/apply.sh
+++ b/docs/source/distributions/k8s/apply.sh
@@ -6,11 +6,7 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-# Check if NGC_API_KEY is provided as argument
-if [ -n "$1" ]; then
-    export NGC_API_KEY=$1
-    echo "Using NGC API key provided as argument."
-fi
+# This script is used to apply the Kubernetes resources for the Llama Stack.
 
 export POSTGRES_USER=llamastack
 export POSTGRES_DB=llamastack
diff --git a/llama_stack/distribution/ui/page/playground/tools.py b/llama_stack/distribution/ui/page/playground/tools.py
index f7e918b32..42b63e567 100644
--- a/llama_stack/distribution/ui/page/playground/tools.py
+++ b/llama_stack/distribution/ui/page/playground/tools.py
@@ -269,6 +269,8 @@ def tool_chat_page():
         if action and isinstance(action, dict):
             tool_name = action.get("tool_name")
             tool_params = action.get("tool_params")
+            # Display provider-specific search tools (e.g. "tavily_search") under the generic "web_search" label
+            if tool_name and tool_name.endswith("_search"):
+                tool_name = "web_search"
             with st.expander(f'🛠 Action: Using tool "{tool_name}"', expanded=False):
                 st.json(tool_params)