mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-08-15 14:08:00 +00:00
add readme
This commit is contained in:
parent
dcc47c2008
commit
62c758932d
3 changed files with 209 additions and 5 deletions
206
docs/source/distributions/k8s/README.md
Normal file
206
docs/source/distributions/k8s/README.md
Normal file
|
@ -0,0 +1,206 @@
|
||||||
|
# Llama Stack Kubernetes Deployment Guide
|
||||||
|
|
||||||
|
This guide explains how to deploy Llama Stack on Kubernetes using the files in this directory.
|
||||||
|
|
||||||
|
## Prerequisites
|
||||||
|
|
||||||
|
Before you begin, ensure you have:
|
||||||
|
|
||||||
|
- A Kubernetes cluster up and running
|
||||||
|
- `kubectl` installed and configured to access your cluster
|
||||||
|
- `envsubst` command available (part of the `gettext` package)
|
||||||
|
- Hugging Face API token (required for downloading models)
|
||||||
|
- NVIDIA NGC API key (required for NIM models)
|
||||||
|
For the cluster setup, please do:
|
||||||
|
1. Install Kubernetes nvidia operator, this will enable the GPU features:
|
||||||
|
```bash
|
||||||
|
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
|
||||||
|
```
|
||||||
|
2. Install prometheus and grafana for gpu monitoring following [this guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/kube-prometheus.html).
|
||||||
|
|
||||||
|
## Environment Setup
|
||||||
|
|
||||||
|
The deployment requires several environment variables to be set:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Required environment variables
|
||||||
|
export HF_TOKEN=your_hugging_face_token # Required for vLLM to download models
|
||||||
|
export NGC_API_KEY=your_ngc_api_key # Required for NIM to download models
|
||||||
|
|
||||||
|
# Optional environment variables with defaults
|
||||||
|
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct # Default inference model
|
||||||
|
export CODE_MODEL=bigcode/starcoder2-7b # Default code model
|
||||||
|
export OLLAMA_MODEL=llama-guard3:1b # Default safety model
|
||||||
|
export USE_EBS=false # Use EBS storage (true/false)
|
||||||
|
export TAVILY_SEARCH_API_KEY=your_tavily_api_key # Optional for search functionality
|
||||||
|
```
|
||||||
|
|
||||||
|
## Deployment Steps
|
||||||
|
|
||||||
|
1. **Clone the repository** (if you haven't already):
|
||||||
|
```bash
|
||||||
|
git clone https://github.com/meta-llama/llama-stack.git
|
||||||
|
git checkout k8s_demo
|
||||||
|
cd llama-stack/docs/source/distributions/k8s
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Deploy the stack**:
|
||||||
|
```bash
|
||||||
|
export NGC_API_KEY=your_ngc_api_key
|
||||||
|
export HF_TOKEN=your_hugging_face_token
|
||||||
|
./apply.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
The deployment process:
|
||||||
|
1. Creates Kubernetes secrets for authentication
|
||||||
|
2. Deploys all components:
|
||||||
|
- vLLM server (inference)
|
||||||
|
- Ollama safety service
|
||||||
|
- Llama NIM (code model)
|
||||||
|
- PostgreSQL database
|
||||||
|
- Chroma vector database
|
||||||
|
- Llama Stack server
|
||||||
|
- UI service
|
||||||
|
- Ingress configuration
|
||||||
|
|
||||||
|
## Storage Options
|
||||||
|
|
||||||
|
The deployment supports two storage options:
|
||||||
|
|
||||||
|
1. **EBS Storage** (persistent):
|
||||||
|
- Set `USE_EBS=true` for persistent storage
|
||||||
|
- Data will persist across pod restarts
|
||||||
|
- Requires EBS CSI driver in your cluster
|
||||||
|
|
||||||
|
2. **emptyDir Storage** (non-persistent):
|
||||||
|
- Default option (`USE_EBS=false`)
|
||||||
|
- Data will be lost when pods restart
|
||||||
|
- Useful for testing or when EBS is not available
|
||||||
|
|
||||||
|
## Accessing the Services
|
||||||
|
|
||||||
|
After deployment, you can access the services:
|
||||||
|
|
||||||
|
1. **Check available service endpoint**:
|
||||||
|
```bash
|
||||||
|
kubectl get svc
|
||||||
|
kubectl get svc -n prometheus
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Port forward to access locally**:
|
||||||
|
- To access the UI at http://localhost:8322, do:
|
||||||
|
```bash
|
||||||
|
kubectl port-forward svc/llama-stack-service 8321:8321
|
||||||
|
```
|
||||||
|
- To use the llama-stack endpoint at http://localhost:8321, do:
|
||||||
|
```bash
|
||||||
|
kubectl port-forward svc/llama-stack-service 8321:8321 -n prometheus
|
||||||
|
```
|
||||||
|
- To check the grafana endpoint at http://localhost:31509, do:
|
||||||
|
```bash
|
||||||
|
kubectl port-forward svc/kube-prometheus-stack-1754164871-grafana 31509:80 -n prometheus
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### Model Configuration
|
||||||
|
|
||||||
|
You can customize the models used by change environment variables in `apply.sh`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct # Change to your preferred model
|
||||||
|
export CODE_MODEL=bigcode/starcoder2-7b # Change to your preferred code model
|
||||||
|
export OLLAMA_MODEL=llama-guard3:1b # Change to your preferred safety model
|
||||||
|
```
|
||||||
|
|
||||||
|
### Stack Configuration
|
||||||
|
|
||||||
|
The stack configuration is defined in `stack_run_config.yaml`. This file configures:
|
||||||
|
- API providers
|
||||||
|
- Models
|
||||||
|
- Database connections
|
||||||
|
- Tool integrations
|
||||||
|
|
||||||
|
If you need to modify this configuration, edit the file before running `apply.sh`.
|
||||||
|
|
||||||
|
## Monitoring
|
||||||
|
|
||||||
|
The deployment includes Prometheus monitoring capabilities:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Install Prometheus monitoring
|
||||||
|
./install-prometheus.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
## Cleanup
|
||||||
|
|
||||||
|
To remove all deployed resources:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./delete.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
This will:
|
||||||
|
1. Delete all deployments, services, and configmaps
|
||||||
|
2. Remove persistent volume claims
|
||||||
|
3. Delete secrets
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### Common Issues
|
||||||
|
|
||||||
|
1. **Secret creation fails**:
|
||||||
|
- Ensure your HF_TOKEN and NGC_API_KEY are correctly set
|
||||||
|
- Check for any existing secrets that might conflict
|
||||||
|
|
||||||
|
2. **Pods stuck in pending state**:
|
||||||
|
- Check if your cluster has enough resources
|
||||||
|
- For GPU-based deployments, ensure GPU nodes are available
|
||||||
|
|
||||||
|
3. **Models fail to download**:
|
||||||
|
- Verify your HF_TOKEN and NGC_API_KEY are valid
|
||||||
|
- Check pod logs for specific error messages:
|
||||||
|
```bash
|
||||||
|
kubectl logs -f deployment/vllm-server
|
||||||
|
kubectl logs -f deployment/llm-nim-code
|
||||||
|
```
|
||||||
|
|
||||||
|
4. **Services not accessible**:
|
||||||
|
- Verify all pods are running:
|
||||||
|
```bash
|
||||||
|
kubectl get pods
|
||||||
|
```
|
||||||
|
- Check service endpoints:
|
||||||
|
```bash
|
||||||
|
kubectl get endpoints
|
||||||
|
```
|
||||||
|
|
||||||
|
### Viewing Logs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# View logs for specific components
|
||||||
|
kubectl logs -f deployment/llama-stack-server
|
||||||
|
kubectl logs -f deployment/vllm-server
|
||||||
|
kubectl logs -f deployment/llama-stack-ui
|
||||||
|
```
|
||||||
|
|
||||||
|
## Advanced Configuration
|
||||||
|
|
||||||
|
### Custom Resource Limits
|
||||||
|
|
||||||
|
You can modify the resource limits in the YAML template files before deployment:
|
||||||
|
|
||||||
|
- `vllm-k8s.yaml.template`: vLLM server resources
|
||||||
|
- `stack-k8s.yaml.template`: Llama Stack server resources
|
||||||
|
- `llama-nim.yaml.template`: NIM server resources
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Additional Resources
|
||||||
|
|
||||||
|
- [Llama Stack Documentation](https://github.com/meta-llama/llama-stack)
|
||||||
|
- [vLLM Documentation](https://docs.vllm.ai/)
|
||||||
|
- [Kubernetes Documentation](https://kubernetes.io/docs/)
|
|
@ -6,11 +6,7 @@
|
||||||
# This source code is licensed under the terms described in the LICENSE file in
|
# This source code is licensed under the terms described in the LICENSE file in
|
||||||
# the root directory of this source tree.
|
# the root directory of this source tree.
|
||||||
|
|
||||||
# Check if NGC_API_KEY is provided as argument
|
# This script is used to apply the Kubernetes resources for the Llama Stack.
|
||||||
if [ -n "$1" ]; then
|
|
||||||
export NGC_API_KEY=$1
|
|
||||||
echo "Using NGC API key provided as argument."
|
|
||||||
fi
|
|
||||||
|
|
||||||
export POSTGRES_USER=llamastack
|
export POSTGRES_USER=llamastack
|
||||||
export POSTGRES_DB=llamastack
|
export POSTGRES_DB=llamastack
|
||||||
|
|
|
@ -269,6 +269,8 @@ def tool_chat_page():
|
||||||
if action and isinstance(action, dict):
|
if action and isinstance(action, dict):
|
||||||
tool_name = action.get("tool_name")
|
tool_name = action.get("tool_name")
|
||||||
tool_params = action.get("tool_params")
|
tool_params = action.get("tool_params")
|
||||||
|
if tool_name.endswith("_search"):
|
||||||
|
tool_name = "web_search"
|
||||||
with st.expander(f'🛠 Action: Using tool "{tool_name}"', expanded=False):
|
with st.expander(f'🛠 Action: Using tool "{tool_name}"', expanded=False):
|
||||||
st.json(tool_params)
|
st.json(tool_params)
|
||||||
|
|
||||||
|
|
Loading…
Add table
Add a link
Reference in a new issue