Merge branch 'main' into eval_api_final

2025-03-26 12:29:45 -07:00 · 2025-03-26 12:29:45 -07:00 · bc0cd07008
commit bc0cd07008
parent 7f12ea290f cb2a9784ab
79 changed files with 3257 additions and 2358 deletions
--- a/docs/source/distributions/kubernetes_deployment.md
+++ b/docs/source/distributions/kubernetes_deployment.md
@@ -8,7 +8,7 @@ First, create a local Kubernetes cluster via Kind:
 kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
 ```

-Start vLLM server as a Kubernetes Pod and Service:
+First, create a Kubernetes PVC and Secret for downloading and storing the Hugging Face model:

 ```bash
 cat <<EOF |kubectl apply -f -
@@ -31,7 +31,12 @@ metadata:
 type: Opaque
 data:
  token: $(HF_TOKEN)
---
+```
+
+Next, start the vLLM server as a Kubernetes Deployment and Service:
+
+```bash
+cat <<EOF |kubectl apply -f -
 apiVersion: apps/v1
 kind: Deployment
 metadata:
@@ -47,28 +52,23 @@ spec:
        app.kubernetes.io/name: vllm
    spec:
      containers:
-      - name: llama-stack
-        image: $(VLLM_IMAGE)
-        command:
-            - bash
-            - -c
-            - |
-              MODEL="meta-llama/Llama-3.2-1B-Instruct"
-              MODEL_PATH=/app/model/$(basename $MODEL)
-              huggingface-cli login --token $HUGGING_FACE_HUB_TOKEN
-              huggingface-cli download $MODEL --local-dir $MODEL_PATH --cache-dir $MODEL_PATH
-              python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --served-model-name $MODEL --port 8000
+      - name: vllm
+        image: vllm/vllm-openai:latest
+        command: ["/bin/sh", "-c"]
+        args: [
+          "vllm serve meta-llama/Llama-3.2-1B-Instruct"
+        ]
+        env:
+        - name: HUGGING_FACE_HUB_TOKEN
+          valueFrom:
+            secretKeyRef:
+              name: hf-token-secret
+              key: token
        ports:
          - containerPort: 8000
        volumeMounts:
          - name: llama-storage
-            mountPath: /app/model
-        env:
-          - name: HUGGING_FACE_HUB_TOKEN
-            valueFrom:
-              secretKeyRef:
-                name: hf-token-secret
-                key: token
+            mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
--- a/docs/source/distributions/remote_hosted_distro/nvidia.md
+++ b/docs/source/distributions/remote_hosted_distro/nvidia.md
@@ -8,6 +8,7 @@ The `llamastack/distribution-nvidia` distribution consists of the following prov
 | agents | `inline::meta-reference` |
 | datasetio | `inline::localfs` |
 | inference | `remote::nvidia` |
+| post_training | `remote::nvidia` |
 | safety | `remote::nvidia` |
 | telemetry | `inline::meta-reference` |
 | tool_runtime | `inline::rag-runtime` |
@@ -19,6 +20,12 @@ The `llamastack/distribution-nvidia` distribution consists of the following prov
 The following environment variables can be configured:

 - `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
+- `NVIDIA_USER_ID`: NVIDIA User ID (default: `llama-stack-user`)
+- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
+- `NVIDIA_ACCESS_POLICIES`: NVIDIA Access Policies (default: `{}`)
+- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
+- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
+- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
 - `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
 - `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
 - `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
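
The new `NVIDIA_*` variables added in this hunk can be set in a shell before starting the distribution. The following is a minimal sketch and not part of the commit: it falls back to the documented defaults when a variable is unset or empty, and it makes no assumption about how the distribution reads them.

```bash
#!/usr/bin/env bash
# Illustrative only: apply the documented defaults when a variable is unset
# or empty. How the distribution consumes these values is not shown here.
export NVIDIA_USER_ID="${NVIDIA_USER_ID:-llama-stack-user}"
export NVIDIA_DATASET_NAMESPACE="${NVIDIA_DATASET_NAMESPACE:-default}"
export NVIDIA_PROJECT_ID="${NVIDIA_PROJECT_ID:-test-project}"
export NVIDIA_CUSTOMIZER_URL="${NVIDIA_CUSTOMIZER_URL:-https://customizer.api.nvidia.com}"
export NVIDIA_OUTPUT_MODEL_DIR="${NVIDIA_OUTPUT_MODEL_DIR:-test-example-model@v1}"

# The '{}' default is assigned separately: a literal '}' inside ${VAR:-...}
# would terminate the expansion early.
if [ -z "${NVIDIA_ACCESS_POLICIES:-}" ]; then
  NVIDIA_ACCESS_POLICIES='{}'
fi
export NVIDIA_ACCESS_POLICIES

# Print the effective values for a quick check (the API key is left out).
printf 'NVIDIA_USER_ID=%s\n' "$NVIDIA_USER_ID"
printf 'NVIDIA_DATASET_NAMESPACE=%s\n' "$NVIDIA_DATASET_NAMESPACE"
printf 'NVIDIA_ACCESS_POLICIES=%s\n' "$NVIDIA_ACCESS_POLICIES"
printf 'NVIDIA_PROJECT_ID=%s\n' "$NVIDIA_PROJECT_ID"
printf 'NVIDIA_CUSTOMIZER_URL=%s\n' "$NVIDIA_CUSTOMIZER_URL"
printf 'NVIDIA_OUTPUT_MODEL_DIR=%s\n' "$NVIDIA_OUTPUT_MODEL_DIR"
```

Values exported beforehand take precedence, so a single variable can be overridden without editing the script.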