Mirror of https://github.com/meta-llama/llama-stack.git · synced 2025-07-14 09:06:10 +00:00
fix: update k8s templates (#2645)
Some checks failed
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 5s
Integration Tests / test-matrix (library, 3.12, datasets) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 9s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 18s
Integration Tests / test-matrix (library, 3.12, scoring) (push) Failing after 12s
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 16s
Integration Tests / test-matrix (library, 3.13, agents) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.12, inference) (push) Failing after 16s
Integration Tests / test-matrix (library, 3.12, post_training) (push) Failing after 12s
Integration Tests / test-matrix (library, 3.12, agents) (push) Failing after 14s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 22s
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 13s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 12s
Integration Tests / test-matrix (library, 3.13, datasets) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.13, scoring) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.13, inference) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.13, post_training) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.13, inspect) (push) Failing after 9s
Integration Tests / test-matrix (server, 3.12, inspect) (push) Failing after 10s
Integration Tests / test-matrix (server, 3.12, agents) (push) Failing after 14s
Integration Tests / test-matrix (server, 3.12, providers) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.13, providers) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.13, tool_runtime) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.13, vector_io) (push) Failing after 11s
Integration Tests / test-matrix (server, 3.12, inference) (push) Failing after 13s
Integration Tests / test-matrix (server, 3.12, tool_runtime) (push) Failing after 10s
Integration Tests / test-matrix (server, 3.12, datasets) (push) Failing after 9s
Integration Tests / test-matrix (server, 3.12, vector_io) (push) Failing after 12s
Integration Tests / test-matrix (server, 3.12, post_training) (push) Failing after 12s
Integration Tests / test-matrix (server, 3.13, inspect) (push) Failing after 15s
Integration Tests / test-matrix (server, 3.12, scoring) (push) Failing after 13s
Integration Tests / test-matrix (server, 3.13, datasets) (push) Failing after 17s
Integration Tests / test-matrix (server, 3.13, providers) (push) Failing after 11s
Integration Tests / test-matrix (server, 3.13, agents) (push) Failing after 12s
Integration Tests / test-matrix (server, 3.13, inference) (push) Failing after 14s
Integration Tests / test-matrix (server, 3.13, post_training) (push) Failing after 10s
Integration Tests / test-matrix (server, 3.13, tool_runtime) (push) Failing after 13s
Integration Tests / test-matrix (server, 3.13, scoring) (push) Failing after 15s
Integration Tests / test-matrix (server, 3.13, vector_io) (push) Failing after 11s
Vector IO Integration Tests / test-matrix (3.12, inline::faiss) (push) Failing after 12s
Vector IO Integration Tests / test-matrix (3.12, inline::milvus) (push) Failing after 13s
Vector IO Integration Tests / test-matrix (3.12, inline::sqlite-vec) (push) Failing after 8s
Vector IO Integration Tests / test-matrix (3.12, remote::pgvector) (push) Failing after 9s
Vector IO Integration Tests / test-matrix (3.12, remote::chromadb) (push) Failing after 11s
Vector IO Integration Tests / test-matrix (3.13, inline::faiss) (push) Failing after 11s
Vector IO Integration Tests / test-matrix (3.13, inline::milvus) (push) Failing after 11s
Vector IO Integration Tests / test-matrix (3.13, inline::sqlite-vec) (push) Failing after 15s
Python Package Build Test / build (3.12) (push) Failing after 33s
Vector IO Integration Tests / test-matrix (3.13, remote::chromadb) (push) Failing after 41s
Vector IO Integration Tests / test-matrix (3.13, remote::pgvector) (push) Failing after 40s
Python Package Build Test / build (3.13) (push) Failing after 33s
Test External Providers / test-external-providers (venv) (push) Failing after 8s
Update ReadTheDocs / update-readthedocs (push) Failing after 10s
Unit Tests / unit-tests (3.12) (push) Failing after 14s
Unit Tests / unit-tests (3.13) (push) Failing after 12s
Pre-commit / pre-commit (push) Successful in 1m23s
# What does this PR do?
- fix env variables
- use gpu for vllm
- add eks/apply.sh for aws
- add template to set hf secret

## Test Plan
bash apply.sh

Co-authored-by: Eric Huang <erichuang@fb.com>
Parent: daf660c4ea
Commit: 84fa83b788
9 changed files with 100 additions and 32 deletions
docs/source/distributions/eks/apply.sh (new executable file, +19)
```bash
#!/usr/bin/env bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

set -euo pipefail

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
K8S_DIR="${SCRIPT_DIR}/../k8s"

echo "Setting up AWS EKS-specific storage class..."
kubectl apply -f gp3-topology-aware.yaml

echo "Running main Kubernetes deployment..."
cd "${K8S_DIR}"
./apply.sh "$@"
```
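A usage sketch, under assumptions not stated in the commit: your kubeconfig already points at the EKS cluster, and `HF_TOKEN` holds a valid Hugging Face token, which the k8s `apply.sh` below requires:

```bash
# Hypothetical invocation from the repo root; hf_xxx is a placeholder token
export HF_TOKEN=hf_xxx
cd docs/source/distributions/eks
./apply.sh
```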
docs/source/distributions/eks/gp3-topology-aware.yaml (new file, +15)
```yaml
# Set up default storage class on AWS EKS
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-topology-aware
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
provisioner: ebs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
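Because the class is annotated as the cluster default and binds with `WaitForFirstConsumer`, PVCs that omit `storageClassName` will get topology-aware gp3 volumes provisioned in the pod's availability zone. A quick post-apply check (illustrative, not part of the commit):

```bash
# gp3-topology-aware should be listed and marked "(default)"
kubectl get storageclass
```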
docs/source/distributions/k8s/apply.sh (modified)

```diff
@@ -13,9 +13,22 @@ export POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-llamastack}
 export INFERENCE_MODEL=${INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}
 export SAFETY_MODEL=${SAFETY_MODEL:-meta-llama/Llama-Guard-3-1B}
 
+# HF_TOKEN should be set by the user; base64 encode it for the secret
+if [ -n "${HF_TOKEN:-}" ]; then
+  export HF_TOKEN_BASE64=$(echo -n "$HF_TOKEN" | base64)
+else
+  echo "ERROR: HF_TOKEN not set. You need it for vLLM to download models from Hugging Face."
+  exit 1
+fi
+
 set -euo pipefail
 set -x
+
+# Apply the HF token secret if HF_TOKEN is provided
+if [ -n "${HF_TOKEN:-}" ]; then
+  envsubst < ./hf-token-secret.yaml.template | kubectl apply -f -
+fi
 
 envsubst < ./vllm-k8s.yaml.template | kubectl apply -f -
 envsubst < ./vllm-safety-k8s.yaml.template | kubectl apply -f -
 envsubst < ./postgres-k8s.yaml.template | kubectl apply -f -
```
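Kubernetes `Secret.data` values must be base64-encoded, and the `echo -n` matters: without `-n`, a trailing newline would be encoded into the stored token. A round-trip sanity check with an illustrative placeholder value:

```bash
HF_TOKEN="hf_example"                            # placeholder, not a real token
HF_TOKEN_BASE64=$(echo -n "$HF_TOKEN" | base64)  # same encoding step as apply.sh
echo "$HF_TOKEN_BASE64" | base64 -d              # prints hf_example
```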
docs/source/distributions/k8s/hf-token-secret.yaml.template (new file, +7)

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  token: ${HF_TOKEN_BASE64}
```
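Once the k8s `apply.sh` has run `envsubst` over this template and applied it, the stored token can be spot-checked end to end (illustrative command, standard kubectl):

```bash
kubectl get secret hf-token-secret -o jsonpath='{.data.token}' | base64 -d
```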
docs/source/distributions/k8s/stack-configmap.yaml (modified)

```diff
@@ -22,10 +22,10 @@ data:
       - provider_id: vllm-safety
         provider_type: remote::vllm
         config:
-          url: ${env.VLLM_SAFETY_URL:http://localhost:8000/v1}
-          max_tokens: ${env.VLLM_MAX_TOKENS:4096}
-          api_token: ${env.VLLM_API_TOKEN:fake}
-          tls_verify: ${env.VLLM_TLS_VERIFY:true}
+          url: ${env.VLLM_SAFETY_URL:=http://localhost:8000/v1}
+          max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
+          api_token: ${env.VLLM_API_TOKEN:=fake}
+          tls_verify: ${env.VLLM_TLS_VERIFY:=true}
       - provider_id: sentence-transformers
         provider_type: inline::sentence-transformers
         config: {}
@@ -33,7 +33,7 @@ data:
       - provider_id: ${env.ENABLE_CHROMADB:+chromadb}
         provider_type: remote::chromadb
         config:
-          url: ${env.CHROMADB_URL:+}
+          url: ${env.CHROMADB_URL:=}
       safety:
       - provider_id: llama-guard
         provider_type: inline::llama-guard
@@ -48,7 +48,7 @@ data:
           host: ${env.POSTGRES_HOST:=localhost}
           port: ${env.POSTGRES_PORT:=5432}
           db: ${env.POSTGRES_DB:=llamastack}
-          user: ${env.POSTGRES_USER:llamastack}
+          user: ${env.POSTGRES_USER:=llamastack}
           password: ${env.POSTGRES_PASSWORD:=llamastack}
         responses_store:
           type: postgres
@@ -61,8 +61,8 @@ data:
       - provider_id: meta-reference
         provider_type: inline::meta-reference
         config:
-          service_name: ${env.OTEL_SERVICE_NAME:+}
-          sinks: ${env.TELEMETRY_SINKS:console}
+          service_name: "${env.OTEL_SERVICE_NAME:=\u200B}"
+          sinks: ${env.TELEMETRY_SINKS:=console}
       tool_runtime:
       - provider_id: brave-search
         provider_type: remote::brave-search
```
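The recurring edit across these hunks is the env-substitution operator: `${env.VAR:=default}` supplies the default when `VAR` is unset, while `${env.VAR:+value}` substitutes `value` only when `VAR` *is* set, mirroring bash parameter expansion; the bare-colon spellings like `${env.VLLM_MAX_TOKENS:4096}` were the broken form being fixed. A bash analogy of the two operators:

```bash
unset MAX_TOKENS
echo "${MAX_TOKENS:=4096}"            # unset -> prints the default: 4096
ENABLE_CHROMADB=1
echo "${ENABLE_CHROMADB:+chromadb}"   # set   -> prints: chromadb
unset ENABLE_CHROMADB
echo "${ENABLE_CHROMADB:+chromadb}"   # unset -> prints an empty line
```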
docs/source/distributions/k8s/stack_run_config.yaml (modified)

```diff
@@ -30,7 +30,7 @@ providers:
   - provider_id: ${env.ENABLE_CHROMADB:+chromadb}
     provider_type: remote::chromadb
     config:
-      url: ${env.CHROMADB_URL:+}
+      url: ${env.CHROMADB_URL:=}
   safety:
   - provider_id: llama-guard
     provider_type: inline::llama-guard
@@ -58,8 +58,8 @@ providers:
   - provider_id: meta-reference
     provider_type: inline::meta-reference
     config:
-      service_name: ${env.OTEL_SERVICE_NAME:+console}
-      sinks: ${env.TELEMETRY_SINKS:+console}
+      service_name: "${env.OTEL_SERVICE_NAME:=\u200B}"
+      sinks: ${env.TELEMETRY_SINKS:=console}
   tool_runtime:
   - provider_id: brave-search
     provider_type: remote::brave-search
```
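A hedged aside: the new `service_name` default is the literal escape `\u200B`, a zero-width space, presumably chosen because a truly empty default is not accepted here; functionally it amounts to an empty-looking OTEL service name.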
docs/source/distributions/k8s/vllm-k8s.yaml.template (modified)

```diff
@@ -25,16 +25,8 @@ spec:
         app.kubernetes.io/name: vllm
         workload-type: inference
     spec:
-      affinity:
-        podAntiAffinity:
-          requiredDuringSchedulingIgnoredDuringExecution:
-          - labelSelector:
-              matchExpressions:
-              - key: workload-type
-                operator: In
-                values:
-                - inference
-            topologyKey: kubernetes.io/hostname # Ensures no two inference pods on same node
+      nodeSelector:
+        eks.amazonaws.com/nodegroup: gpu
       containers:
       - name: vllm
         image: vllm/vllm-openai:latest
@@ -42,6 +34,8 @@ spec:
         args:
         - "vllm serve ${INFERENCE_MODEL} --dtype float16 --enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.6"
         env:
+        - name: INFERENCE_MODEL
+          value: "${INFERENCE_MODEL}"
         - name: HUGGING_FACE_HUB_TOKEN
           valueFrom:
             secretKeyRef:
@@ -49,6 +43,11 @@ spec:
               key: token
         ports:
         - containerPort: 8000
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+          requests:
+            nvidia.com/gpu: 1
         volumeMounts:
         - name: llama-storage
           mountPath: /root/.cache/huggingface
```
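The pod anti-affinity is replaced by a `nodeSelector` pinning inference pods to the EKS GPU nodegroup, and each pod now requests one `nvidia.com/gpu`, which also requires the NVIDIA device plugin on those nodes (typically preinstalled on EKS GPU AMIs). Two quick checks that the scheduler can satisfy this, illustrative and assuming a nodegroup named `gpu`:

```bash
# Nodes carrying the label the nodeSelector matches on
kubectl get nodes -L eks.amazonaws.com/nodegroup
# Allocatable vs requested GPUs on a given node (replace the placeholder name)
kubectl describe node <gpu-node-name> | grep 'nvidia.com/gpu'
```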
docs/source/distributions/k8s/vllm-safety-k8s.yaml.template (modified)

```diff
@@ -6,7 +6,6 @@ spec:
   accessModes:
   - ReadWriteOnce
   volumeMode: Filesystem
-  storageClassName: gp2
   resources:
     requests:
       storage: 30Gi
@@ -26,16 +25,8 @@ spec:
         app.kubernetes.io/name: vllm-safety
         workload-type: inference
     spec:
-      affinity:
-        podAntiAffinity:
-          requiredDuringSchedulingIgnoredDuringExecution:
-          - labelSelector:
-              matchExpressions:
-              - key: workload-type
-                operator: In
-                values:
-                - inference
-            topologyKey: kubernetes.io/hostname # Ensures no two inference pods on same node
+      nodeSelector:
+        eks.amazonaws.com/nodegroup: gpu
       containers:
       - name: vllm-safety
         image: vllm/vllm-openai:latest
@@ -44,6 +35,8 @@ spec:
           "vllm serve ${SAFETY_MODEL} --dtype float16 --enforce-eager --max-model-len 4096 --port 8001 --gpu-memory-utilization 0.3"
         ]
         env:
+        - name: SAFETY_MODEL
+          value: "${SAFETY_MODEL}"
         - name: HUGGING_FACE_HUB_TOKEN
           valueFrom:
             secretKeyRef:
@@ -51,6 +44,11 @@ spec:
               key: token
         ports:
         - containerPort: 8001
+        resources:
+          limits:
+            nvidia.com/gpu: 1
+          requests:
+            nvidia.com/gpu: 1
         volumeMounts:
         - name: llama-storage
           mountPath: /root/.cache/huggingface
```
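With `storageClassName: gp2` removed, this PVC falls back to the cluster default StorageClass, i.e. the `gp3-topology-aware` class installed by `eks/apply.sh`; its `WaitForFirstConsumer` binding then provisions the volume in the zone of the GPU node the pod lands on. To see which class a claim actually bound to (illustrative):

```bash
# The STORAGECLASS column shows the class each claim bound to
kubectl get pvc
```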
docs/source/distributions/kubernetes_deployment.md (modified)

````diff
@@ -5,6 +5,8 @@ Instead of starting the Llama Stack and vLLM servers locally. We can deploy them
 ### Prerequisites
 In this guide, we'll use a local [Kind](https://kind.sigs.k8s.io/) cluster and a vLLM inference service in the same cluster for demonstration purposes.
 
+Note: You can also deploy the Llama Stack server in an AWS EKS cluster. See [Deploying Llama Stack Server in AWS EKS](#deploying-llama-stack-server-in-aws-eks) for more details.
+
 First, create a local Kubernetes cluster via Kind:
 
 ```
@@ -217,3 +219,18 @@ Finally, we forward the Kubernetes service to a local port and test some inference
 kubectl port-forward service/llama-stack-service 5000:5000
 llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
 ```
+
+## Deploying Llama Stack Server in AWS EKS
+
+We've also provided a script to deploy the Llama Stack server in an AWS EKS cluster. Once you have an [EKS cluster](https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html), you can run the following script to deploy the Llama Stack server.
+
+
+```
+cd docs/source/distributions/eks
+./apply.sh
+```
+
+This script will:
+
+- Set up a default storage class for AWS EKS
+- Deploy the Llama Stack server in a Kubernetes Pod and Service
````