mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-12-31 06:19:59 +00:00
Merge branch 'main' into docs-4
This commit is contained in:
commit
d7c976c6d2
38 changed files with 4709 additions and 8876 deletions
3
docs/_static/css/my_theme.css
vendored
3
docs/_static/css/my_theme.css
vendored
|
|
@ -20,3 +20,6 @@
|
|||
h3 {
|
||||
font-weight: normal;
|
||||
}
|
||||
html[data-theme="dark"] .rst-content div[class^="highlight"] {
|
||||
background-color: #0b0b0b;
|
||||
}
|
||||
|
|
|
|||
9
docs/_static/js/detect_theme.js
vendored
Normal file
9
docs/_static/js/detect_theme.js
vendored
Normal file
|
|
@ -0,0 +1,9 @@
|
|||
document.addEventListener("DOMContentLoaded", function () {
|
||||
const prefersDark = window.matchMedia("(prefers-color-scheme: dark)").matches;
|
||||
const htmlElement = document.documentElement;
|
||||
if (prefersDark) {
|
||||
htmlElement.setAttribute("data-theme", "dark");
|
||||
} else {
|
||||
htmlElement.setAttribute("data-theme", "light");
|
||||
}
|
||||
});
|
||||
|
|
@ -112,6 +112,8 @@ html_theme_options = {
|
|||
# "style_nav_header_background": "#c3c9d4",
|
||||
}
|
||||
|
||||
default_dark_mode = False
|
||||
|
||||
html_static_path = ["../_static"]
|
||||
# html_logo = "../_static/llama-stack-logo.png"
|
||||
# html_style = "../_static/css/my_theme.css"
|
||||
|
|
@ -119,6 +121,7 @@ html_static_path = ["../_static"]
|
|||
|
||||
def setup(app):
|
||||
app.add_css_file("css/my_theme.css")
|
||||
app.add_js_file("js/detect_theme.js")
|
||||
|
||||
def dockerhub_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
|
||||
url = f"https://hub.docker.com/r/llamastack/{text}"
|
||||
|
|
|
|||
|
|
@ -7,13 +7,13 @@ In this guide, we'll use a local [Kind](https://kind.sigs.k8s.io/) cluster and a
|
|||
|
||||
First, create a local Kubernetes cluster via Kind:
|
||||
|
||||
```bash
|
||||
```
|
||||
kind create cluster --image kindest/node:v1.32.0 --name llama-stack-test
|
||||
```
|
||||
|
||||
First, create a Kubernetes PVC and Secret for downloading and storing Hugging Face model:
|
||||
|
||||
```bash
|
||||
```
|
||||
cat <<EOF |kubectl apply -f -
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
|
|
@ -39,7 +39,7 @@ data:
|
|||
|
||||
Next, start the vLLM server as a Kubernetes Deployment and Service:
|
||||
|
||||
```bash
|
||||
```
|
||||
cat <<EOF |kubectl apply -f -
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
|
|
@ -95,7 +95,7 @@ EOF
|
|||
|
||||
We can verify that the vLLM server has started successfully via the logs (this might take a couple of minutes to download the model):
|
||||
|
||||
```bash
|
||||
```
|
||||
$ kubectl logs -l app.kubernetes.io/name=vllm
|
||||
...
|
||||
INFO: Started server process [1]
|
||||
|
|
@ -119,7 +119,7 @@ providers:
|
|||
|
||||
Once we have defined the run configuration for Llama Stack, we can build an image with that configuration and the server source code:
|
||||
|
||||
```bash
|
||||
```
|
||||
cat >/tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s <<EOF
|
||||
FROM distribution-myenv:dev
|
||||
|
||||
|
|
@ -135,7 +135,7 @@ podman build -f /tmp/test-vllm-llama-stack/Containerfile.llama-stack-run-k8s -t
|
|||
|
||||
We can then start the Llama Stack server by deploying a Kubernetes Pod and Service:
|
||||
|
||||
```bash
|
||||
```
|
||||
cat <<EOF |kubectl apply -f -
|
||||
apiVersion: v1
|
||||
kind: PersistentVolumeClaim
|
||||
|
|
@ -195,7 +195,7 @@ EOF
|
|||
### Verifying the Deployment
|
||||
We can check that the LlamaStack server has started:
|
||||
|
||||
```bash
|
||||
```
|
||||
$ kubectl logs -l app.kubernetes.io/name=llama-stack
|
||||
...
|
||||
INFO: Started server process [1]
|
||||
|
|
@ -207,7 +207,7 @@ INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit
|
|||
|
||||
Finally, we forward the Kubernetes service to a local port and test some inference requests against it via the Llama Stack Client:
|
||||
|
||||
```bash
|
||||
```
|
||||
kubectl port-forward service/llama-stack-service 5000:5000
|
||||
llama-stack-client --endpoint http://localhost:5000 inference chat-completion --message "hello, what model are you?"
|
||||
```
|
||||
|
|
|
|||
|
|
@ -25,7 +25,7 @@ The `llamastack/distribution-remote-vllm` distribution consists of the following
|
|||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
You can use this distribution if you have GPUs and want to run an independent vLLM server container for running inference.
|
||||
You can use this distribution if you want to run an independent vLLM server for inference.
|
||||
|
||||
### Environment Variables
|
||||
|
||||
|
|
@ -41,7 +41,10 @@ The following environment variables can be configured:
|
|||
|
||||
## Setting up vLLM server
|
||||
|
||||
Both AMD and NVIDIA GPUs can serve as accelerators for the vLLM server, which acts as both the LLM inference provider and the safety provider.
|
||||
In the following sections, we'll use either AMD and NVIDIA GPUs to serve as hardware accelerators for the vLLM
|
||||
server, which acts as both the LLM inference provider and the safety provider. Note that vLLM also
|
||||
[supports many other hardware accelerators](https://docs.vllm.ai/en/latest/getting_started/installation.html) and
|
||||
that we only use GPUs here for demonstration purposes.
|
||||
|
||||
### Setting up vLLM server on AMD GPU
|
||||
|
||||
|
|
|
|||
|
|
@ -103,7 +103,5 @@ llama stack run together
|
|||
|
||||
2. Start Streamlit UI
|
||||
```bash
|
||||
cd llama_stack/distribution/ui
|
||||
pip install -r requirements.txt
|
||||
streamlit run app.py
|
||||
uv run --with ".[ui]" streamlit run llama_stack/distribution/ui/app.py
|
||||
```
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue