old md files deprecation

2025-12-07 18:57:21 +00:00 · 2025-09-22 13:43:53 -07:00 · 2025-09-22 13:43:53 -07:00 · c8e0fc1a7d
commit c8e0fc1a7d
parent 5b9bea02c3
173 changed files with 0 additions and 12955 deletions
--- a/docs/source/distributions/self_hosted_distro/dell-tgi.md
+++ b/docs/source/distributions/self_hosted_distro/dell-tgi.md
@@ -1,78 +0,0 @@
---
-orphan: true
---
-# Dell-TGI Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
-
-
-| **API**         	| **Inference** 	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
-|-----------------	|---------------	|----------------	|--------------------------------------------------	|----------------	|----------------	|
-| **Provider(s)** 	| remote::tgi   	| meta-reference 	| meta-reference, remote::pgvector, remote::chroma 	| meta-reference 	| meta-reference 	|
-
-
-The only difference vs. the `tgi` distribution is that it runs the Dell-TGI server for inference.
-
-
-### Start the Distribution (Single Node GPU)
-
-> [!NOTE]
-> This assumes you have access to a GPU on which to start the TGI server.
-
-```
-$ cd distributions/dell-tgi/
-$ ls
-compose.yaml  README.md  run.yaml
-$ docker compose up
-```
-
-The script first starts the TGI server, then starts the Llama Stack distribution server, which connects to the remote TGI provider for inference. You should see output like the following:
-```
-[text-generation-inference] | 2024-10-15T18:56:33.810397Z  INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
-[text-generation-inference] | 2024-10-15T18:56:33.810448Z  WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
-[text-generation-inference] | 2024-10-15T18:56:33.864143Z  INFO text_generation_router::server: router/src/server.rs:2353: Connected
-INFO:     Started server process [1]
-INFO:     Waiting for application startup.
-INFO:     Application startup complete.
-INFO:     Uvicorn running on http://[::]:8321 (Press CTRL+C to quit)
-```
-
-To stop the server:
-```
-docker compose down
-```
-
-### (Alternative) Dell-TGI server + llama stack run (Single Node GPU)
-
-#### Start Dell-TGI server locally
-```
-docker run -it --pull always --shm-size 1g -p 80:80 --gpus 4 \
-e NUM_SHARD=4 \
-e MAX_BATCH_PREFILL_TOKENS=32768 \
-e MAX_INPUT_TOKENS=8000 \
-e MAX_TOTAL_TOKENS=8192 \
-registry.dell.huggingface.co/enterprise-dell-inference-meta-llama-meta-llama-3.1-8b-instruct
-```
-
-
-#### Start Llama Stack server pointing to TGI server
-
-```
-docker run --pull always --network host -it -p 8321:8321 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
-```
-
-Make sure that in your `run.yaml` file, your inference provider points to the correct TGI server endpoint. For example:
-```
-inference:
-  - provider_id: tgi0
-    provider_type: remote::tgi
-    config:
-      url: http://127.0.0.1:5009
-```
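Before starting the Llama Stack container, it can help to confirm that the URL configured in `run.yaml` actually reaches a TGI server. The following is a minimal sketch and not part of the distribution: the helper names `tgi_health_url` and `check_tgi` are hypothetical, and it assumes the server exposes TGI's `/health` route and that `curl` is installed.

```shell
# Hypothetical helper: build the health-check URL from a base URL,
# dropping one trailing slash so the result never contains "//health".
tgi_health_url() {
  printf '%s/health\n' "${1%/}"
}

# Hypothetical helper: return 0 if the server answers the health check.
# -f fails on HTTP errors, -s hides progress output, -S still prints errors.
check_tgi() {
  curl -fsS --max-time 5 "$(tgi_health_url "$1")" >/dev/null
}

# Example (assumes a TGI server is listening at the URL from run.yaml):
#   check_tgi http://127.0.0.1:5009 && echo "TGI is reachable"
```

Keeping the URL construction separate from the network call makes it easy to reuse the same base URL that appears in the `url:` field of the provider config.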
--- a/docs/source/distributions/self_hosted_distro/dell.md
+++ b/docs/source/distributions/self_hosted_distro/dell.md
@@ -1,190 +0,0 @@
---
-orphan: true
---
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-
-# Dell Distribution of Llama Stack
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-dell` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `remote::tgi`, `inline::sentence-transformers` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-You can use this distribution if you have GPUs and want to run an independent TGI or Dell Enterprise Hub container for running inference.
-
-### Environment Variables
-
-The following environment variables can be configured:
-
- `DEH_URL`: URL for the Dell inference server (default: `http://0.0.0.0:8181`)
- `DEH_SAFETY_URL`: URL for the Dell safety inference server (default: `http://0.0.0.0:8282`)
- `CHROMA_URL`: URL for the Chroma server (default: `http://localhost:6601`)
- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
-
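If you prefer to have these variables available in your shell, so that later commands can reference them, the documented defaults can be mirrored with parameter expansion. This is a minimal sketch under that assumption; the function name `apply_dell_defaults` is hypothetical, and the values are copied from the list above.

```shell
# Hypothetical helper: fill in the documented default for any variable
# that is unset or empty, leaving values you already set untouched.
apply_dell_defaults() {
  : "${DEH_URL:=http://0.0.0.0:8181}"
  : "${DEH_SAFETY_URL:=http://0.0.0.0:8282}"
  : "${CHROMA_URL:=http://localhost:6601}"
  : "${INFERENCE_MODEL:=meta-llama/Llama-3.2-3B-Instruct}"
  : "${SAFETY_MODEL:=meta-llama/Llama-Guard-3-1B}"
  # Export so that child processes (docker, llama) can see the values.
  export DEH_URL DEH_SAFETY_URL CHROMA_URL INFERENCE_MODEL SAFETY_MODEL
}

# Example: override one value, then take the defaults for the rest.
#   INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
#   apply_dell_defaults
```

The `:=` form assigns only when the variable is unset or empty, so explicit exports made earlier in the session take precedence over the defaults.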
-
-## Setting up the inference server using Dell Enterprise Hub's custom TGI container
-
-NOTE: This is a placeholder to run inference with TGI. This will be updated to use [Dell Enterprise Hub's containers](https://dell.huggingface.co/authenticated/models) once verified.
-
-```bash
-export INFERENCE_PORT=8181
-export DEH_URL=http://0.0.0.0:$INFERENCE_PORT
-export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
-export CHROMADB_HOST=localhost
-export CHROMADB_PORT=6601
-export CHROMA_URL=http://$CHROMADB_HOST:$CHROMADB_PORT
-export CUDA_VISIBLE_DEVICES=0
-export LLAMA_STACK_PORT=8321
-
-docker run --rm -it \
-  --pull always \
-  --network host \
-  -v $HOME/.cache/huggingface:/data \
-  -e HF_TOKEN=$HF_TOKEN \
-  -p $INFERENCE_PORT:$INFERENCE_PORT \
-  --gpus $CUDA_VISIBLE_DEVICES \
-  ghcr.io/huggingface/text-generation-inference \
-  --dtype bfloat16 \
-  --usage-stats off \
-  --sharded false \
-  --cuda-memory-fraction 0.7 \
-  --model-id $INFERENCE_MODEL \
-  --port $INFERENCE_PORT --hostname 0.0.0.0
-```
-
-If you are using the Llama Stack Safety / Shield APIs, you also need to run a second TGI instance with a corresponding safety model such as `meta-llama/Llama-Guard-3-1B`, using a script like:
-
-```bash
-export SAFETY_INFERENCE_PORT=8282
-export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
-export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-export CUDA_VISIBLE_DEVICES=1
-
-docker run --rm -it \
-  --pull always \
-  --network host \
-  -v $HOME/.cache/huggingface:/data \
-  -e HF_TOKEN=$HF_TOKEN \
-  -p $SAFETY_INFERENCE_PORT:$SAFETY_INFERENCE_PORT \
-  --gpus $CUDA_VISIBLE_DEVICES \
-  ghcr.io/huggingface/text-generation-inference \
-  --dtype bfloat16 \
-  --usage-stats off \
-  --sharded false \
-  --cuda-memory-fraction 0.7 \
-  --model-id $SAFETY_MODEL \
-  --hostname 0.0.0.0 \
-  --port $SAFETY_INFERENCE_PORT
-```
-
-## Dell distribution relies on ChromaDB for vector database usage
-
-You can start ChromaDB easily in a container. The command below uses `podman`; `docker` accepts the same flags.
-```bash
-# This is where the indices are persisted
-mkdir -p $HOME/chromadb
-
-podman run --rm -it \
-  --network host \
-  --name chromadb \
-  -v $HOME/chromadb:/chroma/chroma \
-  -e IS_PERSISTENT=TRUE \
-  chromadb/chroma:latest \
-  --port $CHROMADB_PORT \
-  --host $CHROMADB_HOST
-```
-
-## Running Llama Stack
-
-Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via venv or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-# NOTE: the llama-stack / llama-models source mounts below are only needed when
-# testing local changes; point them at your own checkouts or remove that line.
-# Use localhost/distribution-dell:dev as the image if building / testing locally.
-docker run -it \
-  --pull always \
-  --network host \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v $HOME/.llama:/root/.llama \
-  -v /home/hjshah/git/llama-stack:/app/llama-stack-source -v /home/hjshah/git/llama-models:/app/llama-models-source \
-  llamastack/distribution-dell \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env DEH_URL=$DEH_URL \
-  --env CHROMA_URL=$CHROMA_URL
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-# You need a local checkout of llama-stack to run this, get it using
-# git clone https://github.com/meta-llama/llama-stack.git
-cd /path/to/llama-stack
-
-export SAFETY_INFERENCE_PORT=8282
-export DEH_SAFETY_URL=http://0.0.0.0:$SAFETY_INFERENCE_PORT
-export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v $HOME/.llama:/root/.llama \
-  -v ./llama_stack/distributions/tgi/run-with-safety.yaml:/root/my-run.yaml \
-  llamastack/distribution-dell \
-  --config /root/my-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env DEH_URL=$DEH_URL \
-  --env SAFETY_MODEL=$SAFETY_MODEL \
-  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
-  --env CHROMA_URL=$CHROMA_URL
-```
-
-### Via venv
-
-Make sure you have run `pip install llama-stack` and have the Llama Stack CLI available.
-
-```bash
-llama stack build --distro dell --image-type venv
-llama stack run dell \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env DEH_URL=$DEH_URL \
-  --env CHROMA_URL=$CHROMA_URL
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-llama stack run ./run-with-safety.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env DEH_URL=$DEH_URL \
-  --env SAFETY_MODEL=$SAFETY_MODEL \
-  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
-  --env CHROMA_URL=$CHROMA_URL
-```
--- a/docs/source/distributions/self_hosted_distro/meta-reference-gpu.md
+++ b/docs/source/distributions/self_hosted_distro/meta-reference-gpu.md
@@ -1,125 +0,0 @@
---
-orphan: true
---
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Meta Reference GPU Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations:
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `inline::meta-reference` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-Note that you need access to NVIDIA GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.
-
-### Environment Variables
-
-The following environment variables can be configured:
-
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
- `SAFETY_CHECKPOINT_DIR`: Directory containing the Llama-Guard model checkpoint (default: `null`)
-
-
-## Prerequisite: Downloading Models
-
-Please use `llama model list --downloaded` to check that you have Llama model checkpoints downloaded in `~/.llama` before proceeding. See the [installation guide](../../references/llama_cli_reference/download_models.md) for how to download the models. Run `llama model list` to see the models available for download, and `llama model download` to download the checkpoints.
-
-```
-$ llama model list --downloaded
-┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
-┃ Model                                   ┃ Size     ┃ Modified Time       ┃
-┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
-│ Llama3.2-1B-Instruct:int4-qlora-eo8     │ 1.53 GB  │ 2025-02-26 11:22:28 │
-├─────────────────────────────────────────┼──────────┼─────────────────────┤
-│ Llama3.2-1B                             │ 2.31 GB  │ 2025-02-18 21:48:52 │
-├─────────────────────────────────────────┼──────────┼─────────────────────┤
-│ Prompt-Guard-86M                        │ 0.02 GB  │ 2025-02-26 11:29:28 │
-├─────────────────────────────────────────┼──────────┼─────────────────────┤
-│ Llama3.2-3B-Instruct:int4-spinquant-eo8 │ 3.69 GB  │ 2025-02-26 11:37:41 │
-├─────────────────────────────────────────┼──────────┼─────────────────────┤
-│ Llama3.2-3B                             │ 5.99 GB  │ 2025-02-18 21:51:26 │
-├─────────────────────────────────────────┼──────────┼─────────────────────┤
-│ Llama3.1-8B                             │ 14.97 GB │ 2025-02-16 10:36:37 │
-├─────────────────────────────────────────┼──────────┼─────────────────────┤
-│ Llama3.2-1B-Instruct:int4-spinquant-eo8 │ 1.51 GB  │ 2025-02-26 11:35:02 │
-├─────────────────────────────────────────┼──────────┼─────────────────────┤
-│ Llama-Guard-3-1B                        │ 2.80 GB  │ 2025-02-26 11:20:46 │
-├─────────────────────────────────────────┼──────────┼─────────────────────┤
-│ Llama-Guard-3-1B:int4                   │ 0.43 GB  │ 2025-02-26 11:33:33 │
-└─────────────────────────────────────────┴──────────┴─────────────────────┘
-```
-
-## Running the Distribution
-
-You can do this via venv or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  --gpus all \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  llamastack/distribution-meta-reference-gpu \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-docker run \
-  -it \
-  --pull always \
-  --gpus all \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  llamastack/distribution-meta-reference-gpu \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
-  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-```
-
-### Via venv
-
-Make sure you have run `uv pip install llama-stack` and have the Llama Stack CLI available.
-
-```bash
-llama stack build --distro meta-reference-gpu --image-type venv
-llama stack run distributions/meta-reference-gpu/run.yaml \
-  --port 8321 \
-  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
-  --port 8321 \
-  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
-  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-```
--- a/docs/source/distributions/self_hosted_distro/nvidia.md
+++ b/docs/source/distributions/self_hosted_distro/nvidia.md
@@ -1,171 +0,0 @@
---
-orphan: true
---
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# NVIDIA Distribution
-
-The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `inline::localfs`, `remote::nvidia` |
-| eval | `remote::nvidia` |
-| files | `inline::localfs` |
-| inference | `remote::nvidia` |
-| post_training | `remote::nvidia` |
-| safety | `remote::nvidia` |
-| scoring | `inline::basic` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `inline::rag-runtime` |
-| vector_io | `inline::faiss` |
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
-
-### Models
-
-The following models are available by default:
-
- `meta/llama3-8b-instruct`
- `meta/llama3-70b-instruct`
- `meta/llama-3.1-8b-instruct`
- `meta/llama-3.1-70b-instruct`
- `meta/llama-3.1-405b-instruct`
- `meta/llama-3.2-1b-instruct`
- `meta/llama-3.2-3b-instruct`
- `meta/llama-3.2-11b-vision-instruct`
- `meta/llama-3.2-90b-vision-instruct`
- `meta/llama-3.3-70b-instruct`
- `nvidia/vila`
- `nvidia/llama-3.2-nv-embedqa-1b-v2`
- `nvidia/nv-embedqa-e5-v5`
- `nvidia/nv-embedqa-mistral-7b-v2`
- `snowflake/arctic-embed-l`
-
-
-## Prerequisites
-### NVIDIA API Keys
-
-Make sure you have access to an NVIDIA API key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
-
-### Deploy NeMo Microservices Platform
-The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.
-
-## Supported Services
-Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.
-
-### Inference: NVIDIA NIM
-NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
-  1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key)
-  2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.
-
-The deployed platform includes the NIM Proxy microservice, which provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
-
-### Datasetio API: NeMo Data Store
-The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
-
-See the {repopath}`NVIDIA Datasetio docs::llama_stack/providers/remote/datasetio/nvidia/README.md` for supported features and example usage.
-
-### Eval API: NeMo Evaluator
-The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
-
-See the {repopath}`NVIDIA Eval docs::llama_stack/providers/remote/eval/nvidia/README.md` for supported features and example usage.
-
-### Post-Training API: NeMo Customizer
-The NeMo Customizer microservice supports fine-tuning models. You can reference {repopath}`this list of supported models::llama_stack/providers/remote/post_training/nvidia/models.py` that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
-
-See the {repopath}`NVIDIA Post-Training docs::llama_stack/providers/remote/post_training/nvidia/README.md` for supported features and example usage.
-
-### Safety API: NeMo Guardrails
-The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
-
-See the {repopath}`NVIDIA Safety docs::llama_stack/providers/remote/safety/nvidia/README.md` for supported features and example usage.
-
-## Deploying models
-In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
-
-Note: For improved inference speed, use NIM with the `fast_outlines` guided decoding backend (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
-```sh
-# URL to NeMo NIM Proxy service
-export NEMO_URL="http://nemo.test"
-
-curl --location "$NEMO_URL/v1/deployment/model-deployments" \
-   -H 'accept: application/json' \
-   -H 'Content-Type: application/json' \
-   -d '{
-      "name": "llama-3.2-1b-instruct",
-      "namespace": "meta",
-      "config": {
-         "model": "meta/llama-3.2-1b-instruct",
-         "nim_deployment": {
-            "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
-            "image_tag": "1.8.3",
-            "pvc_size": "25Gi",
-            "gpu": 1,
-            "additional_envs": {
-               "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
-            }
-         }
-      }
-   }'
-```
-This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
-
-You can also remove a deployed NIM to free up GPU resources, if needed.
-```sh
-export NEMO_URL="http://nemo.test"
-
-curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
-```
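The deploy and delete requests above address the same resource: the deployment is created with a `namespace` and a `name`, and the delete call targets `<namespace>/<name>` under the same path. The sketch below builds that URL from its parts so that both calls stay in sync. It is an illustration only: the helper name `nim_deployment_url` is hypothetical, and the path layout is inferred from the two examples rather than from the NeMo API reference.

```shell
# Hypothetical helper: build the model-deployment URL for a NIM from the
# base URL, the namespace, and the deployment name.
nim_deployment_url() {
  printf '%s/v1/deployment/model-deployments/%s/%s\n' "${1%/}" "$2" "$3"
}

# Example (assumes NEMO_URL points at your NeMo deployment, as above):
#   curl -X DELETE "$(nim_deployment_url "$NEMO_URL" meta llama-3.2-1b-instruct)"
```

Passing the same namespace and name that were used in the deployment request avoids deleting a different NIM than the one that was deployed.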
-
-## Running Llama Stack with NVIDIA
-
-You can do this via venv (build code), or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ./run.yaml:/root/my-run.yaml \
-  llamastack/distribution-nvidia \
-  --config /root/my-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
-```
-
-### Via venv
-
-If you've set up your local development environment, you can also build the image using your local virtual environment.
-
-```bash
-INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
-llama stack build --distro nvidia --image-type venv
-llama stack run ./run.yaml \
-  --port 8321 \
-  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL
-```
-
-## Example Notebooks
-For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in {repopath}`docs/notebooks/nvidia`.
--- a/docs/source/distributions/self_hosted_distro/passthrough.md
+++ b/docs/source/distributions/self_hosted_distro/passthrough.md
@@ -1,42 +0,0 @@
---
-orphan: true
---
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Passthrough Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-passthrough` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `remote::passthrough`, `inline::sentence-transformers` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `remote::wolfram-alpha`, `inline::rag-runtime`, `remote::model-context-protocol` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
- `PASSTHROUGH_API_KEY`: Passthrough API Key (default: ``)
- `PASSTHROUGH_URL`: Passthrough URL (default: ``)
-
-### Models
-
-The following models are available by default:
-
- `llama3.1-8b-instruct`
- `llama3.2-11b-vision-instruct`
--- a/docs/source/distributions/self_hosted_distro/starter.md
+++ b/docs/source/distributions/self_hosted_distro/starter.md
@@ -1,232 +0,0 @@
---
-orphan: true
---
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Starter Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-starter` distribution is a comprehensive, multi-provider distribution that includes most of the available inference providers in Llama Stack. It's designed to be a one-stop solution for developers who want to experiment with different AI providers without having to configure each one individually.
-
-## Provider Composition
-
-The starter distribution consists of the following provider configurations:
-
-| API | Provider(s)                                                                                                                                                                                                                                                                                                                                    |
-|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| agents | `inline::meta-reference`                                                                                                                                                                                                                                                                                                                       |
-| datasetio | `remote::huggingface`, `inline::localfs`                                                                                                                                                                                                                                                                                                       |
-| eval | `inline::meta-reference`                                                                                                                                                                                                                                                                                                                       |
-| files | `inline::localfs`                                                                                                                                                                                                                                                                                                                              |
-| inference | `remote::openai`, `remote::fireworks`, `remote::together`, `remote::ollama`, `remote::anthropic`, `remote::gemini`, `remote::groq`, `remote::sambanova`, `remote::vllm`, `remote::tgi`, `remote::cerebras`, `remote::llama-openai-compat`, `remote::nvidia`, `remote::hf::serverless`, `remote::hf::endpoint`, `inline::sentence-transformers` |
-| safety | `inline::llama-guard`                                                                                                                                                                                                                                                                                                                          |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust`                                                                                                                                                                                                                                                                                  |
-| telemetry | `inline::meta-reference`                                                                                                                                                                                                                                                                                                                       |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`                                                                                                                                                                                                                                       |
-| vector_io | `inline::faiss`, `inline::sqlite-vec`, `inline::milvus`, `remote::chromadb`, `remote::pgvector`                                                                                                                                                                                                                                                 |
-
-## Inference Providers
-
-The starter distribution includes a comprehensive set of inference providers:
-
-### Hosted Providers
- **[OpenAI](https://openai.com/api/)**: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings -
-  provider ID: `openai` - reference documentation: [openai](../../providers/inference/remote_openai.md)
- **[Fireworks](https://fireworks.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and
-  embeddings - provider ID: `fireworks` - reference documentation: [fireworks](../../providers/inference/remote_fireworks.md)
- **[Together](https://together.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and
-  embeddings - provider ID: `together` - reference documentation: [together](../../providers/inference/remote_together.md)
- **[Anthropic](https://www.anthropic.com/)**: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings - provider ID: `anthropic` - reference documentation: [anthropic](../../providers/inference/remote_anthropic.md)
- **[Gemini](https://gemini.google.com/)**: Gemini 1.5, 2.0, 2.5 models and text embeddings - provider ID: `gemini` - reference documentation: [gemini](../../providers/inference/remote_gemini.md)
- **[Groq](https://groq.com/)**: Fast Llama models (3.1, 3.2, 3.3, 4 Scout, 4 Maverick) - provider ID: `groq` - reference documentation: [groq](../../providers/inference/remote_groq.md)
- **[SambaNova](https://www.sambanova.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models - provider ID: `sambanova` - reference documentation: [sambanova](../../providers/inference/remote_sambanova.md)
- **[Cerebras](https://www.cerebras.ai/)**: Cerebras AI models - provider ID: `cerebras` - reference documentation: [cerebras](../../providers/inference/remote_cerebras.md)
- **[NVIDIA](https://www.nvidia.com/)**: NVIDIA NIM - provider ID: `nvidia` - reference documentation: [nvidia](../../providers/inference/remote_nvidia.md)
- **[HuggingFace](https://huggingface.co/)**: Serverless and endpoint models - provider ID: `hf::serverless` and `hf::endpoint` - reference documentation: [huggingface-serverless](../../providers/inference/remote_hf_serverless.md) and [huggingface-endpoint](../../providers/inference/remote_hf_endpoint.md)
- **[Bedrock](https://aws.amazon.com/bedrock/)**: AWS Bedrock models - provider ID: `bedrock` - reference documentation: [bedrock](../../providers/inference/remote_bedrock.md)
-
-### Local/Remote Providers
- **[Ollama](https://ollama.ai/)**: Local Ollama models - provider ID: `ollama` - reference documentation: [ollama](../../providers/inference/remote_ollama.md)
- **[vLLM](https://docs.vllm.ai/en/latest/)**: Local or remote vLLM server - provider ID: `vllm` - reference documentation: [vllm](../../providers/inference/remote_vllm.md)
- **[TGI](https://github.com/huggingface/text-generation-inference)**: Text Generation Inference server, including Dell Enterprise Hub's custom TGI container (use `DEH_URL`) - provider ID: `tgi` - reference documentation: [tgi](../../providers/inference/remote_tgi.md)
- **[Sentence Transformers](https://www.sbert.net/)**: Local embedding models - provider ID: `sentence-transformers` - reference documentation: [sentence-transformers](../../providers/inference/inline_sentence-transformers.md)
-
-All providers are disabled by default, so you need to enable them by setting the corresponding environment variables.
-
-## Vector IO
-
-The starter distribution includes a comprehensive set of vector IO providers:
-
- **[FAISS](https://github.com/facebookresearch/faiss)**: Local FAISS vector store - enabled by
-  default - provider ID: `faiss`
- **[SQLite](https://www.sqlite.org/index.html)**: Local SQLite vector store - disabled by default - provider ID: `sqlite-vec`
- **[ChromaDB](https://www.trychroma.com/)**: Remote ChromaDB vector store - disabled by default - provider ID: `chromadb`
- **[PGVector](https://github.com/pgvector/pgvector)**: PostgreSQL vector store - disabled by default - provider ID: `pgvector`
- **[Milvus](https://milvus.io/)**: Milvus vector store - disabled by default - provider ID: `milvus`
-
-## Environment Variables
-
-The following environment variables can be configured:
-
-### Server Configuration
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-
-### API Keys for Hosted Providers
- `OPENAI_API_KEY`: OpenAI API key
- `FIREWORKS_API_KEY`: Fireworks API key
- `TOGETHER_API_KEY`: Together API key
- `ANTHROPIC_API_KEY`: Anthropic API key
- `GEMINI_API_KEY`: Google Gemini API key
- `GROQ_API_KEY`: Groq API key
- `SAMBANOVA_API_KEY`: SambaNova API key
- `CEREBRAS_API_KEY`: Cerebras API key
- `LLAMA_API_KEY`: Llama API key
- `NVIDIA_API_KEY`: NVIDIA API key
- `HF_API_TOKEN`: HuggingFace API token
-
-### Local Provider Configuration
- `OLLAMA_URL`: Ollama server URL (default: `http://localhost:11434`)
- `VLLM_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
- `VLLM_MAX_TOKENS`: vLLM max tokens (default: `4096`)
- `VLLM_API_TOKEN`: vLLM API token (default: `fake`)
- `VLLM_TLS_VERIFY`: vLLM TLS verification (default: `true`)
- `TGI_URL`: TGI server URL
-
-### Model Configuration
- `INFERENCE_MODEL`: HuggingFace model for serverless inference
- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name
-
-### Vector Database Configuration
- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
- `ENABLE_SQLITE_VEC`: Enable SQLite vector provider
- `ENABLE_CHROMADB`: Enable ChromaDB provider
- `ENABLE_PGVECTOR`: Enable PGVector provider
- `CHROMADB_URL`: ChromaDB server URL
- `PGVECTOR_HOST`: PGVector host (default: `localhost`)
- `PGVECTOR_PORT`: PGVector port (default: `5432`)
- `PGVECTOR_DB`: PGVector database name
- `PGVECTOR_USER`: PGVector username
- `PGVECTOR_PASSWORD`: PGVector password
-
-### Tool Configuration
- `BRAVE_SEARCH_API_KEY`: Brave Search API key
- `TAVILY_SEARCH_API_KEY`: Tavily Search API key
-
-### Telemetry Configuration
- `OTEL_SERVICE_NAME`: OpenTelemetry service name
- `TELEMETRY_SINKS`: Telemetry sinks (default: `console,sqlite`)
-
-## Enabling Providers
-
-You can enable specific providers by setting the appropriate environment variables. For example:
-
-```bash
-# self-hosted
-export OLLAMA_URL=http://localhost:11434   # enables the Ollama inference provider
-export VLLM_URL=http://localhost:8000/v1   # enables the vLLM inference provider
-export TGI_URL=http://localhost:8000/v1   # enables the TGI inference provider
-
-# cloud-hosted requiring API key configuration on the server
-export CEREBRAS_API_KEY=your_cerebras_api_key   # enables the Cerebras inference provider
-export NVIDIA_API_KEY=your_nvidia_api_key   # enables the NVIDIA inference provider
-
-# vector providers
-export MILVUS_URL=http://localhost:19530   # enables the Milvus vector provider
-export CHROMADB_URL=http://localhost:8000/v1   # enables the ChromaDB vector provider
-export PGVECTOR_DB=llama_stack_db   # enables the PGVector vector provider
-```
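
Before starting the server, it can help to confirm which of these variables are actually set in the current shell. The following is a minimal sketch, not part of the distribution: `check_provider_env` is a hypothetical helper name, and it only reports whether each named variable is non-empty. It does not verify that the server picked the provider up.

```shell
# Hypothetical helper: report which provider variables are set in this shell.
# Pass the variable names to check as arguments (names taken from the list above).
check_provider_env() {
  for var in "$@"; do
    # Indirect lookup of the variable named by $var; only pass trusted names.
    eval "value=\${$var:-}"
    if [ -n "$value" ]; then
      printf 'set:   %s\n' "$var"
    else
      printf 'unset: %s\n' "$var"
    fi
  done
}

# Example call using some of the variables from the block above.
check_provider_env OLLAMA_URL VLLM_URL TGI_URL CEREBRAS_API_KEY NVIDIA_API_KEY
```

Each line of output names one variable and whether it is set, which makes a missing export easy to spot before launching the distribution.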
-
-This distribution comes with a default "llama-guard" shield, which you can enable by setting the `SAFETY_MODEL` environment variable to an appropriate Llama Guard model ID. Use `llama-stack-client models list` to see the list of available models.
-
-## Running the Distribution
-
-You can run the starter distribution via Docker or venv.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -e OPENAI_API_KEY=your_openai_key \
-  -e FIREWORKS_API_KEY=your_fireworks_key \
-  -e TOGETHER_API_KEY=your_together_key \
-  llamastack/distribution-starter \
-  --port $LLAMA_STACK_PORT
-```
-
-### Via venv
-
-Ensure you have configured the starter distribution using the environment variables explained above.
-
-```bash
-uv run --with llama-stack llama stack build --distro starter --image-type venv --run
-```
-
-## Example Usage
-
-Once the distribution is running, you can use any of the available models. Here are some examples:
-
-### Using OpenAI Models
-```bash
-llama-stack-client --endpoint http://localhost:8321 \
-inference chat-completion \
--model-id openai/gpt-4o \
--message "Hello, how are you?"
-```
-
-### Using Fireworks Models
-```bash
-llama-stack-client --endpoint http://localhost:8321 \
-inference chat-completion \
--model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
--message "Write a short story about a robot."
-```
-
-### Using Local Ollama Models
-```bash
-# First, make sure Ollama is running and you have a model
-ollama run llama3.2:3b
-
-# Then use it through Llama Stack
-export OLLAMA_INFERENCE_MODEL=llama3.2:3b
-llama-stack-client --endpoint http://localhost:8321 \
-inference chat-completion \
--model-id ollama/llama3.2:3b \
--message "Explain quantum computing in simple terms."
-```
-
-## Storage
-
-The starter distribution uses SQLite for local storage of various components:
-
- **Metadata store**: `~/.llama/distributions/starter/registry.db`
- **Inference store**: `~/.llama/distributions/starter/inference_store.db`
- **FAISS store**: `~/.llama/distributions/starter/faiss_store.db`
- **SQLite vector store**: `~/.llama/distributions/starter/sqlite_vec.db`
- **Files metadata**: `~/.llama/distributions/starter/files_metadata.db`
- **Agents store**: `~/.llama/distributions/starter/agents_store.db`
- **Responses store**: `~/.llama/distributions/starter/responses_store.db`
- **Trace store**: `~/.llama/distributions/starter/trace_store.db`
- **Evaluation store**: `~/.llama/distributions/starter/meta_reference_eval.db`
- **Dataset I/O stores**: Various HuggingFace and local filesystem stores
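
To see which of these stores exist on a given machine, you can list the `.db` files in the storage directory. This is a small sketch under the assumption that the default location `~/.llama/distributions/starter` from the list above is in use; `list_starter_stores` is a hypothetical helper, and it accepts another directory as its first argument.

```shell
# Hypothetical helper: print the SQLite store files found in a storage directory.
# Defaults to the starter distribution path listed above.
list_starter_stores() {
  dir="${1:-$HOME/.llama/distributions/starter}"
  for f in "$dir"/*.db; do
    # Skip the unexpanded pattern when no .db files exist.
    [ -e "$f" ] || continue
    printf '%s\n' "$(basename "$f")"
  done
}

list_starter_stores
```

An empty result simply means no store has been created in that directory yet, for example because the distribution has not been run.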
-
-## Benefits of the Starter Distribution
-
-1. **Comprehensive Coverage**: Includes the most popular AI providers in one distribution
-2. **Flexible Configuration**: Easy to enable/disable providers based on your needs
-3. **No Local GPU Required**: Most providers are cloud-based, making it accessible to developers without high-end hardware
-4. **Easy Migration**: Start with hosted providers and gradually move to local ones as needed
-5. **Production Ready**: Includes safety, evaluation, and telemetry components
-6. **Tool Integration**: Comes with web search, RAG, and Model Context Protocol (MCP) tools
-
-The starter distribution is ideal for developers who want to experiment with different AI providers, build prototypes quickly, or create applications that can work with multiple AI backends.