feat: consolidate most distros into "starter" (#2516)

# What does this PR do?

* Removes a bunch of distros
* Removed distros were added into the "starter" distribution
* Doc for "starter" has been added
* Partially reverts https://github.com/meta-llama/llama-stack/pull/2482 since inference providers are disabled by default and can be turned on manually via env variable.
* Disables safety in starter distro

Closes: https://github.com/meta-llama/llama-stack/issues/2502.

~Needs: https://github.com/meta-llama/llama-stack/pull/2482 for Ollama to work properly in the CI.~

TODO:

- [ ] We can only update `install.sh` when we get a new release.
- [x] Update providers documentation
- [ ] Update notebooks to reference starter instead of ollama

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-08-21 09:23:13 +00:00 · 2025-07-04 15:58:03 +02:00 · 2025-07-04 15:58:03 +02:00 · c4349f532b
commit c4349f532b
parent f77d4d91f5
132 changed files with 1009 additions and 10845 deletions
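
The TODO in the commit message leaves notebooks still referencing the removed templates. As a rough illustration of the substitution this commit applies by hand in the docs (`--template tgi` and `--template ollama` becoming `--template starter`), here is a minimal Python sketch. The function name `retarget_to_starter` and the `REMOVED_TEMPLATES` set are hypothetical and not part of llama-stack; the set lists only templates whose docs this excerpt shows being replaced or deleted, so the real list may differ.

```python
import re

# Assumption: illustrative list taken from the docs touched in this excerpt.
# The full commit may remove other templates as well.
REMOVED_TEMPLATES = {
    "bedrock", "cerebras", "fireworks", "groq", "nvidia", "ollama", "tgi",
}

# Matches `--template NAME` or `--template=NAME` and captures the name.
_TEMPLATE_FLAG = re.compile(r"(--template[ =])([A-Za-z0-9_-]+)")


def retarget_to_starter(text: str) -> str:
    """Point `--template <removed>` at `starter`; leave other templates alone."""

    def swap(match: re.Match) -> str:
        flag, name = match.group(1), match.group(2)
        return flag + ("starter" if name in REMOVED_TEMPLATES else name)

    return _TEMPLATE_FLAG.sub(swap, text)


if __name__ == "__main__":
    print(retarget_to_starter("llama stack build --template ollama --image-type venv"))
```

Templates that are not in the set, such as `meta-reference-gpu`, pass through unchanged, so the helper can be run over a whole file without touching the distributions that remain.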
--- a/docs/source/distributions/building_distro.md
+++ b/docs/source/distributions/building_distro.md
@@ -141,9 +141,9 @@ You may then pick a template to build your distribution with providers fitted to

 For example, to build a distribution with TGI as the inference provider, you can run:
 ```
-$ llama stack build --template tgi
+$ llama stack build --template starter
 ...
-You can now edit ~/.llama/distributions/llamastack-tgi/tgi-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml`
+You can now edit ~/.llama/distributions/llamastack-starter/starter-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-starter/starter-run.yaml`
 ```
 :::
 :::{tab-item} Building from Scratch
@@ -183,26 +183,7 @@ You can now edit ~/.llama/distributions/llamastack-my-local-stack/my-local-stack
 - The config file will be of contents like the ones in `llama_stack/templates/*build.yaml`.

 ```
-$ cat llama_stack/templates/ollama/build.yaml
-
-name: ollama
-distribution_spec:
-  description: Like local, but use ollama for running LLM inference
-  providers:
-    inference: remote::ollama
-    memory: inline::faiss
-    safety: inline::llama-guard
-    agents: inline::meta-reference
-    telemetry: inline::meta-reference
-image_name: ollama
-image_type: conda
-
-# If some providers are external, you can specify the path to the implementation
-external_providers_dir: ~/.llama/providers.d
-```
-
-```
-llama stack build --config llama_stack/templates/ollama/build.yaml
+llama stack build --config llama_stack/templates/starter/build.yaml
 ```
 :::

@@ -268,11 +249,11 @@ Podman is supported as an alternative to Docker. Set `CONTAINER_BINARY` to `podm
 To build a container image, you may start off from a template and use the `--image-type container` flag to specify `container` as the build image type.

 ```
-llama stack build --template ollama --image-type container
+llama stack build --template starter --image-type container
 ```

 ```
-$ llama stack build --template ollama --image-type container
+$ llama stack build --template starter --image-type container
 ...
 Containerfile created successfully in /tmp/tmp.viA3a3Rdsg/ContainerfileFROM python:3.10-slim
 ...
--- a/docs/source/distributions/importing_as_library.md
+++ b/docs/source/distributions/importing_as_library.md
@@ -6,7 +6,7 @@ This avoids the overhead of setting up a server.
 ```bash
 # setup
 uv pip install llama-stack
-llama stack build --template ollama --image-type venv
+llama stack build --template starter --image-type venv
 ```

 ```python
--- a/docs/source/distributions/list_of_distributions.md
+++ b/docs/source/distributions/list_of_distributions.md
@@ -1,51 +1,94 @@
-# Available List of Distributions
+# Available Distributions

-Here are a list of distributions you can use to start a Llama Stack server that are provided out of the box.
+Llama Stack provides several pre-configured distributions to help you get started quickly. Choose the distribution that best fits your hardware and use case.

-## Selection of a Distribution / Template
+## Quick Reference

-Which templates / distributions to choose depends on the hardware you have for running LLM inference.
+| Distribution | Use Case | Hardware Requirements | Provider |
+|--------------|----------|----------------------|----------|
+| `distribution-starter` | General purpose, prototyping | Any (CPU/GPU) | Ollama, Remote APIs |
+| `distribution-meta-reference-gpu` | High-performance inference | GPU required | Local GPU inference |
+| Remote-hosted | Production, managed service | None | Partner providers |
+| iOS/Android SDK | Mobile applications | Mobile device | On-device inference |

-- **Do you want a hosted Llama Stack endpoint?** If so, we suggest leveraging our partners who host Llama Stack endpoints. Namely, _fireworks.ai_ and _together.xyz_.
-  - Read more about it here - [Remote-Hosted Endpoints](remote_hosted_distro/index).
+## Choose Your Distribution

+### 🚀 Getting Started (Recommended for Beginners)

-- **Do you have access to machines with GPUs?** If you wish to run Llama Stack locally or on a cloud instance and host your own Llama Stack endpoint, we suggest:
-  - {dockerhub}`distribution-remote-vllm` ([Guide](self_hosted_distro/remote-vllm))
-  - {dockerhub}`distribution-meta-reference-gpu` ([Guide](self_hosted_distro/meta-reference-gpu))
-  - {dockerhub}`distribution-tgi` ([Guide](self_hosted_distro/tgi))
-  - {dockerhub}`distribution-nvidia` ([Guide](self_hosted_distro/nvidia))
+**Use `distribution-starter` if you want to:**
+- Prototype quickly without GPU requirements
+- Use remote inference providers (Fireworks, Together, vLLM etc.)
+- Run locally with Ollama for development

-- **Are you running on a "regular" desktop or laptop ?** We suggest using the ollama template for quick prototyping and get started without having to worry about needing GPUs.
-  - {dockerhub}`distribution-ollama` ([Guide](self_hosted_distro/ollama))
+```bash
+docker pull llama-stack/distribution-starter
+```

-- **Do you have an API key for a remote inference provider like Fireworks, Together, etc.?**  If so, we suggest:
-  - {dockerhub}`distribution-together` ([Guide](self_hosted_distro/together))
-  - {dockerhub}`distribution-fireworks` ([Guide](self_hosted_distro/fireworks))
+**Guides:** [Starter Distribution Guide](self_hosted_distro/starter)

-- **Do you want to run Llama Stack inference on your iOS / Android device?**  Lastly, we also provide templates for running Llama Stack inference on your iOS / Android device:
-  - [iOS SDK](ondevice_distro/ios_sdk)
-  - [Android](ondevice_distro/android_sdk)
+### 🖥️ Self-Hosted with GPU

+**Use `distribution-meta-reference-gpu` if you:**
+- Have access to GPU hardware
+- Want maximum performance and control
+- Need to run inference locally

-- **If none of the above fit your needs, you can also build your own [custom distribution](building_distro.md).**
+```bash
+docker pull llama-stack/distribution-meta-reference-gpu
+```

-### Distribution Details
+**Guides:** [Meta Reference GPU Guide](self_hosted_distro/meta-reference-gpu)
+
+### ☁️ Managed Hosting
+
+**Use remote-hosted endpoints if you:**
+- Don't want to manage infrastructure
+- Need production-ready reliability
+- Prefer managed services
+
+**Partners:** [Fireworks.ai](https://fireworks.ai) and [Together.xyz](https://together.xyz)
+
+**Guides:** [Remote-Hosted Endpoints](remote_hosted_distro/index)
+
+### 📱 Mobile Development
+
+**Use mobile SDKs if you:**
+- Are building iOS or Android applications
+- Need on-device inference capabilities
+- Want offline functionality
+
+- [iOS SDK](ondevice_distro/ios_sdk)
+- [Android SDK](ondevice_distro/android_sdk)
+
+### 🔧 Custom Solutions
+
+**Build your own distribution if:**
+- None of the above fit your specific needs
+- You need custom configurations
+- You want to optimize for your specific use case
+
+**Guides:** [Building Custom Distributions](building_distro.md)
+
+## Detailed Documentation
+
+### Self-Hosted Distributions
+
+```{toctree}
+:maxdepth: 1
+
+self_hosted_distro/starter
+self_hosted_distro/meta-reference-gpu
+```
+
+### Remote-Hosted Solutions

 ```{toctree}
 :maxdepth: 1

 remote_hosted_distro/index
-self_hosted_distro/remote-vllm
-self_hosted_distro/meta-reference-gpu
-self_hosted_distro/tgi
-self_hosted_distro/nvidia
-self_hosted_distro/ollama
-self_hosted_distro/together
-self_hosted_distro/fireworks
 ```

-### On-Device Distributions
+### Mobile SDKs

 ```{toctree}
 :maxdepth: 1
@@ -53,3 +96,25 @@ self_hosted_distro/fireworks
 ondevice_distro/ios_sdk
 ondevice_distro/android_sdk
 ```
+
+## Decision Flow
+
+```mermaid
+graph TD
+    A[What's your use case?] --> B{Need mobile app?}
+    B -->|Yes| C[Use Mobile SDKs]
+    B -->|No| D{Have GPU hardware?}
+    D -->|Yes| E[Use Meta Reference GPU]
+    D -->|No| F{Want managed hosting?}
+    F -->|Yes| G[Use Remote-Hosted]
+    F -->|No| H[Use Starter Distribution]
+```
+
+## Next Steps
+
+1. **Choose your distribution** from the options above
+2. **Follow the setup guide** for your selected distribution
+3. **Configure your providers** with API keys or local models
+4. **Start building** with Llama Stack!
+
+For help choosing or troubleshooting, check our [Getting Started Guide](../getting_started/index.md) or [Community Support](https://github.com/llama-stack/llama-stack/discussions).
--- a/docs/source/distributions/self_hosted_distro/bedrock.md
+++ b/docs/source/distributions/self_hosted_distro/bedrock.md
@@ -1,79 +0,0 @@
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Bedrock Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-bedrock` distribution consists of the following provider configurations:
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `remote::bedrock` |
-| safety | `remote::bedrock` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-
-### Models
-
-The following models are available by default:
-
-- `meta.llama3-1-8b-instruct-v1:0 (aliases: meta-llama/Llama-3.1-8B-Instruct)`
-- `meta.llama3-1-70b-instruct-v1:0 (aliases: meta-llama/Llama-3.1-70B-Instruct)`
-- `meta.llama3-1-405b-instruct-v1:0 (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
-
-
-### Prerequisite: API Keys
-
-Make sure you have access to an AWS Bedrock API Key. You can get one by visiting [AWS Bedrock](https://aws.amazon.com/bedrock/).
-
-
-## Running Llama Stack with AWS Bedrock
-
-You can do this via Conda (build code) or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  llamastack/distribution-bedrock \
-  --port $LLAMA_STACK_PORT \
-  --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
-  --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
-  --env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
-  --env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
-```
-
-### Via Conda
-
-```bash
-llama stack build --template bedrock --image-type conda
-llama stack run ./run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
-  --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
-  --env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
-  --env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
-```
--- a/docs/source/distributions/self_hosted_distro/cerebras.md
+++ b/docs/source/distributions/self_hosted_distro/cerebras.md
@@ -1,67 +0,0 @@
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Cerebras Distribution
-
-The `llamastack/distribution-cerebras` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `remote::cerebras`, `inline::sentence-transformers` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-- `CEREBRAS_API_KEY`: Cerebras API Key (default: ``)
-
-### Models
-
-The following models are available by default:
-
-- `llama3.1-8b (aliases: meta-llama/Llama-3.1-8B-Instruct)`
-- `llama-3.3-70b (aliases: meta-llama/Llama-3.3-70B-Instruct)`
-
-
-### Prerequisite: API Keys
-
-Make sure you have access to a Cerebras API Key. You can get one by visiting [cloud.cerebras.ai](https://cloud.cerebras.ai/).
-
-
-## Running Llama Stack with Cerebras
-
-You can do this via Conda (build code) or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ./run.yaml:/root/my-run.yaml \
-  llamastack/distribution-cerebras \
-  --config /root/my-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
-```
-
-### Via Conda
-
-```bash
-llama stack build --template cerebras --image-type conda
-llama stack run ./run.yaml \
-  --port 8321 \
-  --env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
-```
--- a/docs/source/distributions/self_hosted_distro/fireworks.md
+++ b/docs/source/distributions/self_hosted_distro/fireworks.md
@@ -1,86 +0,0 @@
----
-orphan: true
----
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Fireworks Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-fireworks` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| files | `inline::localfs` |
-| inference | `remote::fireworks`, `inline::sentence-transformers` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `remote::wolfram-alpha`, `inline::rag-runtime`, `remote::model-context-protocol` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-- `FIREWORKS_API_KEY`: Fireworks.AI API Key (default: ``)
-
-### Models
-
-The following models are available by default:
-
-- `accounts/fireworks/models/llama-v3p1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
-- `accounts/fireworks/models/llama-v3p1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)`
-- `accounts/fireworks/models/llama-v3p1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
-- `accounts/fireworks/models/llama-v3p2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
-- `accounts/fireworks/models/llama-v3p2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
-- `accounts/fireworks/models/llama-v3p2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
-- `accounts/fireworks/models/llama-v3p3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
-- `accounts/fireworks/models/llama-guard-3-8b (aliases: meta-llama/Llama-Guard-3-8B)`
-- `accounts/fireworks/models/llama-guard-3-11b-vision (aliases: meta-llama/Llama-Guard-3-11B-Vision)`
-- `accounts/fireworks/models/llama4-scout-instruct-basic (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct)`
-- `accounts/fireworks/models/llama4-maverick-instruct-basic (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct)`
-- `nomic-ai/nomic-embed-text-v1.5 `
-
-
-### Prerequisite: API Keys
-
-Make sure you have access to a Fireworks API Key. You can get one by visiting [fireworks.ai](https://fireworks.ai/).
-
-
-## Running Llama Stack with Fireworks
-
-You can do this via Conda (build code) or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  llamastack/distribution-fireworks \
-  --port $LLAMA_STACK_PORT \
-  --env FIREWORKS_API_KEY=$FIREWORKS_API_KEY
-```
-
-### Via Conda
-
-```bash
-llama stack build --template fireworks --image-type conda
-llama stack run ./run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env FIREWORKS_API_KEY=$FIREWORKS_API_KEY
-```
--- a/docs/source/distributions/self_hosted_distro/groq.md
+++ b/docs/source/distributions/self_hosted_distro/groq.md
@@ -1,82 +0,0 @@
----
-orphan: true
----
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Groq Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-groq` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `remote::groq` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime` |
-| vector_io | `inline::faiss` |
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-- `GROQ_API_KEY`: Groq API Key (default: ``)
-
-### Models
-
-The following models are available by default:
-
-- `groq/llama3-8b-8192 (aliases: meta-llama/Llama-3.1-8B-Instruct)`
-- `groq/llama-3.1-8b-instant `
-- `groq/llama3-70b-8192 (aliases: meta-llama/Llama-3-70B-Instruct)`
-- `groq/llama-3.3-70b-versatile (aliases: meta-llama/Llama-3.3-70B-Instruct)`
-- `groq/llama-3.2-3b-preview (aliases: meta-llama/Llama-3.2-3B-Instruct)`
-- `groq/llama-4-scout-17b-16e-instruct (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct)`
-- `groq/meta-llama/llama-4-scout-17b-16e-instruct (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct)`
-- `groq/llama-4-maverick-17b-128e-instruct (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct)`
-- `groq/meta-llama/llama-4-maverick-17b-128e-instruct (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct)`
-
-
-### Prerequisite: API Keys
-
-Make sure you have access to a Groq API Key. You can get one by visiting [Groq](https://api.groq.com/).
-
-
-## Running Llama Stack with Groq
-
-You can do this via Conda (build code) or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  llamastack/distribution-groq \
-  --port $LLAMA_STACK_PORT \
-  --env GROQ_API_KEY=$GROQ_API_KEY
-```
-
-### Via Conda
-
-```bash
-llama stack build --template groq --image-type conda
-llama stack run ./run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env GROQ_API_KEY=$GROQ_API_KEY
-```
--- a/docs/source/distributions/self_hosted_distro/nvidia.md
+++ b/docs/source/distributions/self_hosted_distro/nvidia.md
@@ -1,177 +0,0 @@
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# NVIDIA Distribution
-
-The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `inline::localfs`, `remote::nvidia` |
-| eval | `remote::nvidia` |
-| inference | `remote::nvidia` |
-| post_training | `remote::nvidia` |
-| safety | `remote::nvidia` |
-| scoring | `inline::basic` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `inline::rag-runtime` |
-| vector_io | `inline::faiss` |
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
-- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
-- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
-- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
-- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
-- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
-- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
-- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
-- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
-- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
-- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
-
-### Models
-
-The following models are available by default:
-
-- `meta/llama3-8b-instruct (aliases: meta-llama/Llama-3-8B-Instruct)`
-- `meta/llama3-70b-instruct (aliases: meta-llama/Llama-3-70B-Instruct)`
-- `meta/llama-3.1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
-- `meta/llama-3.1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)`
-- `meta/llama-3.1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
-- `meta/llama-3.2-1b-instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)`
-- `meta/llama-3.2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
-- `meta/llama-3.2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
-- `meta/llama-3.2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
-- `meta/llama-3.3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
-- `nvidia/llama-3.2-nv-embedqa-1b-v2 `
-- `nvidia/nv-embedqa-e5-v5 `
-- `nvidia/nv-embedqa-mistral-7b-v2 `
-- `snowflake/arctic-embed-l `
-
-
-## Prerequisites
-### NVIDIA API Keys
-
-Make sure you have access to an NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
-
-### Deploy NeMo Microservices Platform
-The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.
-
-## Supported Services
-Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.
-
-### Inference: NVIDIA NIM
-NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
-  1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key)
-  2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.
-
-The deployed platform includes the NIM Proxy microservice, which is the service that provides access to your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
-
-### Datasetio API: NeMo Data Store
-The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposes APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
-
-See the {repopath}`NVIDIA Datasetio docs::llama_stack/providers/remote/datasetio/nvidia/README.md` for supported features and example usage.
-
-### Eval API: NeMo Evaluator
-The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
-
-See the {repopath}`NVIDIA Eval docs::llama_stack/providers/remote/eval/nvidia/README.md` for supported features and example usage.
-
-### Post-Training API: NeMo Customizer
-The NeMo Customizer microservice supports fine-tuning models. You can reference {repopath}`this list of supported models::llama_stack/providers/remote/post_training/nvidia/models.py` that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
-
-See the {repopath}`NVIDIA Post-Training docs::llama_stack/providers/remote/post_training/nvidia/README.md` for supported features and example usage.
-
-### Safety API: NeMo Guardrails
-The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
-
-See the {repopath}`NVIDIA Safety docs::llama_stack/providers/remote/safety/nvidia/README.md` for supported features and example usage.
-
-## Deploying models
-In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
-
-Note: For improved inference speeds, we need to use NIM with `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
-```sh
-# URL to NeMo NIM Proxy service
-export NEMO_URL="http://nemo.test"
-
-curl --location "$NEMO_URL/v1/deployment/model-deployments" \
-   -H 'accept: application/json' \
-   -H 'Content-Type: application/json' \
-   -d '{
-      "name": "llama-3.2-1b-instruct",
-      "namespace": "meta",
-      "config": {
-         "model": "meta/llama-3.2-1b-instruct",
-         "nim_deployment": {
-            "image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
-            "image_tag": "1.8.3",
-            "pvc_size": "25Gi",
-            "gpu": 1,
-            "additional_envs": {
-               "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
-            }
-         }
-      }
-   }'
-```
-This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
-
-You can also remove a deployed NIM to free up GPU resources, if needed.
-```sh
-export NEMO_URL="http://nemo.test"
-
-curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
-```
-
-## Running Llama Stack with NVIDIA
-
-You can do this via Conda or venv (build code), or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ./run.yaml:/root/my-run.yaml \
-  llamastack/distribution-nvidia \
-  --config /root/my-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
-```
-
-### Via Conda
-
-```bash
-INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
-llama stack build --template nvidia --image-type conda
-llama stack run ./run.yaml \
-  --port 8321 \
-  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL
-```
-
-### Via venv
-
-If you've set up your local development environment, you can also build the image using your local virtual environment.
-
-```bash
-INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
-llama stack build --template nvidia --image-type venv
-llama stack run ./run.yaml \
-  --port 8321 \
-  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL
-```
-
-## Example Notebooks
-For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in {repopath}`docs/notebooks/nvidia`.
--- a/docs/source/distributions/self_hosted_distro/ollama.md
+++ b/docs/source/distributions/self_hosted_distro/ollama.md
@@ -1,165 +0,0 @@
----
-orphan: true
----
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Ollama Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-ollama` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| files | `inline::localfs` |
-| inference | `remote::ollama` |
-| post_training | `inline::huggingface` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-You should use this distribution if you have a regular desktop machine without very powerful GPUs. Of course, if you have powerful GPUs, you can still continue using this distribution since Ollama supports GPU acceleration.
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-- `OLLAMA_URL`: URL of the Ollama server (default: `http://127.0.0.1:11434`)
-- `INFERENCE_MODEL`: Inference model loaded into the Ollama server (default: `meta-llama/Llama-3.2-3B-Instruct`)
-- `SAFETY_MODEL`: Safety model loaded into the Ollama server (default: `meta-llama/Llama-Guard-3-1B`)
-
-
-## Setting up Ollama server
-
-Please check the [Ollama Documentation](https://github.com/ollama/ollama) on how to install and run Ollama. After installing Ollama, you need to run `ollama serve` to start the server.
-
-In order to load models, you can run:
-
-```bash
-export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
-
-# ollama names this model differently, and we must use the ollama name when loading the model
-export OLLAMA_INFERENCE_MODEL="llama3.2:3b-instruct-fp16"
-ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
-```
-
-If you are using Llama Stack Safety / Shield APIs, you will also need to pull and run the safety model.
-
-```bash
-export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
-
-# ollama names this model differently, and we must use the ollama name when loading the model
-export OLLAMA_SAFETY_MODEL="llama-guard3:1b"
-ollama run $OLLAMA_SAFETY_MODEL --keepalive 60m
-```
-
-## Running Llama Stack
-
-Now you are ready to run Llama Stack with Ollama as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-export LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  llamastack/distribution-ollama \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env OLLAMA_URL=http://host.docker.internal:11434
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-# You need a local checkout of llama-stack to run this, get it using
-# git clone https://github.com/meta-llama/llama-stack.git
-cd /path/to/llama-stack
-
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  -v ./llama_stack/templates/ollama/run-with-safety.yaml:/root/my-run.yaml \
-  llamastack/distribution-ollama \
-  --config /root/my-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env SAFETY_MODEL=$SAFETY_MODEL \
-  --env OLLAMA_URL=http://host.docker.internal:11434
-```
-
-### Via Conda
-
-Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
-
-```bash
-export LLAMA_STACK_PORT=8321
-
-llama stack build --template ollama --image-type conda
-llama stack run ./run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env OLLAMA_URL=http://localhost:11434
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-llama stack run ./run-with-safety.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env SAFETY_MODEL=$SAFETY_MODEL \
-  --env OLLAMA_URL=http://localhost:11434
-```
-
-
-### (Optional) Update Model Serving Configuration
-
-```{note}
-Please check the [model_entries](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/ollama/models.py) for the supported Ollama models.
-```
-
-To serve a new model with `ollama`
-```bash
-ollama run <model_name>
-```
-
-To make sure that the model is being served correctly, run `ollama ps` to get a list of models being served by ollama.
-```
-$ ollama ps
-NAME                         ID              SIZE      PROCESSOR    UNTIL
-llama3.2:3b-instruct-fp16    195a8c01d91e    8.6 GB    100% GPU     9 minutes from now
-```
-
-To verify that the model served by ollama is correctly connected to Llama Stack server
-```bash
-$ llama-stack-client models list
-
-Available Models
-
-┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
-┃ model_type   ┃ identifier                           ┃ provider_resource_id         ┃ metadata  ┃ provider_id ┃
-┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
-│ llm          │ meta-llama/Llama-3.2-3B-Instruct     │ llama3.2:3b-instruct-fp16    │           │ ollama      │
-└──────────────┴──────────────────────────────────────┴──────────────────────────────┴───────────┴─────────────┘
-
-Total models: 1
-```
--- a/docs/source/distributions/self_hosted_distro/remote-vllm.md
+++ b/docs/source/distributions/self_hosted_distro/remote-vllm.md
@ -1,297 +0,0 @@
---
-orphan: true
---
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Remote vLLM Distribution
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-remote-vllm` distribution consists of the following provider configurations:
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `remote::vllm`, `inline::sentence-transformers` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-You can use this distribution if you want to run an independent vLLM server for inference.
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-- `INFERENCE_MODEL`: Inference model loaded into the vLLM server (default: `meta-llama/Llama-3.2-3B-Instruct`)
-- `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100/v1`)
-- `MAX_TOKENS`: Maximum number of tokens for generation (default: `4096`)
-- `SAFETY_VLLM_URL`: URL of the vLLM server with the safety model (default: `http://host.docker.internal:5101/v1`)
-- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
-
-
-## Setting up vLLM server
-
-In the following sections, we'll use AMD, NVIDIA or Intel GPUs to serve as hardware accelerators for the vLLM
-server, which acts as both the LLM inference provider and the safety provider. Note that vLLM also
-[supports many other hardware accelerators](https://docs.vllm.ai/en/latest/getting_started/installation.html) and
-that we only use GPUs here for demonstration purposes. Note that if you run into issues, you can include the environment variable `--env VLLM_DEBUG_LOG_API_SERVER_RESPONSE=true` (available in vLLM v0.8.3 and above) in the `docker run` command to enable log response from API server for debugging.
-
-### Setting up vLLM server on AMD GPU
-
-AMD provides two main vLLM container options:
-- rocm/vllm: Production-ready container
-- rocm/vllm-dev: Development container with the latest vLLM features
-
-Please check the [Blog about ROCm vLLM Usage](https://rocm.blogs.amd.com/software-tools-optimization/vllm-container/README.html) to get more details.
-
-Here is a sample script to start a ROCm vLLM server locally via Docker:
-
-```bash
-export INFERENCE_PORT=8000
-export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-export CUDA_VISIBLE_DEVICES=0
-export VLLM_DIMG="rocm/vllm-dev:main"
-
-docker run \
-    --pull always \
-    --ipc=host \
-    --privileged \
-    --shm-size 16g \
-    --device=/dev/kfd \
-    --device=/dev/dri \
-    --group-add video \
-    --cap-add=SYS_PTRACE \
-    --cap-add=CAP_SYS_ADMIN \
-    --security-opt seccomp=unconfined \
-    --security-opt apparmor=unconfined \
-    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
-    -p $INFERENCE_PORT:$INFERENCE_PORT \
-    -v ~/.cache/huggingface:/root/.cache/huggingface \
-    $VLLM_DIMG \
-    python -m vllm.entrypoints.openai.api_server \
-    --model $INFERENCE_MODEL \
-    --port $INFERENCE_PORT
-```
-
-Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html).
-
-If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
-
-```bash
-export SAFETY_PORT=8081
-export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-export CUDA_VISIBLE_DEVICES=1
-export VLLM_DIMG="rocm/vllm-dev:main"
-
-docker run \
-    --pull always \
-    --ipc=host \
-    --privileged \
-    --shm-size 16g \
-    --device=/dev/kfd \
-    --device=/dev/dri \
-    --group-add video \
-    --cap-add=SYS_PTRACE \
-    --cap-add=CAP_SYS_ADMIN \
-    --security-opt seccomp=unconfined \
-    --security-opt apparmor=unconfined \
-    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-    --env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
-    -p $SAFETY_PORT:$SAFETY_PORT \
-    -v ~/.cache/huggingface:/root/.cache/huggingface \
-    $VLLM_DIMG \
-    python -m vllm.entrypoints.openai.api_server \
-    --model $SAFETY_MODEL \
-    --port $SAFETY_PORT
-```
-
-### Setting up vLLM server on NVIDIA GPU
-
-Please check the [vLLM Documentation](https://docs.vllm.ai/en/v0.5.5/serving/deploying_with_docker.html) to get a vLLM endpoint. Here is a sample script to start a vLLM server locally via Docker:
-
-```bash
-export INFERENCE_PORT=8000
-export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-export CUDA_VISIBLE_DEVICES=0
-
-docker run \
-    --pull always \
-    --runtime nvidia \
-    --gpus $CUDA_VISIBLE_DEVICES \
-    -v ~/.cache/huggingface:/root/.cache/huggingface \
-    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-    -p $INFERENCE_PORT:$INFERENCE_PORT \
-    --ipc=host \
-    vllm/vllm-openai:latest \
-    --gpu-memory-utilization 0.7 \
-    --model $INFERENCE_MODEL \
-    --port $INFERENCE_PORT
-```
-
-Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html).
-
-If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
-
-```bash
-export SAFETY_PORT=8081
-export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-export CUDA_VISIBLE_DEVICES=1
-
-docker run \
-    --pull always \
-    --runtime nvidia \
-    --gpus $CUDA_VISIBLE_DEVICES \
-    -v ~/.cache/huggingface:/root/.cache/huggingface \
-    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-    -p $SAFETY_PORT:$SAFETY_PORT \
-    --ipc=host \
-    vllm/vllm-openai:latest \
-    --gpu-memory-utilization 0.7 \
-    --model $SAFETY_MODEL \
-    --port $SAFETY_PORT
-```
-
-### Setting up vLLM server on Intel GPU
-
-Refer to [vLLM Documentation for XPU](https://docs.vllm.ai/en/v0.8.2/getting_started/installation/gpu.html?device=xpu) to get a vLLM endpoint. In addition to vLLM side setup which guides towards installing vLLM from sources or self-building vLLM Docker container, Intel provides prebuilt vLLM container to use on systems with Intel GPUs supported by PyTorch XPU backend:
-- [intel/vllm](https://hub.docker.com/r/intel/vllm)
-
-Here is a sample script to start a vLLM server locally via Docker using Intel provided container:
-
-```bash
-export INFERENCE_PORT=8000
-export INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
-export ZE_AFFINITY_MASK=0
-
-docker run \
-    --pull always \
-    --device /dev/dri \
-    -v /dev/dri/by-path:/dev/dri/by-path \
-    -v ~/.cache/huggingface:/root/.cache/huggingface \
-    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-    --env ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK \
-    -p $INFERENCE_PORT:$INFERENCE_PORT \
-    --ipc=host \
-    intel/vllm:xpu \
-    --gpu-memory-utilization 0.7 \
-    --model $INFERENCE_MODEL \
-    --port $INFERENCE_PORT
-```
-
-If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
-
-```bash
-export SAFETY_PORT=8081
-export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-export ZE_AFFINITY_MASK=1
-
-docker run \
-    --pull always \
-    --device /dev/dri \
-    -v /dev/dri/by-path:/dev/dri/by-path \
-    -v ~/.cache/huggingface:/root/.cache/huggingface \
-    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
-    --env ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK \
-    -p $SAFETY_PORT:$SAFETY_PORT \
-    --ipc=host \
-    intel/vllm:xpu \
-    --gpu-memory-utilization 0.7 \
-    --model $SAFETY_MODEL \
-    --port $SAFETY_PORT
-```
-
-## Running Llama Stack
-
-Now you are ready to run Llama Stack with vLLM as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-export INFERENCE_PORT=8000
-export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-export LLAMA_STACK_PORT=8321
-
-# You need a local checkout of llama-stack to run this, get it using
-# git clone https://github.com/meta-llama/llama-stack.git
-cd /path/to/llama-stack
-
-docker run \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ./llama_stack/templates/remote-vllm/run.yaml:/root/my-run.yaml \
-  llamastack/distribution-remote-vllm \
-  --config /root/my-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env VLLM_URL=http://host.docker.internal:$INFERENCE_PORT/v1
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-export SAFETY_PORT=8081
-export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-
-# You need a local checkout of llama-stack to run this, get it using
-# git clone https://github.com/meta-llama/llama-stack.git
-cd /path/to/llama-stack
-
-docker run \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  -v ./llama_stack/templates/remote-vllm/run-with-safety.yaml:/root/my-run.yaml \
-  llamastack/distribution-remote-vllm \
-  --config /root/my-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env VLLM_URL=http://host.docker.internal:$INFERENCE_PORT/v1 \
-  --env SAFETY_MODEL=$SAFETY_MODEL \
-  --env SAFETY_VLLM_URL=http://host.docker.internal:$SAFETY_PORT/v1
-```
-
-
-### Via Conda
-
-Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
-
-```bash
-export INFERENCE_PORT=8000
-export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-export LLAMA_STACK_PORT=8321
-
-cd distributions/remote-vllm
-llama stack build --template remote-vllm --image-type conda
-
-llama stack run ./run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env VLLM_URL=http://localhost:$INFERENCE_PORT/v1
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-export SAFETY_PORT=8081
-export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-
-llama stack run ./run-with-safety.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env VLLM_URL=http://localhost:$INFERENCE_PORT/v1 \
-  --env SAFETY_MODEL=$SAFETY_MODEL \
-  --env SAFETY_VLLM_URL=http://localhost:$SAFETY_PORT/v1
-```
--- a/docs/source/distributions/self_hosted_distro/sambanova.md
+++ b/docs/source/distributions/self_hosted_distro/sambanova.md
@ -1,91 +0,0 @@
---
-orphan: true
---
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# SambaNova Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-sambanova` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| inference | `remote::sambanova`, `inline::sentence-transformers` |
-| safety | `remote::sambanova` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-- `SAMBANOVA_API_KEY`: SambaNova API Key (default: ``)
-
-### Models
-
-The following models are available by default:
-
-- `sambanova/Meta-Llama-3.1-8B-Instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
-- `sambanova/Meta-Llama-3.1-405B-Instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
-- `sambanova/Meta-Llama-3.2-1B-Instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)`
-- `sambanova/Meta-Llama-3.2-3B-Instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
-- `sambanova/Meta-Llama-3.3-70B-Instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
-- `sambanova/Llama-3.2-11B-Vision-Instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
-- `sambanova/Llama-3.2-90B-Vision-Instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
-- `sambanova/Llama-4-Scout-17B-16E-Instruct (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct)`
-- `sambanova/Llama-4-Maverick-17B-128E-Instruct (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct)`
-- `sambanova/Meta-Llama-Guard-3-8B (aliases: meta-llama/Llama-Guard-3-8B)`
-
-
-### Prerequisite: API Keys
-
-Make sure you have access to a SambaNova API Key. You can get one by visiting [SambaNova.ai](http://cloud.sambanova.ai?utm_source=llamastack&utm_medium=external&utm_campaign=cloud_signup).
-
-
-## Running Llama Stack with SambaNova
-
-You can do this via Conda (build code) or Docker which has a pre-built image.
-
-
-### Via Docker
-
-```bash
-LLAMA_STACK_PORT=8321
-llama stack build --template sambanova --image-type container
-docker run \
-  -it \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  distribution-sambanova \
-  --port $LLAMA_STACK_PORT \
-  --env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
-```
-
-
-### Via Venv
-
-```bash
-llama stack build --template sambanova --image-type venv
-llama stack run --image-type venv ~/.llama/distributions/sambanova/sambanova-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
-```
-
-
-### Via Conda
-
-```bash
-llama stack build --template sambanova --image-type conda
-llama stack run --image-type conda ~/.llama/distributions/sambanova/sambanova-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
-```
--- a/docs/source/distributions/self_hosted_distro/starter.md
+++ b/docs/source/distributions/self_hosted_distro/starter.md
@ -0,0 +1,259 @@
+---
+orphan: true
+---
+<!-- This file was auto-generated by distro_codegen.py, please edit source -->
+# Starter Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
+
+The `llamastack/distribution-starter` distribution is a comprehensive, multi-provider distribution that includes most of the available inference providers in Llama Stack. It's designed to be a one-stop solution for developers who want to experiment with different AI providers without having to configure each one individually.
+
+## Provider Composition
+
+The starter distribution consists of the following provider configurations:
+
+| API | Provider(s) |
+|-----|-------------|
+| agents | `inline::meta-reference` |
+| datasetio | `remote::huggingface`, `inline::localfs` |
+| eval | `inline::meta-reference` |
+| files | `inline::localfs` |
+| inference | `remote::openai`, `remote::fireworks`, `remote::together`, `remote::ollama`, `remote::anthropic`, `remote::gemini`, `remote::groq`, `remote::sambanova`, `remote::vllm`, `remote::tgi`, `remote::cerebras`, `remote::llama-openai-compat`, `remote::nvidia`, `remote::hf::serverless`, `remote::hf::endpoint`, `inline::sentence-transformers` |
+| safety | `inline::llama-guard` |
+| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
+| telemetry | `inline::meta-reference` |
+| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
+| vector_io | `inline::faiss`, `inline::sqlite-vec`, `remote::chromadb`, `remote::pgvector` |
+
+## Inference Providers
+
+The starter distribution includes a comprehensive set of inference providers:
+
+### Hosted Providers
+- **[OpenAI](https://openai.com/api/)**: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings - provider ID: `openai` - reference documentation: [openai](../../providers/inference/remote_openai.md)
+- **[Fireworks](https://fireworks.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings - provider ID: `fireworks` - reference documentation: [fireworks](../../providers/inference/remote_fireworks.md)
+- **[Together](https://together.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings - provider ID: `together` - reference documentation: [together](../../providers/inference/remote_together.md)
+- **[Anthropic](https://www.anthropic.com/)**: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings - provider ID: `anthropic` - reference documentation: [anthropic](../../providers/inference/remote_anthropic.md)
+- **[Gemini](https://gemini.google.com/)**: Gemini 1.5, 2.0, 2.5 models and text embeddings - provider ID: `gemini` - reference documentation: [gemini](../../providers/inference/remote_gemini.md)
+- **[Groq](https://groq.com/)**: Fast Llama models (3.1, 3.2, 3.3, 4 Scout, 4 Maverick) - provider ID: `groq` - reference documentation: [groq](../../providers/inference/remote_groq.md)
+- **[SambaNova](https://www.sambanova.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models - provider ID: `sambanova` - reference documentation: [sambanova](../../providers/inference/remote_sambanova.md)
+- **[Cerebras](https://www.cerebras.ai/)**: Cerebras AI models - provider ID: `cerebras` - reference documentation: [cerebras](../../providers/inference/remote_cerebras.md)
+- **[NVIDIA](https://www.nvidia.com/)**: NVIDIA NIM - provider ID: `nvidia` - reference documentation: [nvidia](../../providers/inference/remote_nvidia.md)
+- **[HuggingFace](https://huggingface.co/)**: Serverless and endpoint models - provider IDs: `hf::serverless` and `hf::endpoint` - reference documentation: [huggingface-serverless](../../providers/inference/remote_hf_serverless.md) and [huggingface-endpoint](../../providers/inference/remote_hf_endpoint.md)
+- **[Bedrock](https://aws.amazon.com/bedrock/)**: AWS Bedrock models - provider ID: `bedrock` - reference documentation: [bedrock](../../providers/inference/remote_bedrock.md)
+
+### Local/Remote Providers
+- **[Ollama](https://ollama.ai/)**: Local Ollama models - provider ID: `ollama` - reference documentation: [ollama](../../providers/inference/remote_ollama.md)
+- **[vLLM](https://docs.vllm.ai/en/latest/)**: Local or remote vLLM server - provider ID: `vllm` - reference documentation: [vllm](../../providers/inference/remote_vllm.md)
+- **[TGI](https://github.com/huggingface/text-generation-inference)**: Text Generation Inference server, including Dell Enterprise Hub's custom TGI container (use `DEH_URL`) - provider ID: `tgi` - reference documentation: [tgi](../../providers/inference/remote_tgi.md)
+- **[Sentence Transformers](https://www.sbert.net/)**: Local embedding models - provider ID: `sentence-transformers` - reference documentation: [sentence-transformers](../../providers/inference/inline_sentence-transformers.md)
+
+All inference providers are disabled by default. Enable the ones you need by setting the corresponding environment variables.
+
+## Vector IO
+
+The starter distribution includes a comprehensive set of vector IO providers:
+
+- **[FAISS](https://github.com/facebookresearch/faiss)**: Local FAISS vector store - enabled by default - provider ID: `faiss`
+- **[SQLite](https://www.sqlite.org/index.html)**: Local SQLite vector store - disabled by default - provider ID: `sqlite-vec`
+- **[ChromaDB](https://www.trychroma.com/)**: Remote ChromaDB vector store - disabled by default - provider ID: `chromadb`
+- **[PGVector](https://github.com/pgvector/pgvector)**: PostgreSQL vector store - disabled by default - provider ID: `pgvector`
+- **[Milvus](https://milvus.io/)**: Milvus vector store - disabled by default - provider ID: `milvus`
+
+## Environment Variables
+
+The following environment variables can be configured:
+
+### Server Configuration
+- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
+
+### API Keys for Hosted Providers
+- `OPENAI_API_KEY`: OpenAI API key
+- `FIREWORKS_API_KEY`: Fireworks API key
+- `TOGETHER_API_KEY`: Together API key
+- `ANTHROPIC_API_KEY`: Anthropic API key
+- `GEMINI_API_KEY`: Google Gemini API key
+- `GROQ_API_KEY`: Groq API key
+- `SAMBANOVA_API_KEY`: SambaNova API key
+- `CEREBRAS_API_KEY`: Cerebras API key
+- `LLAMA_API_KEY`: Llama API key
+- `NVIDIA_API_KEY`: NVIDIA API key
+- `HF_API_TOKEN`: HuggingFace API token
+
+### Local Provider Configuration
+- `OLLAMA_URL`: Ollama server URL (default: `http://localhost:11434`)
+- `VLLM_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
+- `VLLM_MAX_TOKENS`: vLLM max tokens (default: `4096`)
+- `VLLM_API_TOKEN`: vLLM API token (default: `fake`)
+- `VLLM_TLS_VERIFY`: vLLM TLS verification (default: `true`)
+- `TGI_URL`: TGI server URL
+
+### Model Configuration
+- `INFERENCE_MODEL`: HuggingFace model for serverless inference
+- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name
+- `OLLAMA_INFERENCE_MODEL`: Ollama model name
+- `OLLAMA_EMBEDDING_MODEL`: Ollama embedding model name
+- `OLLAMA_EMBEDDING_DIMENSION`: Ollama embedding dimension (default: `384`)
+- `VLLM_INFERENCE_MODEL`: vLLM model name
+
+### Vector Database Configuration
+- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
+- `ENABLE_SQLITE_VEC`: Enable SQLite vector provider
+- `ENABLE_CHROMADB`: Enable ChromaDB provider
+- `ENABLE_PGVECTOR`: Enable PGVector provider
+- `CHROMADB_URL`: ChromaDB server URL
+- `PGVECTOR_HOST`: PGVector host (default: `localhost`)
+- `PGVECTOR_PORT`: PGVector port (default: `5432`)
+- `PGVECTOR_DB`: PGVector database name
+- `PGVECTOR_USER`: PGVector username
+- `PGVECTOR_PASSWORD`: PGVector password
+
+### Tool Configuration
+- `BRAVE_SEARCH_API_KEY`: Brave Search API key
+- `TAVILY_SEARCH_API_KEY`: Tavily Search API key
+
+### Telemetry Configuration
+- `OTEL_SERVICE_NAME`: OpenTelemetry service name
+- `TELEMETRY_SINKS`: Telemetry sinks (default: `console,sqlite`)
+
+## Enabling Providers
+
+You can enable a specific provider by setting its environment variable to the provider ID. This is useful when you only want to use certain providers or do not have API keys for all of them.
+
+### Examples of Enabling Providers
+
+#### Enable FAISS Vector Provider
+```bash
+export ENABLE_FAISS=faiss
+```
+
+#### Enable Ollama Models
+```bash
+export ENABLE_OLLAMA=ollama
+```
+
+#### Disable vLLM Models
+```bash
+export VLLM_INFERENCE_MODEL=__disabled__
+```
+
+#### Disable Optional Vector Providers
+```bash
+export ENABLE_SQLITE_VEC=__disabled__
+export ENABLE_CHROMADB=__disabled__
+export ENABLE_PGVECTOR=__disabled__
+```
+
+### Provider ID Patterns
+
+The starter distribution uses several patterns for provider IDs:
+
+1. **Direct provider IDs**: `faiss`, `ollama`, `vllm`
+2. **Environment-based provider IDs**: `${env.ENABLE_SQLITE_VEC+sqlite-vec}`
+3. **Model-based provider IDs**: `${env.OLLAMA_INFERENCE_MODEL:__disabled__}`
+
+When using the `+` pattern (like `${env.ENABLE_SQLITE_VEC+sqlite-vec}`), the provider is enabled by default and can be disabled by setting the environment variable to `__disabled__`.
+
+When using the `:` pattern (like `${env.OLLAMA_INFERENCE_MODEL:__disabled__}`), the provider is disabled by default and can be enabled by setting the environment variable to a valid value.
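
The `:` default pattern resembles a familiar shell idiom. As a rough analogy only (this is plain bash parameter expansion, not Llama Stack's own substitution code), the sketch below shows how a `__disabled__` default gives way once `OLLAMA_INFERENCE_MODEL` is set. The helper name `resolve_with_default` is hypothetical and exists only for this illustration.

```bash
# Illustration only: bash's ${VAR:-default} expansion stands in for the
# `${env.VAR:default}` pattern described above. It is an analogy, not the
# code Llama Stack uses to process run.yaml.
resolve_with_default() {
  # Falls back to the __disabled__ sentinel when the variable is unset or empty.
  printf '%s\n' "${OLLAMA_INFERENCE_MODEL:-__disabled__}"
}

unset OLLAMA_INFERENCE_MODEL
resolve_with_default    # prints: __disabled__

OLLAMA_INFERENCE_MODEL=llama3.2:3b
resolve_with_default    # prints: llama3.2:3b
```

The same mental model applies to the other `:`-style variables: leaving the variable unset keeps the sentinel, and any real value replaces it.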
+
+## Running the Distribution
+
+You can run the starter distribution via Docker or Conda.
+
+### Via Docker
+
+This method allows you to get started quickly without having to build the distribution code.
+
+```bash
+LLAMA_STACK_PORT=8321
+docker run \
+  -it \
+  --pull always \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  -e OPENAI_API_KEY=your_openai_key \
+  -e FIREWORKS_API_KEY=your_fireworks_key \
+  -e TOGETHER_API_KEY=your_together_key \
+  llamastack/distribution-starter \
+  --port $LLAMA_STACK_PORT
+```
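
The command above writes each API key on the command line. One possible alternative, sketched here under the assumption that the keys are already exported in your shell, is to forward only the variables that are set. The helper `docker_env_flags` is hypothetical; it relies on `docker run -e NAME` (with no value) copying the variable from the host environment.

```bash
# Hypothetical helper: emit "-e NAME" for each named variable that is set and
# non-empty in the current shell. The variables must be exported so that the
# docker CLI can read their values when it sees "-e NAME".
docker_env_flags() {
  local name
  for name in "$@"; do
    if [ -n "${!name:-}" ]; then
      printf -- '-e %s ' "$name"
    fi
  done
}

LLAMA_STACK_PORT=8321
# Print the command instead of running it, so it can be inspected first.
echo docker run -it --pull always \
  -p "$LLAMA_STACK_PORT:$LLAMA_STACK_PORT" \
  $(docker_env_flags OPENAI_API_KEY FIREWORKS_API_KEY TOGETHER_API_KEY) \
  llamastack/distribution-starter \
  --port "$LLAMA_STACK_PORT"
```

Dropping the leading `echo` runs the container. Keys that are not set are simply omitted, which leaves the corresponding providers without credentials.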
+
+### Via Conda
+
+Make sure you have run `uv pip install llama-stack` and have the Llama Stack CLI available.
+
+```bash
+llama stack build --template starter --image-type conda
+llama stack run distributions/starter/run.yaml \
+  --port 8321 \
+  --env OPENAI_API_KEY=your_openai_key \
+  --env FIREWORKS_API_KEY=your_fireworks_key \
+  --env TOGETHER_API_KEY=your_together_key
+```
+
+## Example Usage
+
+Once the distribution is running, you can use any of the available models. Here are some examples:
+
+### Using OpenAI Models
+```bash
+llama-stack-client --endpoint http://localhost:8321 \
+inference chat-completion \
+--model-id openai/gpt-4o \
+--message "Hello, how are you?"
+```
+
+### Using Fireworks Models
+```bash
+llama-stack-client --endpoint http://localhost:8321 \
+inference chat-completion \
+--model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
+--message "Write a short story about a robot."
+```
+
+### Using Local Ollama Models
+```bash
+# First, make sure Ollama is running and you have a model
+ollama run llama3.2:3b
+
+# Then use it through Llama Stack
+export OLLAMA_INFERENCE_MODEL=llama3.2:3b
+llama-stack-client --endpoint http://localhost:8321 \
+inference chat-completion \
+--model-id ollama/llama3.2:3b \
+--message "Explain quantum computing in simple terms."
+```
+
+## Storage
+
+The starter distribution uses SQLite for local storage of various components:
+
+- **Metadata store**: `~/.llama/distributions/starter/registry.db`
+- **Inference store**: `~/.llama/distributions/starter/inference_store.db`
+- **FAISS store**: `~/.llama/distributions/starter/faiss_store.db`
+- **SQLite vector store**: `~/.llama/distributions/starter/sqlite_vec.db`
+- **Files metadata**: `~/.llama/distributions/starter/files_metadata.db`
+- **Agents store**: `~/.llama/distributions/starter/agents_store.db`
+- **Responses store**: `~/.llama/distributions/starter/responses_store.db`
+- **Trace store**: `~/.llama/distributions/starter/trace_store.db`
+- **Evaluation store**: `~/.llama/distributions/starter/meta_reference_eval.db`
+- **Dataset I/O stores**: Various HuggingFace and local filesystem stores
+
+## Benefits of the Starter Distribution
+
+1. **Comprehensive Coverage**: Includes most popular AI providers in one distribution
+2. **Flexible Configuration**: Easy to enable/disable providers based on your needs
+3. **No Local GPU Required**: Most providers are cloud-based, making it accessible to developers without high-end hardware
+4. **Easy Migration**: Start with hosted providers and gradually move to local ones as needed
+5. **Production Ready**: Includes safety, evaluation, and telemetry components
+6. **Tool Integration**: Comes with web search, RAG, and model context protocol tools
+
+The starter distribution is ideal for developers who want to experiment with different AI providers, build prototypes quickly, or create applications that can work with multiple AI backends.
--- a/docs/source/distributions/self_hosted_distro/tgi.md
+++ b/docs/source/distributions/self_hosted_distro/tgi.md
@ -1,149 +0,0 @@
---
-orphan: true
---
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-
-# TGI Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `remote::tgi`, `inline::sentence-transformers` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-You can use this distribution if you have GPUs and want to run an independent TGI server container for running inference.
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
-- `TGI_URL`: URL of the TGI server with the main inference model (default: `http://127.0.0.1:8080/v1`)
-- `TGI_SAFETY_URL`: URL of the TGI server with the safety model (default: `http://127.0.0.1:8081/v1`)
-- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
-
-
-## Setting up TGI server
-
-Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint. Here is a sample script to start a TGI server locally via Docker:
-
-```bash
-export INFERENCE_PORT=8080
-export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
-export CUDA_VISIBLE_DEVICES=0
-
-docker run --rm -it \
-  --pull always \
-  -v $HOME/.cache/huggingface:/data \
-  -p $INFERENCE_PORT:$INFERENCE_PORT \
-  --gpus $CUDA_VISIBLE_DEVICES \
-  ghcr.io/huggingface/text-generation-inference:2.3.1 \
-  --dtype bfloat16 \
-  --usage-stats off \
-  --sharded false \
-  --cuda-memory-fraction 0.7 \
-  --model-id $INFERENCE_MODEL \
-  --port $INFERENCE_PORT
-```
-
-If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a TGI with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
-
-```bash
-export SAFETY_PORT=8081
-export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
-export CUDA_VISIBLE_DEVICES=1
-
-docker run --rm -it \
-  --pull always \
-  -v $HOME/.cache/huggingface:/data \
-  -p $SAFETY_PORT:$SAFETY_PORT \
-  --gpus $CUDA_VISIBLE_DEVICES \
-  ghcr.io/huggingface/text-generation-inference:2.3.1 \
-  --dtype bfloat16 \
-  --usage-stats off \
-  --sharded false \
-  --model-id $SAFETY_MODEL \
-  --port $SAFETY_PORT
-```
-
-## Running Llama Stack
-
-Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  llamastack/distribution-tgi \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env TGI_URL=http://host.docker.internal:$INFERENCE_PORT
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-# You need a local checkout of llama-stack to run this, get it using
-# git clone https://github.com/meta-llama/llama-stack.git
-cd /path/to/llama-stack
-
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  -v ~/.llama:/root/.llama \
-  -v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \
-  llamastack/distribution-tgi \
-  --config /root/my-run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env TGI_URL=http://host.docker.internal:$INFERENCE_PORT \
-  --env SAFETY_MODEL=$SAFETY_MODEL \
-  --env TGI_SAFETY_URL=http://host.docker.internal:$SAFETY_PORT
-```
-
-### Via Conda
-
-Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
-
-```bash
-llama stack build --template tgi --image-type conda
-llama stack run ./run.yaml
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env TGI_URL=http://127.0.0.1:$INFERENCE_PORT
-```
-
-If you are using Llama Stack Safety / Shield APIs, use:
-
-```bash
-llama stack run ./run-with-safety.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env INFERENCE_MODEL=$INFERENCE_MODEL \
-  --env TGI_URL=http://127.0.0.1:$INFERENCE_PORT \
-  --env SAFETY_MODEL=$SAFETY_MODEL \
-  --env TGI_SAFETY_URL=http://127.0.0.1:$SAFETY_PORT
-```
--- a/docs/source/distributions/self_hosted_distro/together.md
+++ b/docs/source/distributions/self_hosted_distro/together.md
@@ -1,86 +0,0 @@
----
-orphan: true
----
-<!-- This file was auto-generated by distro_codegen.py, please edit source -->
-# Together Distribution
-
-```{toctree}
-:maxdepth: 2
-:hidden:
-
-self
-```
-
-The `llamastack/distribution-together` distribution consists of the following provider configurations.
-
-| API | Provider(s) |
-|-----|-------------|
-| agents | `inline::meta-reference` |
-| datasetio | `remote::huggingface`, `inline::localfs` |
-| eval | `inline::meta-reference` |
-| inference | `remote::together`, `inline::sentence-transformers` |
-| safety | `inline::llama-guard` |
-| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
-| telemetry | `inline::meta-reference` |
-| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
-| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
-
-
-### Environment Variables
-
-The following environment variables can be configured:
-
-- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
-- `TOGETHER_API_KEY`: Together.AI API Key (default: ``)
-
-### Models
-
-The following models are available by default:
-
-- `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo (aliases: meta-llama/Llama-3.1-8B-Instruct)`
-- `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo (aliases: meta-llama/Llama-3.1-70B-Instruct)`
-- `meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
-- `meta-llama/Llama-3.2-3B-Instruct-Turbo (aliases: meta-llama/Llama-3.2-3B-Instruct)`
-- `meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
-- `meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
-- `meta-llama/Llama-3.3-70B-Instruct-Turbo (aliases: meta-llama/Llama-3.3-70B-Instruct)`
-- `meta-llama/Meta-Llama-Guard-3-8B (aliases: meta-llama/Llama-Guard-3-8B)`
-- `meta-llama/Llama-Guard-3-11B-Vision-Turbo (aliases: meta-llama/Llama-Guard-3-11B-Vision)`
-- `togethercomputer/m2-bert-80M-8k-retrieval `
-- `togethercomputer/m2-bert-80M-32k-retrieval `
-- `meta-llama/Llama-4-Scout-17B-16E-Instruct (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct, together/meta-llama/Llama-4-Scout-17B-16E-Instruct)`
-- `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct, together/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8)`
-
-
-### Prerequisite: API Keys
-
-Make sure you have access to a Together API Key. You can get one by visiting [together.xyz](https://together.xyz/).
-
-
-## Running Llama Stack with Together
-
-You can do this via Conda (build code) or Docker which has a pre-built image.
-
-### Via Docker
-
-This method allows you to get started quickly without having to build the distribution code.
-
-```bash
-LLAMA_STACK_PORT=8321
-docker run \
-  -it \
-  --pull always \
-  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-  llamastack/distribution-together \
-  --port $LLAMA_STACK_PORT \
-  --env TOGETHER_API_KEY=$TOGETHER_API_KEY
-```
-
-### Via Conda
-
-```bash
-llama stack build --template together --image-type conda
-llama stack run ./run.yaml \
-  --port $LLAMA_STACK_PORT \
-  --env TOGETHER_API_KEY=$TOGETHER_API_KEY
-```
--- a/docs/source/getting_started/detailed_tutorial.md
+++ b/docs/source/getting_started/detailed_tutorial.md
@@ -58,7 +58,7 @@ which defines the providers and their settings.
 Now let's build and run the Llama Stack config for Ollama.

 ```bash
-INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
+INFERENCE_MODEL=llama3.2:3b llama stack build --template starter --image-type venv --run
 ```
 :::
 :::{tab-item} Using `conda`
@@ -69,7 +69,7 @@ which defines the providers and their settings.
 Now let's build and run the Llama Stack config for Ollama.

 ```bash
-INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type conda  --image-name llama3-3b-conda --run
+INFERENCE_MODEL=llama3.2:3b llama stack build --template starter --image-type conda  --image-name llama3-3b-conda --run
 ```
 :::
 :::{tab-item} Using a Container
--- a/docs/source/getting_started/index.md
+++ b/docs/source/getting_started/index.md
@@ -19,7 +19,7 @@ ollama run llama3.2:3b --keepalive 60m
 #### Step 2: Run the Llama Stack server
 We will use `uv` to run the Llama Stack server.
 ```bash
-INFERENCE_MODEL=llama3.2:3b uv run --with llama-stack llama stack build --template ollama --image-type venv --run
+INFERENCE_MODEL=llama3.2:3b uv run --with llama-stack llama stack build --template starter --image-type venv --run
 ```
 #### Step 3: Run the demo
 Now open up a new terminal and copy the following script into a file named `demo_script.py`.
--- a/docs/source/providers/inference/remote_ollama.md
+++ b/docs/source/providers/inference/remote_ollama.md
@@ -9,13 +9,11 @@ Ollama inference provider for running local models through the Ollama runtime.
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `url` | `<class 'str'>` | No | http://localhost:11434 |  |
-| `raise_on_connect_error` | `<class 'bool'>` | No | True |  |

 ## Sample Configuration

 ```yaml
 url: ${env.OLLAMA_URL:=http://localhost:11434}
-raise_on_connect_error: true

 ```

--- a/docs/source/providers/inference/remote_runpod.md
+++ b/docs/source/providers/inference/remote_runpod.md
@@ -15,7 +15,7 @@ RunPod inference provider for running models on RunPod's cloud GPU platform.

 ```yaml
 url: ${env.RUNPOD_URL:=}
-api_token: ${env.RUNPOD_API_TOKEN:=}
+api_token: ${env.RUNPOD_API_TOKEN}

 ```

--- a/docs/source/providers/inference/remote_together.md
+++ b/docs/source/providers/inference/remote_together.md
@@ -15,7 +15,7 @@ Together AI inference provider for open-source models and collaborative AI devel

 ```yaml
 url: https://api.together.xyz/v1
-api_key: ${env.TOGETHER_API_KEY:=}
+api_key: ${env.TOGETHER_API_KEY}

 ```
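
Reviewer note on the `${env.RUNPOD_API_TOKEN}` and `${env.TOGETHER_API_KEY}` changes: dropping the `:=` suffix appears to turn an optional value with an empty default into a required one. The sketch below illustrates that reading only. It is not Llama Stack's actual resolver; the `expand` helper, its regex, and the choice of `KeyError` are assumptions made for this example.

```python
import re

# Hypothetical pattern for this sketch: matches ${env.NAME} and ${env.NAME:=default}.
_ENV_REF = re.compile(r"\$\{env\.([A-Z0-9_]+)(?::=([^}]*))?\}")


def expand(text, env):
    """Substitute ${env.NAME[:=default]} references using the mapping `env`."""

    def _sub(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            # The ":=" form was used; the default may be an empty string.
            return default
        # Bare ${env.NAME}: assumed here to be mandatory.
        raise KeyError(f"environment variable {name} is required but not set")

    return _ENV_REF.sub(_sub, text)


# Old form: resolves to an empty value when the variable is unset.
old_line = expand("api_key: ${env.TOGETHER_API_KEY:=}", {})

# New form: resolves only when the variable is provided.
new_line = expand("api_key: ${env.TOGETHER_API_KEY}", {"TOGETHER_API_KEY": "example-key"})
```

Under this assumption, a run config that references `${env.TOGETHER_API_KEY}` fails fast when the key is missing instead of silently configuring the provider with an empty credential.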

--- a/docs/source/providers/post_training/huggingface.md
+++ b/docs/source/providers/post_training/huggingface.md
@@ -23,7 +23,7 @@ To use the HF SFTTrainer in your Llama Stack project, follow these steps:
 You can access the HuggingFace trainer via the `ollama` distribution:

 ```bash
-llama stack build --template ollama --image-type venv
+llama stack build --template starter --image-type venv
 llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
 ```