Merge branch 'main' into snowflake-llama-stack

2025-08-06 10:42:39 +00:00 · 2025-01-24 09:37:57 -05:00 · 35cbed4b6a
commit 35cbed4b6a
parent bbf4dbd9ff 2118f37350
737 changed files with 68180 additions and 21520 deletions
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -2,4 +2,4 @@

 # These owners will be the default owners for everything in
 # the repo. Unless a later match takes precedence,
-* @ashwinb @yanxi0830 @hardikjshah @dltn @raghotham
+* @ashwinb @yanxi0830 @hardikjshah @dltn @raghotham @dineshyv @vladimirivic @sixianyi0721
--- a/.github/ISSUE_TEMPLATE/feature-request.yml
+++ b/.github/ISSUE_TEMPLATE/feature-request.yml
@@ -1,31 +1,28 @@
 name: 🚀 Feature request
-description: Submit a proposal/request for a new llama-stack feature
+description: Request a new llama-stack feature

 body:
 - type: textarea
  id: feature-pitch
  attributes:
-    label: 🚀 The feature, motivation and pitch
+    label: 🚀 Describe the new functionality needed
    description: >
-      A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*. If this is related to another GitHub issue, please link here too.
+      A clear and concise description of _what_ needs to be built.
  validations:
    required: true

 - type: textarea
-  id: alternatives
+  id: feature-motivation
  attributes:
-    label: Alternatives
+    label: 💡 Why is this needed? What if we don't build it?
    description: >
-      A description of any alternative solutions or features you've considered, if any.
+      A clear and concise description of _why_ this functionality is needed.
+  validations:
+    required: true

 - type: textarea
-  id: additional-context
+  id: other-thoughts
  attributes:
-    label: Additional context
+    label: Other thoughts
    description: >
-      Add any other context or screenshots about the feature request.
-
-- type: markdown
-  attributes:
-    value: >
-      Thanks for contributing 🎉!
+      Any thoughts about how this may result in complexity in the codebase, or other trade-offs.
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -1,17 +1,15 @@
 # What does this PR do?

-Closes # (issue)
+In short, provide a summary of what this PR does and why. Usually, the relevant context should be present in a linked issue.

-## Feature/Issue validation/testing/test plan
+- [ ] Addresses issue (#issue)

-Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
-Please also list any relevant details for your test configuration or test plan.

-- [ ] Test A
-Logs for Test A
+## Test Plan

-- [ ] Test B
-Logs for Test B
+Please describe:
+ - the tests you ran to verify your changes, with result summaries.
+ - instructions so the results can be reproduced.


 ## Sources
@@ -20,12 +18,10 @@ Please link relevant resources if necessary.


 ## Before submitting
-- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
-- [ ] Did you read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
-      Pull Request section?
-- [ ] Was this discussed/approved via a Github issue? Please add a link
-      to it if that's the case.
-- [ ] Did you make sure to update the documentation with your changes?
-- [ ] Did you write any new necessary tests?

-Thanks for contributing 🎉!
+- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
+- [ ] Ran pre-commit to handle lint / formatting issues.
+- [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
+      Pull Request section.
+- [ ] Updated relevant documentation.
+- [ ] Wrote necessary unit or integration tests.
--- a/.github/workflows/gha_workflow_llama_stack_tests.yml
+++ b/.github/workflows/gha_workflow_llama_stack_tests.yml
@@ -0,0 +1,355 @@
+name: "Run Llama-stack Tests"
+
+on:
+  #### Temporarily disable PR runs until tests run as intended within mainline.
+  #TODO Add this back.
+  #pull_request_target:
+  #  types: ["opened"]
+  #  branches:
+  #    - 'main'
+  #  paths:
+  #    - 'llama_stack/**/*.py'
+  #    - 'tests/**/*.py'
+
+  workflow_dispatch:
+    inputs:
+      runner:
+        description: 'GHA Runner Scale Set label to run workflow on.'
+        required: true
+        default: "llama-stack-gha-runner-gpu"
+
+      checkout_reference:
+        description: "The branch, tag, or SHA to checkout"
+        required: true
+        default: "main"
+
+      debug:
+        description: 'Run debugging steps?'
+        required: false
+        default: "true"
+
+      sleep_time:
+        description: '[DEBUG] sleep time for debugging'
+        required: true
+        default: "0"
+
+      provider_id:
+        description: 'ID of your provider'
+        required: true
+        default: "meta_reference"
+
+      model_id:
+        description: 'Shorthand name for target model ID (llama_3b or llama_8b)'
+        required: true
+        default: "llama_3b"
+
+      model_override_3b:
+        description: 'Specify shorthand model for <llama_3b> '
+        required: false
+        default: "Llama3.2-3B-Instruct"
+
+      model_override_8b:
+        description: 'Specify shorthand model for <llama_8b> '
+        required: false
+        default: "Llama3.1-8B-Instruct"
+
+env:
+  # ID used for each test's provider config
+  PROVIDER_ID: "${{ inputs.provider_id || 'meta_reference' }}"
+
+  # Path to model checkpoints within EFS volume
+  MODEL_CHECKPOINT_DIR: "/data/llama"
+
+  # Path to directory to run tests from
+  TESTS_PATH: "${{ github.workspace }}/llama_stack/providers/tests"
+
+  # Keep track of a list of model IDs that are valid to use within pytest fixture marks
+  AVAILABLE_MODEL_IDs: "llama_3b llama_8b"
+
+  # Shorthand name for model ID, used in pytest fixture marks
+  MODEL_ID: "${{ inputs.model_id || 'llama_3b' }}"
+
+  # Override the `llama_3b` / `llama_8b` models, else use the default.
+  LLAMA_3B_OVERRIDE: "${{ inputs.model_override_3b || 'Llama3.2-3B-Instruct' }}"
+  LLAMA_8B_OVERRIDE: "${{ inputs.model_override_8b || 'Llama3.1-8B-Instruct' }}"
+
+  # Defines which directories in TESTS_PATH to exclude from the test loop
+  EXCLUDED_DIRS: "__pycache__"
+
+  # Defines the output xml reports generated after a test is run
+  REPORTS_GEN: ""
+
+jobs:
+  execute_workflow:
+    name: Execute workload on Self-Hosted GPU k8s runner
+    permissions:
+      pull-requests: write
+    defaults:
+      run:
+        shell: bash
+    runs-on: ${{ inputs.runner != '' && inputs.runner || 'llama-stack-gha-runner-gpu' }}
+    if: always()
+    steps:
+
+      ##############################
+      #### INITIAL DEBUG CHECKS ####
+      ##############################
+      - name: "[DEBUG] Check content of the EFS mount"
+        id: debug_efs_volume
+        continue-on-error: true
+        if: inputs.debug == 'true'
+        run: |
+            echo "========= Content of the EFS mount ============="
+            ls -la ${{ env.MODEL_CHECKPOINT_DIR }}
+
+      - name: "[DEBUG] Get runner container OS information"
+        id: debug_os_info
+        if: ${{ inputs.debug == 'true' }}
+        run: |
+            cat /etc/os-release
+
+      - name: "[DEBUG] Print environment variables"
+        id: debug_env_vars
+        if: ${{ inputs.debug == 'true' }}
+        run: |
+            echo "PROVIDER_ID = ${PROVIDER_ID}"
+            echo "MODEL_CHECKPOINT_DIR = ${MODEL_CHECKPOINT_DIR}"
+            echo "AVAILABLE_MODEL_IDs = ${AVAILABLE_MODEL_IDs}"
+            echo "MODEL_ID = ${MODEL_ID}"
+            echo "LLAMA_3B_OVERRIDE = ${LLAMA_3B_OVERRIDE}"
+            echo "LLAMA_8B_OVERRIDE = ${LLAMA_8B_OVERRIDE}"
+            echo "EXCLUDED_DIRS = ${EXCLUDED_DIRS}"
+            echo "REPORTS_GEN = ${REPORTS_GEN}"
+
+      ############################
+      #### MODEL INPUT CHECKS ####
+      ############################
+
+      - name: "Check if env.model_id is valid"
+        id: check_model_id
+        run: |
+          if [[ " ${AVAILABLE_MODEL_IDs[@]} " =~ " ${MODEL_ID} " ]]; then
+            echo "Model ID '${MODEL_ID}' is valid."
+          else
+            echo "Model ID '${MODEL_ID}' is invalid. Terminating workflow."
+            exit 1
+          fi
+
+      #######################
+      #### CODE CHECKOUT ####
+      #######################
+      - name: "Checkout 'meta-llama/llama-stack' repository"
+        id: checkout_repo
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.checkout_reference }}
+
+      - name: "[DEBUG] Content of the repository after checkout"
+        id: debug_content_after_checkout
+        if: ${{ inputs.debug == 'true' }}
+        run: |
+            ls -la ${GITHUB_WORKSPACE}
+
+      ##########################################################
+      ####              OPTIONAL SLEEP DEBUG                ####
+      #                                                        #
+      # Use to "exec" into the test k8s POD and run tests      #
+      # manually to identify what dependencies are being used. #
+      #                                                        #
+      ##########################################################
+      - name: "[DEBUG] sleep"
+        id: debug_sleep
+        if: ${{ inputs.debug == 'true' && inputs.sleep_time != '' }}
+        run: |
+            sleep ${{ inputs.sleep_time }}
+
+      ############################
+      #### UPDATE SYSTEM PATH ####
+      ############################
+      - name: "Update path: execute"
+        id: path_update_exec
+        run: |
+          # .local/bin is needed for certain libraries installed below to be recognized
+          # when calling their executable to install sub-dependencies
+          mkdir -p ${HOME}/.local/bin
+          echo "${HOME}/.local/bin" >> "$GITHUB_PATH"
+
+      #####################################
+      #### UPDATE CHECKPOINT DIRECTORY ####
+      #####################################
+      - name: "Update checkpoint directory"
+        id: checkpoint_update
+        run: |
+          echo "Checkpoint directory: ${MODEL_CHECKPOINT_DIR}/$LLAMA_3B_OVERRIDE"
+          if [ "${MODEL_ID}" = "llama_3b" ] && [ -d "${MODEL_CHECKPOINT_DIR}/${LLAMA_3B_OVERRIDE}" ]; then
+            echo "MODEL_CHECKPOINT_DIR=${MODEL_CHECKPOINT_DIR}/${LLAMA_3B_OVERRIDE}" >> "$GITHUB_ENV"
+          elif [ "${MODEL_ID}" = "llama_8b" ] && [ -d "${MODEL_CHECKPOINT_DIR}/${LLAMA_8B_OVERRIDE}" ]; then
+            echo "MODEL_CHECKPOINT_DIR=${MODEL_CHECKPOINT_DIR}/${LLAMA_8B_OVERRIDE}" >> "$GITHUB_ENV"
+          else
+            echo "MODEL_ID & LLAMA_*B_OVERRIDE are not a valid pairing. Terminating workflow."
+            exit 1
+          fi
+
+      - name: "[DEBUG] Checkpoint update check"
+        id: debug_checkpoint_update
+        if: ${{ inputs.debug == 'true' }}
+        run: |
+          echo "MODEL_CHECKPOINT_DIR (after update) = ${MODEL_CHECKPOINT_DIR}"
+
+      ##################################
+      #### DEPENDENCY INSTALLATIONS ####
+      ##################################
+      - name: "Installing 'apt' required packages"
+        id: install_apt
+        run: |
+          echo "[STEP] Installing 'apt' required packages"
+          sudo apt update -y
+          sudo apt install -y python3 python3-pip npm wget
+
+      - name: "Installing packages with 'curl'"
+        id: install_curl
+        run: |
+          curl -fsSL https://ollama.com/install.sh | sh
+
+      - name: "Installing packages with 'wget'"
+        id: install_wget
+        run: |
+          wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
+          chmod +x Miniconda3-latest-Linux-x86_64.sh
+          ./Miniconda3-latest-Linux-x86_64.sh -b
+          # Add miniconda3 bin to system path
+          echo "${HOME}/miniconda3/bin" >> "$GITHUB_PATH"
+
+      - name: "Installing packages with 'npm'"
+        id: install_npm_generic
+        run: |
+          sudo npm install -g junit-merge
+
+      - name: "Installing pip dependencies"
+        id: install_pip_generic
+        run: |
+          echo "[STEP] Installing 'llama-stack' models"
+          pip install -U pip setuptools
+          pip install -r requirements.txt
+          pip install -e .
+          pip install -U \
+            torch torchvision \
+            pytest pytest_asyncio \
+            fairscale lm-format-enforcer \
+            zmq chardet pypdf \
+            pandas sentence_transformers together \
+            aiosqlite
+      - name: "Installing packages with conda"
+        id: install_conda_generic
+        run: |
+          conda install -q -y -c pytorch -c nvidia faiss-gpu=1.9.0
+
+      #############################################################
+      #### TESTING TO BE DONE FOR BOTH PRS AND MANUAL DISPATCH ####
+      #############################################################
+      - name: "Run Tests: Loop"
+        id: run_tests_loop
+        working-directory: "${{ github.workspace }}"
+        run: |
+          pattern=""
+          for dir in llama_stack/providers/tests/*; do
+            if [ -d "$dir" ]; then
+              dir_name=$(basename "$dir")
+              if [[ ! " $EXCLUDED_DIRS " =~ " $dir_name " ]]; then
+                for file in "$dir"/test_*.py; do
+                  test_name=$(basename "$file")
+                  new_file="result-${dir_name}-${test_name}.xml"
+                  if torchrun $(which pytest) -s -v ${TESTS_PATH}/${dir_name}/${test_name} -m "${PROVIDER_ID} and ${MODEL_ID}" \
+                     --junitxml="${{ github.workspace }}/${new_file}"; then
+                    echo "Ran test: ${test_name}"
+                  else
+                    echo "Did NOT run test: ${test_name}"
+                  fi
+                  pattern+="${new_file} "
+                done
+              fi
+            fi
+          done
+          echo "REPORTS_GEN=$pattern" >> "$GITHUB_ENV"
+
+      - name: "Test Summary: Merge"
+        id: test_summary_merge
+        working-directory: "${{ github.workspace }}"
+        run: |
+          echo "Merging the following test result files: ${REPORTS_GEN}"
+          # Defaults to merging them into 'merged-test-results.xml'
+          junit-merge ${{ env.REPORTS_GEN }}
+
+      ############################################
+      #### AUTOMATIC TESTING ON PULL REQUESTS ####
+      ############################################
+
+      #### Run tests ####
+
+      - name: "PR - Run Tests"
+        id: pr_run_tests
+        working-directory: "${{ github.workspace }}"
+        if: github.event_name == 'pull_request_target'
+        run: |
+          echo "[STEP] Running PyTest tests at 'GITHUB_WORKSPACE' path: ${GITHUB_WORKSPACE} | path: ${{ github.workspace }}"
+          # (Optional) Add more tests here.
+
+          # Merge test results with 'merged-test-results.xml' from above.
+          # junit-merge <new-test-results> merged-test-results.xml
+
+      #### Create test summary ####
+
+      - name: "PR - Test Summary"
+        id: pr_test_summary_create
+        if: github.event_name == 'pull_request_target'
+        uses: test-summary/action@v2
+        with:
+          paths: "${{ github.workspace }}/merged-test-results.xml"
+          output: test-summary.md
+
+      - name: "PR - Upload Test Summary"
+        id: pr_test_summary_upload
+        if: github.event_name == 'pull_request_target'
+        uses: actions/upload-artifact@v4
+        with:
+          name: test-summary
+          path: test-summary.md
+
+      #### Update PR request ####
+
+      - name: "PR - Update comment"
+        id: pr_update_comment
+        if: github.event_name == 'pull_request_target'
+        uses: thollander/actions-comment-pull-request@v2
+        with:
+          filePath: test-summary.md
+
+      ########################
+      #### MANUAL TESTING ####
+      ########################
+
+      #### Run tests ####
+
+      - name: "Manual - Run Tests: Prep"
+        id: manual_run_tests
+        working-directory: "${{ github.workspace }}"
+        if: github.event_name == 'workflow_dispatch'
+        run: |
+          echo "[STEP] Running PyTest tests at 'GITHUB_WORKSPACE' path: ${{ github.workspace }}"
+
+          #TODO Use this when collection errors are resolved
+          # pytest -s -v -m "${PROVIDER_ID} and ${MODEL_ID}" --junitxml="${{ github.workspace }}/merged-test-results.xml"
+
+          # (Optional) Add more tests here.
+
+          # Merge test results with 'merged-test-results.xml' from above.
+          # junit-merge <new-test-results> merged-test-results.xml
+
+      #### Create test summary ####
+
+      - name: "Manual - Test Summary"
+        id: manual_test_summary
+        if: always() && github.event_name == 'workflow_dispatch'
+        uses: test-summary/action@v2
+        with:
+          paths: "${{ github.workspace }}/merged-test-results.xml"
--- a/.github/workflows/publish-to-docker.yml
+++ b/.github/workflows/publish-to-docker.yml
@@ -0,0 +1,138 @@
+name: Docker Build and Publish
+
+on:
+  workflow_dispatch:
+    inputs:
+      version:
+        description: 'TestPyPI or PyPI version to build (e.g., 0.0.63.dev20250114)'
+        required: true
+        type: string
+
+jobs:
+  build-and-push:
+    runs-on: ubuntu-latest
+    env:
+      TOGETHER_API_KEY: ${{ secrets.TOGETHER_API_KEY }}
+      FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
+      TAVILY_SEARCH_API_KEY: ${{ secrets.TAVILY_SEARCH_API_KEY }}
+    permissions:
+      contents: read
+      packages: write
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Log in to the Container registry
+        uses: docker/login-action@v3
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+
+      - name: Set version
+        id: version
+        run: |
+          if [ "${{ github.event_name }}" = "push" ]; then
+            echo "VERSION=0.0.63.dev51206766" >> $GITHUB_OUTPUT
+          else
+            echo "VERSION=${{ inputs.version }}" >> $GITHUB_OUTPUT
+          fi
+
+      - name: Check package version availability
+        run: |
+            # Function to check if version exists in a repository
+            check_version() {
+                local repo=$1
+                local status_code=$(curl -s -o /dev/null -w "%{http_code}" "https://$repo.org/project/llama-stack/${{ steps.version.outputs.version }}")
+                [ "$status_code" -eq 200 ]
+            }
+
+            # Check TestPyPI first, then PyPI
+            if check_version "test.pypi"; then
+                echo "Version ${{ steps.version.outputs.version }} found in TestPyPI"
+                echo "PYPI_SOURCE=testpypi" >> $GITHUB_ENV
+            elif check_version "pypi"; then
+                echo "Version ${{ steps.version.outputs.version }} found in PyPI"
+                echo "PYPI_SOURCE=pypi" >> $GITHUB_ENV
+            else
+                echo "Error: Version ${{ steps.version.outputs.version }} not found in either TestPyPI or PyPI"
+                exit 1
+            fi
+
+      - name: Install llama-stack
+        run: |
+            if [ "${{ github.event_name }}" = "push" ]; then
+                pip install -e .
+            else
+                if [ "$PYPI_SOURCE" = "testpypi" ]; then
+                    pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple llama-stack==${{ steps.version.outputs.version }}
+                else
+                    pip install llama-stack==${{ steps.version.outputs.version }}
+                fi
+            fi
+
+      - name: Build docker image
+        run: |
+          TEMPLATES=("ollama" "bedrock" "remote-vllm" "fireworks" "together" "tgi" "meta-reference-gpu")
+          for template in "${TEMPLATES[@]}"; do
+            if [ "$PYPI_SOURCE" = "testpypi" ]; then
+                TEST_PYPI_VERSION=${{ steps.version.outputs.version }} llama stack build --template $template --image-type container
+            else
+                PYPI_VERSION=${{ steps.version.outputs.version }} llama stack build --template $template --image-type container
+            fi
+          done
+
+      - name: List docker images
+        run: |
+          docker images
+
+      # TODO (xiyan): make the following 2 steps into a matrix and test all templates other than fireworks
+      - name: Start up built docker image
+        run: |
+          cd distributions/fireworks
+          if [ "$PYPI_SOURCE" = "testpypi" ]; then
+            sed -i 's|image: llamastack/distribution-fireworks|image: distribution-fireworks:test-${{ steps.version.outputs.version }}|' ./compose.yaml
+          else
+            sed -i 's|image: llamastack/distribution-fireworks|image: distribution-fireworks:${{ steps.version.outputs.version }}|' ./compose.yaml
+          fi
+          docker compose up -d
+          cd ..
+          # Wait for the container to start
+          timeout=300
+          while ! curl -s -f http://localhost:8321/v1/version > /dev/null && [ $timeout -gt 0 ]; do
+            echo "Waiting for endpoint to be available..."
+            sleep 5
+            timeout=$((timeout - 5))
+          done
+
+          if [ $timeout -le 0 ]; then
+            echo "Timeout waiting for endpoint to become available"
+            exit 1
+          fi
+
+      - name: Run simple models list test on docker server
+        run: |
+          curl http://localhost:8321/v1/models
+
+      # TODO (xiyan): figure out why client cannot find server but curl works
+      # - name: Run pytest on docker server
+      #   run: |
+      #     pip install pytest pytest-md-report
+      #     export LLAMA_STACK_BASE_URL="http://localhost:8321"
+      #     LLAMA_STACK_BASE_URL="http://localhost:8321" pytest -v tests/client-sdk/inference/test_inference.py --md-report --md-report-verbose=1
+
+      - name: Push to dockerhub
+        run: |
+          TEMPLATES=("ollama" "bedrock" "remote-vllm" "fireworks" "together" "tgi" "meta-reference-gpu")
+          for template in "${TEMPLATES[@]}"; do
+            if [ "$PYPI_SOURCE" = "testpypi" ]; then
+                docker tag distribution-$template:test-${{ steps.version.outputs.version }} llamastack/distribution-$template:test-${{ steps.version.outputs.version }}
+                docker push llamastack/distribution-$template:test-${{ steps.version.outputs.version }}
+            else
+                docker tag distribution-$template:${{ steps.version.outputs.version }} llamastack/distribution-$template:${{ steps.version.outputs.version }}
+                docker push llamastack/distribution-$template:${{ steps.version.outputs.version }}
+            fi
+          done
--- a/.github/workflows/publish-to-test-pypi.yml
+++ b/.github/workflows/publish-to-test-pypi.yml
@@ -0,0 +1,244 @@
+name: Publish Python 🐍 distribution 📦 to TestPyPI
+
+on:
+  workflow_dispatch:  # Keep manual trigger
+    inputs:
+      version:
+        description: 'Version number (e.g. 0.0.63.dev20250111)'
+        required: true
+        type: string
+  schedule:
+    - cron: "0 0 * * *"  # Run every day at midnight UTC
+
+jobs:
+  trigger-client-and-models-build:
+    name: Trigger llama-stack-client and llama-models build
+    runs-on: ubuntu-latest
+    outputs:
+      version: ${{ steps.version.outputs.version }}
+      client_run_id: ${{ steps.trigger-client.outputs.workflow_id }}
+      model_run_id: ${{ steps.trigger-models.outputs.workflow_id }}
+    steps:
+    - uses: actions/checkout@v4
+      with:
+        persist-credentials: false
+    - name: Get date
+      id: date
+      run: echo "date=$(date +'%Y%m%d')" >> $GITHUB_OUTPUT
+    - name: Compute version based on dispatch event
+      id: version
+      run: |
+        # Read base version from setup.py
+        version=$(sed -n 's/.*version="\([^"]*\)".*/\1/p' setup.py)
+        if [ "${{ github.event_name }}" = "schedule" ]; then
+          echo "version=${version}.dev${{ steps.date.outputs.date }}" >> $GITHUB_OUTPUT
+        elif [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
+          echo "version=${{ inputs.version }}" >> $GITHUB_OUTPUT
+        else
+          echo "version=${version}.dev$(shuf -i 10000000-99999999 -n 1)" >> $GITHUB_OUTPUT
+        fi
+    - name: Trigger llama-stack-client workflow
+      id: trigger-client
+      run: |
+        response=$(curl -X POST https://api.github.com/repos/meta-llama/llama-stack-client-python/dispatches \
+        -H 'Accept: application/vnd.github.everest-preview+json' \
+        -H "authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+        --data "{\"event_type\": \"build-client-package\", \"client_payload\": {\"source\": \"llama-stack-nightly\", \"version\": \"${{ steps.version.outputs.version }}\"}}" \
+        -w "\n%{http_code}")
+
+        http_code=$(echo "$response" | tail -n1)
+        if [ "$http_code" != "204" ]; then
+          echo "Failed to trigger client workflow"
+          exit 1
+        fi
+
+        # Get the run ID of the triggered workflow
+        sleep 5  # Wait for workflow to be created
+        run_id=$(curl -s -H "authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+                 "https://api.github.com/repos/meta-llama/llama-stack-client-python/actions/runs?event=repository_dispatch" \
+                 | jq '.workflow_runs[0].id')
+        echo "workflow_id=$run_id" >> $GITHUB_OUTPUT
+
+    - name: Trigger llama-models workflow
+      id: trigger-models
+      run: |
+        response=$(curl -X POST https://api.github.com/repos/meta-llama/llama-models/dispatches \
+        -H 'Accept: application/vnd.github.everest-preview+json' \
+        -H "authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+        --data "{\"event_type\": \"build-models-package\", \"client_payload\": {\"source\": \"llama-stack-nightly\", \"version\": \"${{ steps.version.outputs.version }}\"}}" \
+        -w "\n%{http_code}")
+
+        http_code=$(echo "$response" | tail -n1)
+        if [ "$http_code" != "204" ]; then
+          echo "Failed to trigger models workflow"
+          exit 1
+        fi
+
+        # Get the run ID of the triggered workflow
+        sleep 5  # Wait for workflow to be created
+        run_id=$(curl -s -H "authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+                 "https://api.github.com/repos/meta-llama/llama-models/actions/runs?event=repository_dispatch" \
+                 | jq '.workflow_runs[0].id')
+        echo "workflow_id=$run_id" >> $GITHUB_OUTPUT
+
+  wait-for-workflows:
+    name: Wait for triggered workflows
+    needs: trigger-client-and-models-build
+    runs-on: ubuntu-latest
+    steps:
+    - name: Wait for client workflow
+      run: |
+        while true; do
+          status=$(curl -s -H "authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+                   "https://api.github.com/repos/meta-llama/llama-stack-client-python/actions/runs/${{ needs.trigger-client-and-models-build.outputs.client_run_id }}" \
+                   | jq -r '.status')
+          conclusion=$(curl -s -H "authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+                      "https://api.github.com/repos/meta-llama/llama-stack-client-python/actions/runs/${{ needs.trigger-client-and-models-build.outputs.client_run_id }}" \
+                      | jq -r '.conclusion')
+
+          echo "llama-stack-client-python workflow status: $status, conclusion: $conclusion"
+
+          if [ "$status" = "completed" ]; then
+            if [ "$conclusion" != "success" ]; then
+              echo "llama-stack-client-python workflow failed"
+              exit 1
+            fi
+            break
+          fi
+
+          sleep 10
+        done
+
+    - name: Wait for models workflow
+      run: |
+        while true; do
+          status=$(curl -s -H "authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+                   "https://api.github.com/repos/meta-llama/llama-models/actions/runs/${{ needs.trigger-client-and-models-build.outputs.model_run_id }}" \
+                   | jq -r '.status')
+          conclusion=$(curl -s -H "authorization: Bearer ${{ secrets.PAT_TOKEN }}" \
+                      "https://api.github.com/repos/meta-llama/llama-models/actions/runs/${{ needs.trigger-client-and-models-build.outputs.model_run_id }}" \
+                      | jq -r '.conclusion')
+
+          echo "llama-models workflow status: $status, conclusion: $conclusion"
+
+          if [ "$status" = "completed" ]; then
+            if [ "$conclusion" != "success" ]; then
+              echo "llama-models workflow failed"
+              exit 1
+            fi
+            break
+          fi
+
+          sleep 10
+        done
+
+  build:
+    name: Build distribution 📦
+    needs:
+      - wait-for-workflows
+      - trigger-client-and-models-build
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v4
+      with:
+        persist-credentials: false
+    - name: Get date
+      id: date
+      run: echo "date=$(date +'%Y%m%d')" >> $GITHUB_OUTPUT
+    - name: Update version for nightly
+      run: |
+        sed -i 's/version="\([^"]*\)"/version="${{ needs.trigger-client-and-models-build.outputs.version }}"/' setup.py
+        sed -i 's/llama-stack-client>=\([^"]*\)/llama-stack-client==${{ needs.trigger-client-and-models-build.outputs.version }}/' requirements.txt
+        sed -i 's/llama-models>=\([^"]*\)/llama-models==${{ needs.trigger-client-and-models-build.outputs.version }}/' requirements.txt
+    - name: Set up Python
+      uses: actions/setup-python@v5
+      with:
+        python-version: "3.11"
+    - name: Install pypa/build
+      run: >-
+        python3 -m
+        pip install
+        build
+        --user
+    - name: Build a binary wheel and a source tarball
+      run: python3 -m build
+    - name: Store the distribution packages
+      uses: actions/upload-artifact@v4
+      with:
+        name: python-package-distributions
+        path: dist/
+
+  publish-to-testpypi:
+    name: Publish Python 🐍 distribution 📦 to TestPyPI
+    needs:
+    - build
+    runs-on: ubuntu-latest
+
+    environment:
+      name: testrelease
+      url: https://test.pypi.org/p/llama-stack
+
+    permissions:
+      id-token: write  # IMPORTANT: mandatory for trusted publishing
+
+    steps:
+    - name: Download all the dists
+      uses: actions/download-artifact@v4
+      with:
+        name: python-package-distributions
+        path: dist/
+    - name: Publish distribution 📦 to TestPyPI
+      uses: pypa/gh-action-pypi-publish@release/v1
+      with:
+        repository-url: https://test.pypi.org/legacy/
+
+  test-published-package:
+    name: Test published package
+    needs:
+      - publish-to-testpypi
+      - trigger-client-and-models-build
+    runs-on: ubuntu-latest
+    env:
+      TOGETHER_API_KEY: ${{ secrets.TOGETHER_API_KEY }}
+      TAVILY_SEARCH_API_KEY: ${{ secrets.TAVILY_SEARCH_API_KEY }}
+    steps:
+    - uses: actions/checkout@v4
+      with:
+        persist-credentials: false
+    - name: Install the package
+      run: |
+        max_attempts=6
+        attempt=1
+        while [ $attempt -le $max_attempts ]; do
+          echo "Attempt $attempt of $max_attempts to install package..."
+          if pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==${{ needs.trigger-client-and-models-build.outputs.version }}; then
+            echo "Package installed successfully"
+            break
+          fi
+          if [ $attempt -ge $max_attempts ]; then
+            echo "Failed to install package after $max_attempts attempts"
+            exit 1
+          fi
+          attempt=$((attempt + 1))
+          sleep 10
+        done
+    - name: Test the package versions
+      run: |
+        pip list | grep llama_
+    - name: Test CLI commands
+      run: |
+        llama model list
+        llama stack build --list-templates
+        llama model prompt-format -m Llama3.2-11B-Vision-Instruct
+        llama stack list-apis
+        llama stack list-providers inference
+        llama stack list-providers telemetry
+    - name: Test Notebook
+      run: |
+        pip install pytest nbval
+        llama stack build --template together --image-type venv
+        pytest -v -s --nbval-lax ./docs/getting_started.ipynb
+        pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
+
+    # TODO: add trigger for integration test workflow & docker builds
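The "Install the package" step above retries `pip install` a bounded number of times because a freshly published version can take a while to appear on the index. The same control flow can be sketched in Python as follows. This is only an illustration, not part of the workflow: the `retry` helper and its parameter names are hypothetical, and the defaults (6 attempts, 10 seconds between them) are copied from the shell loop.

```python
import time
from typing import Callable


def retry(
    action: Callable[[], bool],
    max_attempts: int = 6,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `action` until it returns True and return the successful attempt number.

    Raises RuntimeError once `max_attempts` calls have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        print(f"Attempt {attempt} of {max_attempts}...")
        if action():
            return attempt
        # Like the shell loop, wait only if another attempt will follow.
        if attempt < max_attempts:
            sleep(delay_seconds)
    raise RuntimeError(f"Failed after {max_attempts} attempts")
```

Passing `sleep` as a parameter is a design choice that lets the loop be exercised without real delays; in actual use, `action` would wrap the install command and return whether it succeeded.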
--- a/.gitignore
+++ b/.gitignore
@ -15,5 +15,7 @@ Package.resolved
 *.ipynb_checkpoints*
 .idea
 .venv/
-.idea
+.vscode
 _build
+docs/src
+pyrightconfig.json
--- a/.gitmodules
+++ b/.gitmodules
@ -1,3 +1,3 @@
 [submodule "llama_stack/providers/impls/ios/inference/executorch"]
-	path = llama_stack/providers/impls/ios/inference/executorch
+	path = llama_stack/providers/inline/ios/inference/executorch
 	url = https://github.com/pytorch/executorch
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@ -57,3 +57,17 @@ repos:
 #   hooks:
 #     - id: markdown-link-check
 #       args: ['--quiet']
+
+# -   repo: local
+#     hooks:
+#       - id: distro-codegen
+#         name: Distribution Template Codegen
+#         additional_dependencies:
+#           - rich
+#           - pydantic
+#         entry: python -m llama_stack.scripts.distro_codegen
+#         language: python
+#         pass_filenames: false
+#         require_serial: true
+#         files: ^llama_stack/templates/.*$
+#         stages: [manual]
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -0,0 +1,35 @@
+# Changelog
+
+## 0.0.53
+
+### Added
+- Resource-oriented design for models, shields, memory banks, datasets and eval tasks
+- Persistence for registered objects with distribution
+- Ability to persist memory banks created for FAISS
+- PostgreSQL KVStore implementation
+- Environment variable placeholder support in run.yaml files
+- Comprehensive Zero-to-Hero notebooks and quickstart guides
+- Support for quantized models in Ollama
+- Vision model support for Together, Fireworks, Meta-Reference, Ollama, and vLLM
+- Bedrock distribution with safety shields support
+- Evals API with task registration and scoring functions
+- MMLU and SimpleQA benchmark scoring functions
+- Huggingface dataset provider integration for benchmarks
+- Support for custom dataset registration from local paths
+- Benchmark evaluation CLI tools with visualization tables
+- RAG evaluation scoring functions and metrics
+- Local persistence for datasets and eval tasks
+
+### Changed
+- Split safety into distinct providers (llama-guard, prompt-guard, code-scanner)
+- Changed provider naming convention (`impls` → `inline`, `adapters` → `remote`)
+- Updated API signatures for dataset and eval task registration
+- Restructured folder organization for providers
+- Enhanced Docker build configuration
+- Added version prefixing for REST API routes
+- Enhanced evaluation task registration workflow
+- Improved benchmark evaluation output formatting
+- Restructured evals folder organization for better modularity
+
+### Removed
+- `llama stack configure` command
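The provider rename listed under "Changed" (`impls` → `inline`, `adapters` → `remote`) also affects on-disk paths, as the `.gitmodules` change in this commit shows. A minimal sketch of that path rewrite is given below. The `migrate_provider_path` helper is hypothetical and is not part of llama-stack; it assumes the renamed segment always directly follows a `providers` directory.

```python
# Hypothetical helper; the mapping itself comes from the changelog entry.
RENAMED_SEGMENTS = {"impls": "inline", "adapters": "remote"}


def migrate_provider_path(path: str) -> str:
    """Rewrite the provider-kind segment that directly follows 'providers'."""
    parts = path.split("/")
    for i in range(1, len(parts)):
        if parts[i - 1] == "providers" and parts[i] in RENAMED_SEGMENTS:
            parts[i] = RENAMED_SEGMENTS[parts[i]]
    return "/".join(parts)
```

Applied to the old submodule path `llama_stack/providers/impls/ios/inference/executorch`, this yields the new `llama_stack/providers/inline/ios/inference/executorch` path; paths without a renamed segment are returned unchanged.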
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -26,13 +26,62 @@ Meta has a [bounty program](http://facebook.com/whitehat/info) for the safe
 disclosure of security bugs. In those cases, please go through the process
 outlined on that page and do not file a public issue.

+
+## Pre-commit Hooks
+
+We use [pre-commit](https://pre-commit.com/) to run linting and formatting checks on your code. You can install the pre-commit hooks by running:
+
+```bash
+$ cd llama-stack
+$ conda activate <your-environment>
+$ pip install pre-commit
+$ pre-commit install
+```
+
+After that, pre-commit hooks will run automatically before each commit.
+
+
 ## Coding Style
 * 2 spaces for indentation rather than tabs
 * 80 character line length
 * ...

-## Tips
-* If you are developing with a llama-stack repository checked out and need your distribution to reflect changes from there, set `LLAMA_STACK_DIR` to that dir when running any of the `llama` CLI commands.
+## Common Tasks
+
+Some tips about common tasks you work on while contributing to Llama Stack:
+
+### Using `llama stack build`
+
+Building a stack image (conda / docker) will use the production version of the `llama-stack`, `llama-models` and `llama-stack-client` packages. If you are developing with a llama-stack repository checked out and need your code to be reflected in the stack image, set `LLAMA_STACK_DIR` and `LLAMA_MODELS_DIR` to the appropriate checked out directories when running any of the `llama` CLI commands.
+
+Example:
+```bash
+$ cd work/
+$ git clone https://github.com/meta-llama/llama-stack.git
+$ git clone https://github.com/meta-llama/llama-models.git
+$ cd llama-stack
+$ LLAMA_STACK_DIR=$(pwd) LLAMA_MODELS_DIR=../llama-models llama stack build --template <...>
+```
+
+
+### Updating Provider Configurations
+
+If you have made changes to a provider's configuration in any form (introducing a new config key, or changing models, etc.), you should run `python llama_stack/scripts/distro_codegen.py` to re-generate various YAML files as well as the documentation. You should not change `docs/source/.../distributions/` files manually as they are auto-generated.
+
+### Building the Documentation
+
+If you are making changes to the documentation at [https://llama-stack.readthedocs.io/en/latest/](https://llama-stack.readthedocs.io/en/latest/), you can use the following command to build the documentation and preview your changes. You will need [Sphinx](https://www.sphinx-doc.org/en/master/) and the readthedocs theme.
+
+```bash
+cd llama-stack/docs
+pip install -r requirements.txt
+pip install sphinx-autobuild
+
+# Build the documentation once.
+make html
+# Start a local server (usually at http://127.0.0.1:8000) that automatically rebuilds and refreshes when you make changes to the documentation.
+sphinx-autobuild source build/html
+```
+

 ## License
 By contributing to Llama, you agree that your contributions will be licensed
--- a/MANIFEST.in
+++ b/MANIFEST.in
@ -1,4 +1,5 @@
 include requirements.txt
+include distributions/dependencies.json
 include llama_stack/distribution/*.sh
 include llama_stack/cli/scripts/*.sh
-include llama_stack/templates/*/build.yaml
+include llama_stack/templates/*/*.yaml
--- a/README.md
+++ b/README.md
@ -1,73 +1,71 @@
-<img src="https://github.com/user-attachments/assets/2fedfe0f-6df7-4441-98b2-87a1fd95ee1c" width="300" title="Llama Stack Logo" alt="Llama Stack Logo"/>
-
 # Llama Stack

 [![PyPI version](https://img.shields.io/pypi/v/llama_stack.svg)](https://pypi.org/project/llama_stack/)
 [![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-stack)](https://pypi.org/project/llama-stack/)
 [![Discord](https://img.shields.io/discord/1257833999603335178)](https://discord.gg/llama-stack)

-This repository contains the Llama Stack API specifications as well as API Providers and Llama Stack Distributions.
+[**Quick Start**](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html) | [**Documentation**](https://llama-stack.readthedocs.io/en/latest/index.html) | [**Colab Notebook**](./docs/getting_started.ipynb)

-The Llama Stack defines and standardizes the building blocks needed to bring generative AI applications to market. These blocks span the entire development lifecycle: from model training and fine-tuning, through product evaluation, to building and running AI agents in production. Beyond definition, we are building providers for the Llama Stack APIs. These were developing open-source versions and partnering with providers, ensuring developers can assemble AI solutions using consistent, interlocking pieces across platforms. The ultimate goal is to accelerate innovation in the AI space.
+Llama Stack defines and standardizes the core building blocks that simplify AI application development. It codifies best practices across the Llama ecosystem. More specifically, it provides:

-The Stack APIs are rapidly improving, but still very much work in progress and we invite feedback as well as direct contributions.
+- **Unified API layer** for Inference, RAG, Agents, Tools, Safety, Evals, and Telemetry.
+- **Plugin architecture** to support the rich ecosystem of implementations of the different APIs in different environments like local development, on-premises, cloud, and mobile.
+- **Prepackaged verified distributions** which offer a one-stop solution for developers to get started quickly and reliably in any environment.
+- **Multiple developer interfaces** like CLI and SDKs for Python, Node, iOS, and Android.
+- **Standalone applications** as examples for how to build production-grade AI applications with Llama Stack.

+<div style="text-align: center;">
+  <img
+    src="https://github.com/user-attachments/assets/33d9576d-95ea-468d-95e2-8fa233205a50"
+    width="480"
+    title="Llama Stack"
+    alt="Llama Stack"
+  />
+</div>

-## APIs
+### Llama Stack Benefits
+- **Flexible Options**: Developers can choose their preferred infrastructure without changing APIs and enjoy flexible deployment choices.
+- **Consistent Experience**: With its unified APIs, Llama Stack makes it easier to build, test, and deploy AI applications with consistent application behavior.
+- **Robust Ecosystem**: Llama Stack is already integrated with distribution partners (cloud providers, hardware vendors, and AI-focused companies) that offer tailored infrastructure, software, and services for deploying Llama models.

-The Llama Stack consists of the following set of APIs:
+By reducing friction and complexity, Llama Stack empowers developers to focus on what they do best: building transformative generative AI applications.

- Inference
- Safety
- Memory
- Agentic System
- Evaluation
- Post Training
- Synthetic Data Generation
- Reward Scoring
-
-Each of the APIs themselves is a collection of REST endpoints.
-
-
-## API Providers
-
-A Provider is what makes the API real -- they provide the actual implementation backing the API.
-
-As an example, for Inference, we could have the implementation be backed by open source libraries like `[ torch | vLLM | TensorRT ]` as possible options.
-
-A provider can also be just a pointer to a remote REST service -- for example, cloud providers or dedicated inference providers could serve these APIs.
-
-
-## Llama Stack Distribution
-
-A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well always using the same uniform set of APIs for developing Generative AI applications.
-
-## Supported Llama Stack Implementations
 ### API Providers
+Here is a list of the various API providers and available distributions that can help developers get started easily:

-
-|  **API Provider Builder** |  **Environments** | **Agents** | **Inference** | **Memory** | **Safety** | **Telemetry** |
-| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
-|  Meta Reference  |  Single Node | :heavy_check_mark:  |  :heavy_check_mark:  |  :heavy_check_mark:  |  :heavy_check_mark:  |  :heavy_check_mark:  |
-|  Fireworks  |  Hosted  | :heavy_check_mark:  | :heavy_check_mark:  |  :heavy_check_mark:  |    |   |
-|  AWS Bedrock  |  Hosted  |    |  :heavy_check_mark:  |    | :heavy_check_mark:  | |
-|  Snowflake  |  Hosted  |    |  :heavy_check_mark:  |    |   |
-|  Together  |  Hosted  |  :heavy_check_mark:  |  :heavy_check_mark:  |   | :heavy_check_mark:  |  |
-|  Ollama  | Single Node   |    |  :heavy_check_mark:  |    |   |
-|  TGI  |  Hosted and Single Node  |    |  :heavy_check_mark:  |    |   |
-| Chroma | Single Node |  |  | :heavy_check_mark: |  |  |
-| PG Vector | Single Node |  |  | :heavy_check_mark: |  |  |
-| PyTorch ExecuTorch | On-device iOS | :heavy_check_mark:  | :heavy_check_mark:  |  |  |
+|                                  **API Provider Builder**                                  |    **Environments**    |     **Agents**     |   **Inference**    |     **Memory**     |     **Safety**     |   **Telemetry**    |
+|:------------------------------------------------------------------------------------------:|:----------------------:|:------------------:|:------------------:|:------------------:|:------------------:|:------------------:|
+|                                       Meta Reference                                       |      Single Node       | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
+|                                          Cerebras                                          |         Hosted         |                    | :heavy_check_mark: |                    |                    |                    |
+|                                         Fireworks                                          |         Hosted         | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |                    |                    |
+|                                        AWS Bedrock                                         |         Hosted         |                    | :heavy_check_mark: |                    | :heavy_check_mark: |                    |
+|                                         Snowflake                                          |         Hosted         |                    | :heavy_check_mark: |                    |                    |                    |
+|                                          Together                                          |         Hosted         | :heavy_check_mark: | :heavy_check_mark: |                    | :heavy_check_mark: |                    |
+|                                            Groq                                            |         Hosted         |                    | :heavy_check_mark: |                    |                    |                    |
+|                                           Ollama                                           |      Single Node       |                    | :heavy_check_mark: |                    |                    |                    |
+|                                            TGI                                             | Hosted and Single Node |                    | :heavy_check_mark: |                    |                    |                    |
+| NVIDIA NIM | Hosted and Single Node |                    | :heavy_check_mark: |                    |                    |                    |
+|                                           Chroma                                           |      Single Node       |                    |                    | :heavy_check_mark: |                    |                    |
+|                                         PG Vector                                          |      Single Node       |                    |                    | :heavy_check_mark: |                    |                    |
+|                                     PyTorch ExecuTorch                                     |     On-device iOS      | :heavy_check_mark: | :heavy_check_mark: |                    |                    |                    |
+|                        vLLM                        | Hosted and Single Node |                    | :heavy_check_mark: |                    |                    |                    |

 ### Distributions
-|  **Distribution Provider** |  **Docker** | **Inference** | **Memory** | **Safety** | **Telemetry** |
-| :----: | :----: | :----: | :----: | :----: | :----: |
-|  Meta Reference |  [Local GPU](https://hub.docker.com/repository/docker/llamastack/llamastack-local-gpu/general), [Local CPU](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
-|  Dell-TGI | [Local TGI + Chroma](https://hub.docker.com/repository/docker/llamastack/llamastack-local-tgi-chroma/general)  | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |

+A Llama Stack Distribution (or "distro") is a pre-configured bundle of provider implementations for each API component. Distributions make it easy to get started with a specific deployment scenario: you can begin with a local development setup (e.g., Ollama) and seamlessly transition to production (e.g., Fireworks) without changing your application code. Here are some of the distributions we support:

+|               **Distribution**                |                                                                    **Llama Stack Docker**                                                                     |                                                 Start This Distribution                                                  |
+|:---------------------------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------------------:|
+|                Meta Reference                 |           [llamastack/distribution-meta-reference-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-gpu/general)           |      [Guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/meta-reference-gpu.html)      |
+|           Meta Reference Quantized            | [llamastack/distribution-meta-reference-quantized-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-quantized-gpu/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/meta-reference-quantized-gpu.html) |
+|                   Cerebras                    |                     [llamastack/distribution-cerebras](https://hub.docker.com/repository/docker/llamastack/distribution-cerebras/general)                     |   [Guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/cerebras.html)   |
+|                    Ollama                     |                       [llamastack/distribution-ollama](https://hub.docker.com/repository/docker/llamastack/distribution-ollama/general)                       |            [Guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/ollama.html)            |
+|                      TGI                      |                          [llamastack/distribution-tgi](https://hub.docker.com/repository/docker/llamastack/distribution-tgi/general)                          |             [Guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/tgi.html)              |
+|                   Together                    |                     [llamastack/distribution-together](https://hub.docker.com/repository/docker/llamastack/distribution-together/general)                     |           [Guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/together.html)           |
+|                   Fireworks                   |                    [llamastack/distribution-fireworks](https://hub.docker.com/repository/docker/llamastack/distribution-fireworks/general)                    |          [Guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/fireworks.html)           |
+| vLLM |                  [llamastack/distribution-remote-vllm](https://hub.docker.com/repository/docker/llamastack/distribution-remote-vllm/general)                  |         [Guide](https://llama-stack.readthedocs.io/en/latest/distributions/self_hosted_distro/remote-vllm.html)          |

-## Installation
+### Installation

 You have two ways to install this repository:

@ -78,7 +76,8 @@ You have two ways to install this repository:
   ```

 2. **Install from source**:
-   If you prefer to install from the source code, follow these steps:
+   If you prefer to install from the source code, make sure you have [conda installed](https://docs.conda.io/projects/conda/en/stable).
+   Then, follow these steps:
   ```bash
    mkdir -p ~/local
    cd ~/local
@ -88,35 +87,31 @@ You have two ways to install this repository:
    conda activate stack

    cd llama-stack
-    $CONDA_PREFIX/bin/pip install -e .
+    pip install -e .
   ```

-## Documentations
+### Documentation

-The `llama` CLI makes it easy to work with the Llama Stack set of tools. Please find the following docs for details.
+Please check out our [Documentation](https://llama-stack.readthedocs.io/en/latest/index.html) page for more details.

-* [CLI reference](docs/cli_reference.md)
+* [CLI reference](https://llama-stack.readthedocs.io/en/latest/references/llama_cli_reference/index.html)
    * Guide using `llama` CLI to work with Llama models (download, study prompts), and building/starting a Llama Stack distribution.
-* [Getting Started](docs/getting_started.md)
+* [Getting Started](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html)
    * Quick guide to start a Llama Stack server.
    * [Jupyter notebook](./docs/getting_started.ipynb) to walk-through how to use simple text and vision inference llama_stack_client APIs
-* [Building a Llama Stack Distribution](docs/building_distro.md)
-    * Guide to build a Llama Stack distribution
-* [Distributions](./distributions/)
-    * References to start Llama Stack distributions backed with different API providers.
-* [Developer Cookbook](./docs/developer_cookbook.md)
-    * References to guides to help you get started based on your developer needs.
+    * The complete Llama Stack lesson [Colab notebook](https://colab.research.google.com/drive/1dtVmxotBsI4cGZQNsJRYPrLiDeT0Wnwt) of the new [Llama 3.2 course on Deeplearning.ai](https://learn.deeplearning.ai/courses/introducing-multimodal-llama-3-2/lesson/8/llama-stack).
+    * A [Zero-to-Hero Guide](https://github.com/meta-llama/llama-stack/tree/main/docs/zero_to_hero_guide) that guides you through all the key components of Llama Stack with code samples.
 * [Contributing](CONTRIBUTING.md)
-    * [Adding a new API Provider](./docs/new_api_provider.md) to walk-through how to add a new API provider.
+    * [Adding a new API Provider](https://llama-stack.readthedocs.io/en/latest/contributing/new_api_provider.html), a walk-through of how to add a new API provider.

-## Llama Stack Client SDK
+### Llama Stack Client SDKs

 |  **Language** |  **Client SDK** | **Package** |
 | :----: | :----: | :----: |
 | Python |  [llama-stack-client-python](https://github.com/meta-llama/llama-stack-client-python) | [![PyPI version](https://img.shields.io/pypi/v/llama_stack_client.svg)](https://pypi.org/project/llama_stack_client/)
 | Swift  | [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift) | [![Swift Package Index](https://img.shields.io/endpoint?url=https%3A%2F%2Fswiftpackageindex.com%2Fapi%2Fpackages%2Fmeta-llama%2Fllama-stack-client-swift%2Fbadge%3Ftype%3Dswift-versions)](https://swiftpackageindex.com/meta-llama/llama-stack-client-swift)
 | Node   | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [![NPM version](https://img.shields.io/npm/v/llama-stack-client.svg)](https://npmjs.org/package/llama-stack-client)
-| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) |
+| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) | [![Maven version](https://img.shields.io/maven-central/v/com.llama.llamastack/llama-stack-client-kotlin)](https://central.sonatype.com/artifact/com.llama.llamastack/llama-stack-client-kotlin)

 Check out our client SDKs for connecting to Llama Stack server in your preferred language, you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) programming languages to quickly build your applications.

--- a/distributions/README.md
+++ b/distributions/README.md
@ -1,14 +0,0 @@
-# Llama Stack Distribution
-
-A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well always using the same uniform set of APIs for developing Generative AI applications.
-
-
-## Quick Start Llama Stack Distributions Guide
-| **Distribution** 	|           **Llama Stack Docker**           	| Start This Distribution 	|    **Inference**   	|     **Agents**     	|     **Memory**     	|     **Safety**     	|    **Telemetry**   	|
-|:----------------:	|:------------------------------------------:	|:-----------------------:	|:------------------:	|:------------------:	|:------------------:	|:------------------:	|:------------------:	|
-|  Meta Reference  	| [llamastack/distribution-meta-reference-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-gpu/general) 	|       [Guide](./meta-reference-gpu/)       	| meta-reference 	| meta-reference 	| meta-reference; remote::pgvector; remote::chromadb	| meta-reference 	| meta-reference	|
-|  Meta Reference Quantized  	| [llamastack/distribution-meta-reference-quantized-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-quantized-gpu/general) 	|       [Guide](./meta-reference-quantized-gpu/)       	| meta-reference-quantized 	| meta-reference 	| meta-reference; remote::pgvector; remote::chromadb	| meta-reference 	| meta-reference	|
-|      Ollama      	|       [llamastack/distribution-ollama](https://hub.docker.com/repository/docker/llamastack/distribution-ollama/general)       	|       [Guide](./ollama/)       	| remote::ollama	| meta-reference 	| remote::pgvector; remote::chromadb 	| remote::ollama 	| meta-reference 	|
-|        TGI       	|         [llamastack/distribution-tgi](https://hub.docker.com/repository/docker/llamastack/distribution-tgi/general)        	|       [Guide](./tgi/)       	| remote::tgi	| meta-reference 	| meta-reference; remote::pgvector; remote::chromadb 	| meta-reference 	| meta-reference 	|
-|        Together       	|         [llamastack/distribution-together](https://hub.docker.com/repository/docker/llamastack/distribution-together/general)        	|       [Guide](./together/)       	| remote::together 	| meta-reference | remote::weaviate | meta-reference 	| meta-reference  	|
-|        Fireworks       	|         [llamastack/distribution-fireworks](https://hub.docker.com/repository/docker/llamastack/distribution-fireworks/general)        	|       [Guide](./fireworks/)       	| remote::fireworks 	| meta-reference | remote::weaviate | meta-reference 	| meta-reference  	|
--- a/distributions/bedrock/compose.yaml
+++ b/distributions/bedrock/compose.yaml
@ -0,0 +1,15 @@
+services:
+  llamastack:
+    image: distribution-bedrock
+    volumes:
+      - ~/.llama:/root/.llama
+      - ./run.yaml:/root/llamastack-run-bedrock.yaml
+    ports:
+      - "8321:8321"
+    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-bedrock.yaml"
+    deploy:
+      restart_policy:
+        condition: on-failure
+        delay: 3s
+        max_attempts: 5
+        window: 60s
--- a/distributions/bedrock/run.yaml
+++ b/distributions/bedrock/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/bedrock/run.yaml
--- a/distributions/cerebras/build.yaml
+++ b/distributions/cerebras/build.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/cerebras/build.yaml
--- a/distributions/cerebras/compose.yaml
+++ b/distributions/cerebras/compose.yaml
@ -0,0 +1,16 @@
+services:
+  llamastack:
+    image: llamastack/distribution-cerebras
+    network_mode: "host"
+    volumes:
+      - ~/.llama:/root/.llama
+      - ./run.yaml:/root/llamastack-run-cerebras.yaml
+    ports:
+      - "8321:8321"
+    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-cerebras.yaml"
+    deploy:
+      restart_policy:
+        condition: on-failure
+        delay: 3s
+        max_attempts: 5
+        window: 60s
--- a/distributions/cerebras/run.yaml
+++ b/distributions/cerebras/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/cerebras/run.yaml
--- a/distributions/databricks/build.yaml
+++ b/distributions/databricks/build.yaml
@ -1 +0,0 @@
-../../llama_stack/templates/databricks/build.yaml
--- a/distributions/dell-tgi/compose.yaml
+++ b/distributions/dell-tgi/compose.yaml
@ -40,7 +40,7 @@ services:
      # Link to TGI run.yaml file
      - ./run.yaml:/root/my-run.yaml
    ports:
-      - "5000:5000"
+      - "8321:8321"
    # Hack: wait for TGI server to start before starting docker
    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
    restart_policy:
--- a/distributions/dell-tgi/run.yaml
+++ b/distributions/dell-tgi/run.yaml
@ -1,7 +1,6 @@
 version: '2'
-built_at: '2024-10-08T17:40:45.325529'
 image_name: local
-docker_image: null
+container_image: null
 conda_env: local
 apis:
 - shields
@ -19,22 +18,21 @@ providers:
      url: http://127.0.0.1:80
  safety:
  - provider_id: meta0
-    provider_type: meta-reference
+    provider_type: inline::llama-guard
    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
+      model: Llama-Guard-3-1B
+      excluded_categories: []
+  - provider_id: meta1
+    provider_type: inline::prompt-guard
+    config:
+      model: Prompt-Guard-86M
  memory:
  - provider_id: meta0
-    provider_type: meta-reference
+    provider_type: inline::faiss
    config: {}
  agents:
  - provider_id: meta0
-    provider_type: meta-reference
+    provider_type: inline::meta-reference
    config:
      persistence_store:
        namespace: null
@ -42,5 +40,5 @@ providers:
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
-    provider_type: meta-reference
+    provider_type: inline::meta-reference
    config: {}
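The `run.yaml` change above splits the single `meta-reference` safety provider, which nested a `llama_guard_shield` and a `prompt_guard_shield` block, into two separate providers with flat configs. A rough sketch of that transformation on already-parsed YAML is shown below. The `split_safety_provider` function is hypothetical and not a llama-stack API; the `meta0`/`meta1` id numbering simply mirrors this one file and is an assumption, not a rule.

```python
from typing import Any


def split_safety_provider(old: dict[str, Any]) -> list[dict[str, Any]]:
    """Convert one old-style 'meta-reference' safety entry into the new flat entries."""
    config = old.get("config", {})
    new_entries: list[dict[str, Any]] = []

    guard = config.get("llama_guard_shield")
    if guard is not None:
        # disable_input_check / disable_output_check are dropped,
        # as they no longer appear in the new config in the diff.
        new_entries.append({
            "provider_type": "inline::llama-guard",
            "config": {
                "model": guard["model"],
                "excluded_categories": guard.get("excluded_categories", []),
            },
        })

    prompt = config.get("prompt_guard_shield")
    if prompt is not None:
        new_entries.append({
            "provider_type": "inline::prompt-guard",
            "config": {"model": prompt["model"]},
        })

    # Assumed id scheme (meta0, meta1, ...), copied from the file above.
    for index, entry in enumerate(new_entries):
        entry["provider_id"] = f"meta{index}"
    return new_entries
```

Feeding it the removed `meta-reference` entry from the diff produces the two added entries, `inline::llama-guard` with `Llama-Guard-3-1B` and `inline::prompt-guard` with `Prompt-Guard-86M`.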
--- a/distributions/dependencies.json
+++ b/distributions/dependencies.json
@ -0,0 +1,457 @@
+{
+  "hf-serverless": [
+    "aiohttp",
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "huggingface_hub",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "together": [
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "together",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "vllm-gpu": [
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "vllm",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "remote-vllm": [
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "fireworks": [
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "fireworks-ai",
+    "httpx",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "tgi": [
+    "aiohttp",
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "huggingface_hub",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "bedrock": [
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "boto3",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "meta-reference-gpu": [
+    "accelerate",
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "fairscale",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "lm-format-enforcer",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentence-transformers",
+    "sentencepiece",
+    "torch",
+    "torchvision",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "zmq",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "nvidia": [
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "meta-reference-quantized-gpu": [
+    "accelerate",
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "fairscale",
+    "faiss-cpu",
+    "fastapi",
+    "fbgemm-gpu",
+    "fire",
+    "httpx",
+    "lm-format-enforcer",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentence-transformers",
+    "sentencepiece",
+    "torch",
+    "torchao==0.5.0",
+    "torchvision",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "zmq",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "cerebras": [
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "cerebras_cloud_sdk",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "matplotlib",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "ollama": [
+    "aiohttp",
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "matplotlib",
+    "nltk",
+    "numpy",
+    "ollama",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
+  "hf-endpoint": [
+    "aiohttp",
+    "aiosqlite",
+    "autoevals",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "datasets",
+    "faiss-cpu",
+    "fastapi",
+    "fire",
+    "httpx",
+    "huggingface_hub",
+    "matplotlib",
+    "mcp",
+    "nltk",
+    "numpy",
+    "openai",
+    "opentelemetry-exporter-otlp-proto-http",
+    "opentelemetry-sdk",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "requests",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ]
+}
--- a/distributions/fireworks/README.md
+++ b/distributions/fireworks/README.md
@ -1,79 +0,0 @@
-# Fireworks Distribution
-
-The `llamastack/distribution-` distribution consists of the following provider configurations.
-
-
-| **API**         	| **Inference** 	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
-|-----------------	|---------------	|----------------	|--------------------------------------------------	|----------------	|----------------	|
-| **Provider(s)** 	| remote::fireworks   	| meta-reference 	| meta-reference 	| meta-reference 	| meta-reference 	|
-
-
-### Start the Distribution (Single Node CPU)
-
-> [!NOTE]
-> This assumes you have an hosted endpoint at Fireworks with API Key.
-
-```
-$ cd distributions/fireworks
-$ ls
-compose.yaml  run.yaml
-$ docker compose up
-```
-
-Make sure in you `run.yaml` file, you inference provider is pointing to the correct Fireworks URL server endpoint. E.g.
-```
-inference:
-  - provider_id: fireworks
-    provider_type: remote::fireworks
-    config:
-      url: https://api.fireworks.ai/inferenc
-      api_key: <optional api key>
-```
-
-### (Alternative) llama stack run (Single Node CPU)
-
-```
-docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-fireworks --yaml_config /root/my-run.yaml
-```
-
-Make sure in you `run.yaml` file, you inference provider is pointing to the correct Fireworks URL server endpoint. E.g.
-```
-inference:
-  - provider_id: fireworks
-    provider_type: remote::fireworks
-    config:
-      url: https://api.fireworks.ai/inference
-      api_key: <enter your api key>
-```
-
-**Via Conda**
-
-```bash
-llama stack build --template fireworks --image-type conda
-# -- modify run.yaml to a valid Fireworks server endpoint
-llama stack run ./run.yaml
-```
-
-### Model Serving
-
-Use `llama-stack-client models list` to chekc the available models served by Fireworks.
-```
-$ llama-stack-client models list
-+------------------------------+------------------------------+---------------+------------+
-| identifier                   | llama_model                  | provider_id   | metadata   |
-+==============================+==============================+===============+============+
-| Llama3.1-8B-Instruct         | Llama3.1-8B-Instruct         | fireworks0    | {}         |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.1-70B-Instruct        | Llama3.1-70B-Instruct        | fireworks0    | {}         |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.1-405B-Instruct       | Llama3.1-405B-Instruct       | fireworks0    | {}         |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-1B-Instruct         | Llama3.2-1B-Instruct         | fireworks0    | {}         |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-3B-Instruct         | Llama3.2-3B-Instruct         | fireworks0    | {}         |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-11B-Vision-Instruct | Llama3.2-11B-Vision-Instruct | fireworks0    | {}         |
-+------------------------------+------------------------------+---------------+------------+
-| Llama3.2-90B-Vision-Instruct | Llama3.2-90B-Vision-Instruct | fireworks0    | {}         |
-+------------------------------+------------------------------+---------------+------------+
-```
--- a/distributions/fireworks/compose.yaml
+++ b/distributions/fireworks/compose.yaml
@ -1,13 +1,11 @@
 services:
  llamastack:
    image: llamastack/distribution-fireworks
-    network_mode: "host"
-    volumes:
-      - ~/.llama:/root/.llama
-      - ./run.yaml:/root/llamastack-run-fireworks.yaml
    ports:
-      - "5000:5000"
-    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-fireworks.yaml"
+      - "8321:8321"
+    environment:
+      - FIREWORKS_API_KEY=${FIREWORKS_API_KEY}
+    entrypoint: bash -c "python -m llama_stack.distribution.server.server --template fireworks"
    deploy:
      restart_policy:
        condition: on-failure
--- a/distributions/fireworks/run.yaml
+++ b/distributions/fireworks/run.yaml
@ -1,51 +0,0 @@
-version: '2'
-built_at: '2024-10-08T17:40:45.325529'
-image_name: local
-docker_image: null
-conda_env: local
-apis:
-- shields
-- agents
-- models
-- memory
-- memory_banks
-- inference
-- safety
-providers:
-  inference:
-  - provider_id: fireworks0
-    provider_type: remote::fireworks
-    config:
-      url: https://api.fireworks.ai/inference
-      # api_key: <ENTER_YOUR_API_KEY>
-  safety:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
-  memory:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
-  # Uncomment to use weaviate memory provider
-  # - provider_id: weaviate0
-  #   provider_type: remote::weaviate
-  #   config: {}
-  agents:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      persistence_store:
-        namespace: null
-        type: sqlite
-        db_path: ~/.llama/runtime/kvstore.db
-  telemetry:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
--- a/distributions/fireworks/run.yaml
+++ b/distributions/fireworks/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/fireworks/run.yaml
--- a/distributions/hf-endpoint/build.yaml
+++ b/distributions/hf-endpoint/build.yaml
@ -1 +0,0 @@
-../../llama_stack/templates/hf-endpoint/build.yaml
--- a/distributions/hf-serverless/build.yaml
+++ b/distributions/hf-serverless/build.yaml
@ -1 +0,0 @@
-../../llama_stack/templates/hf-serverless/build.yaml
--- a/distributions/meta-reference-gpu/README.md
+++ b/distributions/meta-reference-gpu/README.md
@ -1,102 +0,0 @@
-# Meta Reference Distribution
-
-The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations.
-
-
-| **API**         	| **Inference** 	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
-|-----------------	|---------------	|----------------	|--------------------------------------------------	|----------------	|----------------	|
-| **Provider(s)** 	| meta-reference  	| meta-reference 	| meta-reference, remote::pgvector, remote::chroma 	| meta-reference 	| meta-reference 	|
-
-
-### Start the Distribution (Single Node GPU)
-
-```
-$ cd distributions/meta-reference-gpu
-$ ls
-build.yaml  compose.yaml  README.md  run.yaml
-$ docker compose up
-```
-
-> [!NOTE]
-> This assumes you have access to GPU to start a local server with access to your GPU.
-
-
-> [!NOTE]
-> `~/.llama` should be the path containing downloaded weights of Llama models.
-
-
-This will download and start running a pre-built docker container. Alternatively, you may use the following commands:
-
-```
-docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
-```
-
-### Alternative (Build and start distribution locally via conda)
- You may checkout the [Getting Started](../../docs/getting_started.md) for more details on building locally via conda and starting up a meta-reference distribution.
-
-### Start Distribution With pgvector/chromadb Memory Provider
-##### pgvector
-1. Start running the pgvector server:
-
-```
-docker run --network host --name mypostgres -it -p 5432:5432 -e POSTGRES_PASSWORD=mysecretpassword -e POSTGRES_USER=postgres -e POSTGRES_DB=postgres pgvector/pgvector:pg16
-```
-
-2. Edit the `run.yaml` file to point to the pgvector server.
-```
-memory:
-  - provider_id: pgvector
-    provider_type: remote::pgvector
-    config:
-      host: 127.0.0.1
-      port: 5432
-      db: postgres
-      user: postgres
-      password: mysecretpassword
-```
-
-> [!NOTE]
-> If you get a `RuntimeError: Vector extension is not installed.`. You will need to run `CREATE EXTENSION IF NOT EXISTS vector;` to include the vector extension. E.g.
-
-```
-docker exec -it mypostgres ./bin/psql -U postgres
-postgres=# CREATE EXTENSION IF NOT EXISTS vector;
-postgres=# SELECT extname from pg_extension;
- extname
-```
-
-3. Run `docker compose up` with the updated `run.yaml` file.
-
-##### chromadb
-1. Start running chromadb server
-```
-docker run -it --network host --name chromadb -p 6000:6000 -v ./chroma_vdb:/chroma/chroma -e IS_PERSISTENT=TRUE chromadb/chroma:latest
-```
-
-2. Edit the `run.yaml` file to point to the chromadb server.
-```
-memory:
-  - provider_id: remote::chromadb
-    provider_type: remote::chromadb
-    config:
-      host: localhost
-      port: 6000
-```
-
-3. Run `docker compose up` with the updated `run.yaml` file.
-
-### Serving a new model
-You may change the `config.model` in `run.yaml` to update the model currently being served by the distribution. Make sure you have the model checkpoint downloaded in your `~/.llama`.
-```
-inference:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      model: Llama3.2-11B-Vision-Instruct
-      quantization: null
-      torch_seed: null
-      max_seq_len: 4096
-      max_batch_size: 1
-```
-
-Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
--- a/distributions/meta-reference-gpu/compose.yaml
+++ b/distributions/meta-reference-gpu/compose.yaml
@ -6,7 +6,7 @@ services:
      - ~/.llama:/root/.llama
      - ./run.yaml:/root/my-run.yaml
    ports:
-      - "5000:5000"
+      - "8321:8321"
    devices:
      - nvidia.com/gpu=all
    environment:
@ -25,11 +25,10 @@ services:
            # satisfy all the requested capabilities for a successful
            # reservation.
            capabilities: [gpu]
-    runtime: nvidia
-    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
-    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s
+    runtime: nvidia
+    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
--- a/distributions/meta-reference-gpu/run-with-safety.yaml
+++ b/distributions/meta-reference-gpu/run-with-safety.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/meta-reference-gpu/run-with-safety.yaml
--- a/distributions/meta-reference-gpu/run.yaml
+++ b/distributions/meta-reference-gpu/run.yaml
@ -1,59 +0,0 @@
-version: '2'
-built_at: '2024-10-08T17:40:45.325529'
-image_name: local
-docker_image: null
-conda_env: local
-apis:
-- shields
-- agents
-- models
-- memory
-- memory_banks
-- inference
-- safety
-providers:
-  inference:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      model: Llama3.1-8B-Instruct
-      quantization: null
-      torch_seed: null
-      max_seq_len: 4096
-      max_batch_size: 1
-  safety:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
-  memory:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
-  # Uncomment to use pgvector
-  # - provider_id: pgvector
-  #   provider_type: remote::pgvector
-  #   config:
-  #     host: 127.0.0.1
-  #     port: 5432
-  #     db: postgres
-  #     user: postgres
-  #     password: mysecretpassword
-  agents:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      persistence_store:
-        namespace: null
-        type: sqlite
-        db_path: ~/.llama/runtime/kvstore.db
-  telemetry:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
--- a/distributions/meta-reference-gpu/run.yaml
+++ b/distributions/meta-reference-gpu/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/meta-reference-gpu/run.yaml
--- a/distributions/meta-reference-quantized-gpu/README.md
+++ b/distributions/meta-reference-quantized-gpu/README.md
@ -1,34 +0,0 @@
-# Meta Reference Quantized Distribution
-
-The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations.
-
-
-| **API**         	| **Inference**            	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
-|-----------------	|------------------------  	|----------------	|--------------------------------------------------	|----------------	|----------------	|
-| **Provider(s)** 	| meta-reference-quantized  | meta-reference 	| meta-reference, remote::pgvector, remote::chroma 	| meta-reference 	| meta-reference 	|
-
-The only difference vs. the `meta-reference-gpu` distribution is that it has support for more efficient inference -- with fp8, int4 quantization, etc.
-
-### Start the Distribution (Single Node GPU)
-
-> [!NOTE]
-> This assumes you have access to GPU to start a local server with access to your GPU.
-
-
-> [!NOTE]
-> `~/.llama` should be the path containing downloaded weights of Llama models.
-
-
-To download and start running a pre-built docker container, you may use the following commands:
-
-```
-docker run -it -p 5000:5000 -v ~/.llama:/root/.llama \
-  -v ./run.yaml:/root/my-run.yaml \
-  --gpus=all \
-  distribution-meta-reference-quantized-gpu \
-  --yaml_config /root/my-run.yaml
-```
-
-### Alternative (Build and start distribution locally via conda)
-
- You may checkout the [Getting Started](../../docs/getting_started.md) for more details on building locally via conda and starting up the distribution.
--- a/distributions/meta-reference-quantized-gpu/compose.yaml
+++ b/distributions/meta-reference-quantized-gpu/compose.yaml
@ -6,7 +6,7 @@ services:
      - ~/.llama:/root/.llama
      - ./run.yaml:/root/my-run.yaml
    ports:
-      - "5000:5000"
+      - "8321:8321"
    devices:
      - nvidia.com/gpu=all
    environment:
--- a/distributions/meta-reference-quantized-gpu/run.yaml
+++ b/distributions/meta-reference-quantized-gpu/run.yaml
@ -1,7 +1,6 @@
 version: '2'
-built_at: '2024-10-08T17:40:45.325529'
 image_name: local
-docker_image: null
+container_image: null
 conda_env: local
 apis:
 - shields
@ -14,7 +13,7 @@ apis:
 providers:
  inference:
  - provider_id: meta0
-    provider_type: meta-reference-quantized
+    provider_type: inline::meta-reference-quantized
    config:
      model: Llama3.2-3B-Instruct:int4-qlora-eo8
      quantization:
@ -22,24 +21,32 @@ providers:
      torch_seed: null
      max_seq_len: 2048
      max_batch_size: 1
+  - provider_id: meta1
+    provider_type: inline::meta-reference-quantized
+    config:
+      # not a quantized model !
+      model: Llama-Guard-3-1B
+      quantization: null
+      torch_seed: null
+      max_seq_len: 2048
+      max_batch_size: 1
  safety:
  - provider_id: meta0
-    provider_type: meta-reference
+    provider_type: inline::llama-guard
    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
+      model: Llama-Guard-3-1B
+      excluded_categories: []
+  - provider_id: meta1
+    provider_type: inline::prompt-guard
+    config:
+      model: Prompt-Guard-86M
  memory:
  - provider_id: meta0
-    provider_type: meta-reference
+    provider_type: inline::meta-reference
    config: {}
  agents:
  - provider_id: meta0
-    provider_type: meta-reference
+    provider_type: inline::meta-reference
    config:
      persistence_store:
        namespace: null
@ -47,5 +54,5 @@ providers:
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
-    provider_type: meta-reference
+    provider_type: inline::meta-reference
    config: {}
--- a/distributions/ollama/README.md
+++ b/distributions/ollama/README.md
@ -1,116 +0,0 @@
-# Ollama Distribution
-
-The `llamastack/distribution-ollama` distribution consists of the following provider configurations.
-
-| **API**         	| **Inference**  	| **Agents**     	| **Memory**                       	| **Safety**     	| **Telemetry**  	|
-|-----------------	|----------------	|----------------	|----------------------------------	|----------------	|----------------	|
-| **Provider(s)** 	| remote::ollama 	| meta-reference 	| remote::pgvector, remote::chroma 	| remote::ollama 	| meta-reference 	|
-
-
-### Start a Distribution (Single Node GPU)
-
-> [!NOTE]
-> This assumes you have access to GPU to start a Ollama server with access to your GPU.
-
-```
-$ cd distributions/ollama/gpu
-$ ls
-compose.yaml  run.yaml
-$ docker compose up
-```
-
-You will see outputs similar to following ---
-```
-[ollama]               | [GIN] 2024/10/18 - 21:19:41 | 200 |     226.841µs |             ::1 | GET      "/api/ps"
-[ollama]               | [GIN] 2024/10/18 - 21:19:42 | 200 |      60.908µs |             ::1 | GET      "/api/ps"
-INFO:     Started server process [1]
-INFO:     Waiting for application startup.
-INFO:     Application startup complete.
-INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-[llamastack] | Resolved 12 providers
-[llamastack] |  inner-inference => ollama0
-[llamastack] |  models => __routing_table__
-[llamastack] |  inference => __autorouted__
-```
-
-To kill the server
-```
-docker compose down
-```
-
-### Start the Distribution (Single Node CPU)
-
-> [!NOTE]
-> This will start an ollama server with CPU only, please see [Ollama Documentations](https://github.com/ollama/ollama) for serving models on CPU only.
-
-```
-$ cd distributions/ollama/cpu
-$ ls
-compose.yaml  run.yaml
-$ docker compose up
-```
-
-### (Alternative) ollama run + llama stack run
-
-If you wish to separately spin up a Ollama server, and connect with Llama Stack, you may use the following commands.
-
-#### Start Ollama server.
- Please check the [Ollama Documentations](https://github.com/ollama/ollama) for more details.
-
-**Via Docker**
-```
-docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
-```
-
-**Via CLI**
-```
-ollama run <model_id>
-```
-
-#### Start Llama Stack server pointing to Ollama server
-
-**Via Docker**
-```
-docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack/distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
-```
-
-Make sure in you `run.yaml` file, you inference provider is pointing to the correct Ollama endpoint. E.g.
-```
-inference:
-  - provider_id: ollama0
-    provider_type: remote::ollama
-    config:
-      url: http://127.0.0.1:14343
-```
-
-**Via Conda**
-
-```
-llama stack build --template ollama --image-type conda
-llama stack run ./gpu/run.yaml
-```
-
-### Model Serving
-
-To serve a new model with `ollama`
-```
-ollama run <model_name>
-```
-
-To make sure that the model is being served correctly, run `ollama ps` to get a list of models being served by ollama.
-```
-$ ollama ps
-
-NAME                         ID              SIZE     PROCESSOR    UNTIL
-llama3.1:8b-instruct-fp16    4aacac419454    17 GB    100% GPU     4 minutes from now
-```
-
-To verify that the model served by ollama is correctly connected to Llama Stack server
-```
-$ llama-stack-client models list
-+----------------------+----------------------+---------------+-----------------------------------------------+
-| identifier           | llama_model          | provider_id   | metadata                                      |
-+======================+======================+===============+===============================================+
-| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | ollama0       | {'ollama_model': 'llama3.1:8b-instruct-fp16'} |
-+----------------------+----------------------+---------------+-----------------------------------------------+
-```
--- a/distributions/ollama/compose.yaml
+++ b/distributions/ollama/compose.yaml
@ -0,0 +1,71 @@
+services:
+  ollama:
+    image: ollama/ollama:latest
+    network_mode: ${NETWORK_MODE:-bridge}
+    volumes:
+      - ~/.ollama:/root/.ollama
+    ports:
+      - "11434:11434"
+    environment:
+      OLLAMA_DEBUG: 1
+    command: []
+    deploy:
+      resources:
+        limits:
+          memory: 8G    # Set maximum memory
+        reservations:
+          memory: 8G    # Set minimum memory reservation
+    # healthcheck:
+    #   # ugh, no CURL in ollama image
+    #   test: ["CMD", "curl", "-f", "http://ollama:11434"]
+    #   interval: 10s
+    #   timeout: 5s
+    #   retries: 5
+
+  ollama-init:
+    image: ollama/ollama:latest
+    depends_on:
+      - ollama
+        # condition: service_healthy
+    network_mode: ${NETWORK_MODE:-bridge}
+    environment:
+      - OLLAMA_HOST=ollama
+      - INFERENCE_MODEL=${INFERENCE_MODEL}
+      - SAFETY_MODEL=${SAFETY_MODEL:-}
+    volumes:
+      - ~/.ollama:/root/.ollama
+      - ./pull-models.sh:/pull-models.sh
+    entrypoint: ["/pull-models.sh"]
+
+  llamastack:
+    depends_on:
+      ollama:
+        condition: service_started
+      ollama-init:
+        condition: service_started
+    image: ${LLAMA_STACK_IMAGE:-llamastack/distribution-ollama}
+    network_mode: ${NETWORK_MODE:-bridge}
+    volumes:
+      - ~/.llama:/root/.llama
+      # Link to ollama run.yaml file
+      - ~/local/llama-stack/:/app/llama-stack-source
+      - ./run${SAFETY_MODEL:+-with-safety}.yaml:/root/my-run.yaml
+    ports:
+      - "${LLAMA_STACK_PORT:-5001}:${LLAMA_STACK_PORT:-5001}"
+    environment:
+      - INFERENCE_MODEL=${INFERENCE_MODEL}
+      - SAFETY_MODEL=${SAFETY_MODEL:-}
+      - OLLAMA_URL=http://ollama:11434
+    entrypoint: >
+        python -m llama_stack.distribution.server.server /root/my-run.yaml \
+        --port ${LLAMA_STACK_PORT:-5001}
+    deploy:
+      restart_policy:
+        condition: on-failure
+        delay: 10s
+        max_attempts: 3
+        window: 60s
+volumes:
+  ollama:
+  ollama-init:
+  llamastack:
--- a/distributions/ollama/cpu/compose.yaml
+++ b/distributions/ollama/cpu/compose.yaml
@ -1,30 +0,0 @@
-services:
-  ollama:
-    image: ollama/ollama:latest
-    network_mode: "host"
-    volumes:
-      - ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
-    ports:
-      - "11434:11434"
-    command: []
-  llamastack:
-    depends_on:
-    - ollama
-    image: llamastack/distribution-ollama
-    network_mode: "host"
-    volumes:
-      - ~/.llama:/root/.llama
-      # Link to ollama run.yaml file
-      - ./run.yaml:/root/my-run.yaml
-    ports:
-      - "5000:5000"
-    # Hack: wait for ollama server to start before starting docker
-    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
-    deploy:
-      restart_policy:
-        condition: on-failure
-        delay: 3s
-        max_attempts: 5
-        window: 60s
-volumes:
-  ollama:
--- a/distributions/ollama/cpu/run.yaml
+++ b/distributions/ollama/cpu/run.yaml
@ -1,46 +0,0 @@
-version: '2'
-built_at: '2024-10-08T17:40:45.325529'
-image_name: local
-docker_image: null
-conda_env: local
-apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
-providers:
-  inference:
-  - provider_id: ollama0
-    provider_type: remote::ollama
-    config:
-      url: http://127.0.0.1:14343
-  safety:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
-  memory:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
-  agents:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      persistence_store:
-        namespace: null
-        type: sqlite
-        db_path: ~/.llama/runtime/kvstore.db
-  telemetry:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
--- a/distributions/ollama/gpu/run.yaml
+++ b/distributions/ollama/gpu/run.yaml
@ -1,46 +0,0 @@
-version: '2'
-built_at: '2024-10-08T17:40:45.325529'
-image_name: local
-docker_image: null
-conda_env: local
-apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
-providers:
-  inference:
-  - provider_id: ollama0
-    provider_type: remote::ollama
-    config:
-      url: http://127.0.0.1:14343
-  safety:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
-  memory:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
-  agents:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      persistence_store:
-        namespace: null
-        type: sqlite
-        db_path: ~/.llama/runtime/kvstore.db
-  telemetry:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
--- a/distributions/ollama/pull-models.sh
+++ b/distributions/ollama/pull-models.sh
@ -0,0 +1,18 @@
+#!/bin/sh
+
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+echo "Preloading (${INFERENCE_MODEL}, ${SAFETY_MODEL})..."
+for model in ${INFERENCE_MODEL} ${SAFETY_MODEL}; do
+  echo "Preloading $model..."
+  if ! ollama run "$model"; then
+    echo "Failed to pull and run $model"
+    exit 1
+  fi
+done
+
+echo "All models pulled successfully"
--- a/distributions/ollama/run-with-safety.yaml
+++ b/distributions/ollama/run-with-safety.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/ollama/run-with-safety.yaml
--- a/distributions/ollama/run.yaml
+++ b/distributions/ollama/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/ollama/run.yaml
--- a/distributions/remote-nvidia/build.yaml
+++ b/distributions/remote-nvidia/build.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/nvidia/build.yaml
--- a/distributions/remote-nvidia/compose.yaml
+++ b/distributions/remote-nvidia/compose.yaml
@ -0,0 +1,19 @@
+services:
+  llamastack:
+    image: distribution-nvidia:dev
+    network_mode: "host"
+    volumes:
+      - ~/.llama:/root/.llama
+      - ./run.yaml:/root/llamastack-run-nvidia.yaml
+    ports:
+      - "8321:8321"
+    environment:
+      - INFERENCE_MODEL=${INFERENCE_MODEL:-Llama3.1-8B-Instruct}
+      - NVIDIA_API_KEY=${NVIDIA_API_KEY:-}
+    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml-config /root/llamastack-run-nvidia.yaml"
+    deploy:
+      restart_policy:
+        condition: on-failure
+        delay: 3s
+        max_attempts: 5
+        window: 60s
--- a/distributions/remote-nvidia/run.yaml
+++ b/distributions/remote-nvidia/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/nvidia/run.yaml
--- a/distributions/remote-vllm/build.yaml
+++ b/distributions/remote-vllm/build.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/remote-vllm/build.yaml
--- a/distributions/remote-vllm/compose.yaml
+++ b/distributions/remote-vllm/compose.yaml
@ -0,0 +1,100 @@
+services:
+  vllm-inference:
+    image: vllm/vllm-openai:latest
+    volumes:
+      - $HOME/.cache/huggingface:/root/.cache/huggingface
+    network_mode: ${NETWORK_MODE:-bridge}
+    ports:
+       - "${VLLM_INFERENCE_PORT:-5100}:${VLLM_INFERENCE_PORT:-5100}"
+    devices:
+      - nvidia.com/gpu=all
+    environment:
+      - CUDA_VISIBLE_DEVICES=${VLLM_INFERENCE_GPU:-0}
+      - HUGGING_FACE_HUB_TOKEN=$HF_TOKEN
+    command: >
+      --gpu-memory-utilization 0.75
+      --model ${VLLM_INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}
+      --enforce-eager
+      --max-model-len 8192
+      --max-num-seqs 16
+      --port ${VLLM_INFERENCE_PORT:-5100}
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:${VLLM_INFERENCE_PORT:-5100}/v1/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 5
+    deploy:
+      resources:
+        reservations:
+          devices:
+          - driver: nvidia
+            capabilities: [gpu]
+    runtime: nvidia
+
+  # A little trick:
+  # if VLLM_SAFETY_MODEL is set, we will create a service for the safety model
+  # otherwise, the entry will end in a hyphen which gets ignored by docker compose
+  vllm-${VLLM_SAFETY_MODEL:+safety}:
+    image: vllm/vllm-openai:latest
+    volumes:
+      - $HOME/.cache/huggingface:/root/.cache/huggingface
+    network_mode: ${NETWORK_MODE:-bridge}
+    ports:
+      - "${VLLM_SAFETY_PORT:-5101}:${VLLM_SAFETY_PORT:-5101}"
+    devices:
+      - nvidia.com/gpu=all
+    environment:
+      - CUDA_VISIBLE_DEVICES=${VLLM_SAFETY_GPU:-1}
+      - HUGGING_FACE_HUB_TOKEN=$HF_TOKEN
+    command: >
+      --gpu-memory-utilization 0.75
+      --model ${VLLM_SAFETY_MODEL}
+      --enforce-eager
+      --max-model-len 8192
+      --max-num-seqs 16
+      --port ${VLLM_SAFETY_PORT:-5101}
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://localhost:${VLLM_SAFETY_PORT:-5101}/v1/health"]
+      interval: 30s
+      timeout: 10s
+      retries: 5
+    deploy:
+      resources:
+        reservations:
+          devices:
+          - driver: nvidia
+            capabilities: [gpu]
+    runtime: nvidia
+  llamastack:
+    depends_on:
+      vllm-inference:
+        condition: service_healthy
+      vllm-${VLLM_SAFETY_MODEL:+safety}:
+        condition: service_healthy
+    # image: llamastack/distribution-remote-vllm
+    image: llamastack/distribution-remote-vllm:test-0.0.52rc3
+    volumes:
+      - ~/.llama:/root/.llama
+      - ./run${VLLM_SAFETY_MODEL:+-with-safety}.yaml:/root/llamastack-run-remote-vllm.yaml
+    network_mode: ${NETWORK_MODE:-bridge}
+    environment:
+      - VLLM_URL=http://vllm-inference:${VLLM_INFERENCE_PORT:-5100}/v1
+      - VLLM_SAFETY_URL=http://vllm-safety:${VLLM_SAFETY_PORT:-5101}/v1
+      - INFERENCE_MODEL=${INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}
+      - MAX_TOKENS=${MAX_TOKENS:-4096}
+      - SQLITE_STORE_DIR=${SQLITE_STORE_DIR:-$HOME/.llama/distributions/remote-vllm}
+      - SAFETY_MODEL=${SAFETY_MODEL:-meta-llama/Llama-Guard-3-1B}
+    ports:
+      - "${LLAMA_STACK_PORT:-5001}:${LLAMA_STACK_PORT:-5001}"
+    # Hack: wait for the vLLM server to start before starting the Llama Stack server
+    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-remote-vllm.yaml --port ${LLAMA_STACK_PORT:-5001}"
+    deploy:
+      restart_policy:
+        condition: on-failure
+        delay: 3s
+        max_attempts: 5
+        window: 60s
+volumes:
+  vllm-inference:
+  vllm-safety:
+  llamastack:
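
Reviewer note on the "little trick" in the compose file above: it relies on two shell-style parameter expansions, `${VAR:+word}` (expands to `word` only when `VAR` is set and non-empty) and `${VAR:-default}` (falls back to `default` when `VAR` is unset or empty). The sketch below reproduces both in plain POSIX sh so the resulting file names and ports can be checked by hand. The function names `run_config_name` and `llama_stack_port` are hypothetical and not part of this commit, and the sketch says nothing about where Compose itself applies interpolation.

```shell
#!/bin/sh
# Hypothetical helpers (not part of this commit) that mirror the expansions
# used in the compose files.

run_config_name() {
  # $1: value of the safety model variable (may be empty)
  safety_model=$1
  # ${var:+word} yields "word" only when var is set and non-empty,
  # so an empty safety model selects plain run.yaml.
  printf 'run%s.yaml\n' "${safety_model:+-with-safety}"
}

llama_stack_port() {
  # $1: requested port (may be empty)
  port=$1
  # ${var:-default} yields "default" when var is unset or empty.
  printf '%s\n' "${port:-5001}"
}

run_config_name ""                             # prints run.yaml
run_config_name "meta-llama/Llama-Guard-3-1B"  # prints run-with-safety.yaml
llama_stack_port ""                            # prints 5001
```

This matches the volume mapping `./run${VLLM_SAFETY_MODEL:+-with-safety}.yaml`: leaving the safety model unset mounts `run.yaml`, and setting it mounts `run-with-safety.yaml`.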
--- a/distributions/remote-vllm/run-with-safety.yaml
+++ b/distributions/remote-vllm/run-with-safety.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/remote-vllm/run-with-safety.yaml
--- a/distributions/remote-vllm/run.yaml
+++ b/distributions/remote-vllm/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/remote-vllm/run.yaml
--- a/llama_stack/templates/vllm/build.yaml
+++ b/llama_stack/templates/vllm/build.yaml
@ -1,8 +1,8 @@
-name: vllm
+name: runpod
 distribution_spec:
-  description: Like local, but use vLLM for running LLM inference
+  description: Use Runpod for running LLM inference
  providers:
-    inference: vllm
+    inference: remote::runpod
    memory: meta-reference
    safety: meta-reference
    agents: meta-reference
--- a/distributions/sambanova/build.yaml
+++ b/distributions/sambanova/build.yaml
@ -0,0 +1,19 @@
+version: '2'
+name: sambanova
+distribution_spec:
+  description: Use SambaNova.AI for running LLM inference
+  docker_image: null
+  providers:
+    inference:
+    - remote::sambanova
+    memory:
+    - inline::faiss
+    - remote::chromadb
+    - remote::pgvector
+    safety:
+    - inline::llama-guard
+    agents:
+    - inline::meta-reference
+    telemetry:
+    - inline::meta-reference
+image_type: conda
--- a/distributions/sambanova/compose.yaml
+++ b/distributions/sambanova/compose.yaml
@ -0,0 +1,16 @@
+services:
+  llamastack:
+    image: llamastack/distribution-sambanova
+    network_mode: "host"
+    volumes:
+      - ~/.llama:/root/.llama
+      - ./run.yaml:/root/llamastack-run-sambanova.yaml
+    ports:
+      - "5000:5000"
+    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-sambanova.yaml"
+    deploy:
+      restart_policy:
+        condition: on-failure
+        delay: 3s
+        max_attempts: 5
+        window: 60s
--- a/distributions/sambanova/run.yaml
+++ b/distributions/sambanova/run.yaml
@ -0,0 +1,83 @@
+version: '2'
+image_name: sambanova
+docker_image: null
+conda_env: sambanova
+apis:
+- agents
+- inference
+- memory
+- safety
+- telemetry
+providers:
+  inference:
+  - provider_id: sambanova
+    provider_type: remote::sambanova
+    config:
+      url: https://api.sambanova.ai/v1/
+      api_key: ${env.SAMBANOVA_API_KEY}
+  memory:
+  - provider_id: faiss
+    provider_type: inline::faiss
+    config:
+      kvstore:
+        type: sqlite
+        namespace: null
+        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/sambanova}/faiss_store.db
+  safety:
+  - provider_id: llama-guard
+    provider_type: inline::llama-guard
+    config: {}
+  agents:
+  - provider_id: meta-reference
+    provider_type: inline::meta-reference
+    config:
+      persistence_store:
+        type: sqlite
+        namespace: null
+        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/sambanova}/agents_store.db
+  telemetry:
+  - provider_id: meta-reference
+    provider_type: inline::meta-reference
+    config: {}
+metadata_store:
+  namespace: null
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/sambanova}/registry.db
+models:
+- metadata: {}
+  model_id: meta-llama/Llama-3.1-8B-Instruct
+  provider_id: null
+  provider_model_id: Meta-Llama-3.1-8B-Instruct
+- metadata: {}
+  model_id: meta-llama/Llama-3.1-70B-Instruct
+  provider_id: null
+  provider_model_id: Meta-Llama-3.1-70B-Instruct
+- metadata: {}
+  model_id: meta-llama/Llama-3.1-405B-Instruct
+  provider_id: null
+  provider_model_id: Meta-Llama-3.1-405B-Instruct
+- metadata: {}
+  model_id: meta-llama/Llama-3.2-1B-Instruct
+  provider_id: null
+  provider_model_id: Meta-Llama-3.2-1B-Instruct
+- metadata: {}
+  model_id: meta-llama/Llama-3.2-3B-Instruct
+  provider_id: null
+  provider_model_id: Meta-Llama-3.2-3B-Instruct
+- metadata: {}
+  model_id: meta-llama/Llama-3.2-11B-Vision-Instruct
+  provider_id: null
+  provider_model_id: Llama-3.2-11B-Vision-Instruct
+- metadata: {}
+  model_id: meta-llama/Llama-3.2-90B-Vision-Instruct
+  provider_id: null
+  provider_model_id: Llama-3.2-90B-Vision-Instruct
+shields:
+- params: null
+  shield_id: meta-llama/Llama-Guard-3-8B
+  provider_id: null
+  provider_shield_id: null
+memory_banks: []
+datasets: []
+scoring_fns: []
+eval_tasks: []
--- a/distributions/tgi/README.md
+++ b/distributions/tgi/README.md
@ -1,117 +0,0 @@
-# TGI Distribution
-
-The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
-
-
-| **API**         	| **Inference** 	| **Agents**     	| **Memory**                                       	| **Safety**     	| **Telemetry**  	|
-|-----------------	|---------------	|----------------	|--------------------------------------------------	|----------------	|----------------	|
-| **Provider(s)** 	| remote::tgi   	| meta-reference 	| meta-reference, remote::pgvector, remote::chroma 	| meta-reference 	| meta-reference 	|
-
-
-### Start the Distribution (Single Node GPU)
-
-> [!NOTE]
-> This assumes you have access to GPU to start a TGI server with access to your GPU.
-
-
-```
-$ cd distributions/tgi/gpu
-$ ls
-compose.yaml  tgi-run.yaml
-$ docker compose up
-```
-
-The script will first start up TGI server, then start up Llama Stack distribution server hooking up to the remote TGI provider for inference. You should be able to see the following outputs --
-```
-[text-generation-inference] | 2024-10-15T18:56:33.810397Z  INFO text_generation_router::server: router/src/server.rs:1813: Using config Some(Llama)
-[text-generation-inference] | 2024-10-15T18:56:33.810448Z  WARN text_generation_router::server: router/src/server.rs:1960: Invalid hostname, defaulting to 0.0.0.0
-[text-generation-inference] | 2024-10-15T18:56:33.864143Z  INFO text_generation_router::server: router/src/server.rs:2353: Connected
-INFO:     Started server process [1]
-INFO:     Waiting for application startup.
-INFO:     Application startup complete.
-INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-```
-
-To kill the server
-```
-docker compose down
-```
-
-### Start the Distribution (Single Node CPU)
-
-> [!NOTE]
-> This assumes you have an hosted endpoint compatible with TGI server.
-
-```
-$ cd distributions/tgi/cpu
-$ ls
-compose.yaml  run.yaml
-$ docker compose up
-```
-
-Replace <ENTER_YOUR_TGI_HOSTED_ENDPOINT> in `run.yaml` file with your TGI endpoint.
-```
-inference:
-  - provider_id: tgi0
-    provider_type: remote::tgi
-    config:
-      url: <ENTER_YOUR_TGI_HOSTED_ENDPOINT>
-```
-
-### (Alternative) TGI server + llama stack run (Single Node GPU)
-
-If you wish to separately spin up a TGI server, and connect with Llama Stack, you may use the following commands.
-
-#### (optional) Start TGI server locally
- Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint.
-
-```
-docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.1-8B-Instruct --port 5009
-```
-
-
-#### Start Llama Stack server pointing to TGI server
-
-```
-docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-tgi --yaml_config /root/my-run.yaml
-```
-
-Make sure in you `run.yaml` file, you inference provider is pointing to the correct TGI server endpoint. E.g.
-```
-inference:
-  - provider_id: tgi0
-    provider_type: remote::tgi
-    config:
-      url: http://127.0.0.1:5009
-```
-
-**Via Conda**
-
-```bash
-llama stack build --template tgi --image-type conda
-# -- start a TGI server endpoint
-llama stack run ./gpu/run.yaml
-```
-
-### Model Serving
-To serve a new model with `tgi`, change the docker command flag `--model-id <model-to-serve>`.
-
-This can be done by edit the `command` args in `compose.yaml`. E.g. Replace "Llama-3.2-1B-Instruct" with the model you want to serve.
-
-```
-command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.2-1B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
-```
-
-or by changing the docker run command's `--model-id` flag
-```
-docker run --rm -it -v $HOME/.cache/huggingface:/data -p 5009:5009 --gpus all ghcr.io/huggingface/text-generation-inference:latest --dtype bfloat16 --usage-stats on --sharded false --model-id meta-llama/Llama-3.2-1B-Instruct --port 5009
-```
-
-In `run.yaml`, make sure you point the correct server endpoint to the TGI server endpoint serving your model.
-```
-inference:
-  - provider_id: tgi0
-    provider_type: remote::tgi
-    config:
-      url: http://127.0.0.1:5009
-```
--- a/distributions/tgi/compose.yaml
+++ b/distributions/tgi/compose.yaml
@ -0,0 +1,103 @@
+services:
+  tgi-inference:
+    image: ghcr.io/huggingface/text-generation-inference:latest
+    volumes:
+      - $HOME/.cache/huggingface:/data
+    network_mode: ${NETWORK_MODE:-bridge}
+    ports:
+       - "${TGI_INFERENCE_PORT:-8080}:${TGI_INFERENCE_PORT:-8080}"
+    devices:
+      - nvidia.com/gpu=all
+    environment:
+      - CUDA_VISIBLE_DEVICES=${TGI_INFERENCE_GPU:-0}
+      - HF_TOKEN=$HF_TOKEN
+      - HF_HOME=/data
+      - HF_DATASETS_CACHE=/data
+      - HF_MODULES_CACHE=/data
+      - HF_HUB_CACHE=/data
+    command: >
+      --dtype bfloat16
+      --usage-stats off
+      --sharded false
+      --model-id ${TGI_INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}
+      --port ${TGI_INFERENCE_PORT:-8080}
+      --cuda-memory-fraction 0.75
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://tgi-inference:${TGI_INFERENCE_PORT:-8080}/health"]
+      interval: 5s
+      timeout: 5s
+      retries: 30
+    deploy:
+      resources:
+        reservations:
+          devices:
+          - driver: nvidia
+            capabilities: [gpu]
+    runtime: nvidia
+
+  tgi-${TGI_SAFETY_MODEL:+safety}:
+    image: ghcr.io/huggingface/text-generation-inference:latest
+    volumes:
+      - $HOME/.cache/huggingface:/data
+    network_mode: ${NETWORK_MODE:-bridge}
+    ports:
+       - "${TGI_SAFETY_PORT:-8081}:${TGI_SAFETY_PORT:-8081}"
+    devices:
+      - nvidia.com/gpu=all
+    environment:
+      - CUDA_VISIBLE_DEVICES=${TGI_SAFETY_GPU:-1}
+      - HF_TOKEN=$HF_TOKEN
+      - HF_HOME=/data
+      - HF_DATASETS_CACHE=/data
+      - HF_MODULES_CACHE=/data
+      - HF_HUB_CACHE=/data
+    command: >
+      --dtype bfloat16
+      --usage-stats off
+      --sharded false
+      --model-id ${TGI_SAFETY_MODEL:-meta-llama/Llama-Guard-3-1B}
+      --port ${TGI_SAFETY_PORT:-8081}
+      --cuda-memory-fraction 0.75
+    healthcheck:
+      test: ["CMD", "curl", "-f", "http://tgi-safety:${TGI_SAFETY_PORT:-8081}/health"]
+      interval: 5s
+      timeout: 5s
+      retries: 30
+    deploy:
+      resources:
+        reservations:
+          devices:
+          - driver: nvidia
+            capabilities: [gpu]
+    runtime: nvidia
+
+  llamastack:
+    depends_on:
+      tgi-inference:
+        condition: service_healthy
+      tgi-${TGI_SAFETY_MODEL:+safety}:
+        condition: service_healthy
+    image: llamastack/distribution-tgi:test-0.0.52rc3
+    network_mode: ${NETWORK_MODE:-bridge}
+    volumes:
+      - ~/.llama:/root/.llama
+      - ./run${TGI_SAFETY_MODEL:+-with-safety}.yaml:/root/my-run.yaml
+    ports:
+      - "${LLAMA_STACK_PORT:-5001}:${LLAMA_STACK_PORT:-5001}"
+    # Hack: wait for the TGI server to start before starting the Llama Stack server
+    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
+    restart_policy:
+      condition: on-failure
+      delay: 3s
+      max_attempts: 5
+      window: 60s
+    environment:
+      - TGI_URL=http://tgi-inference:${TGI_INFERENCE_PORT:-8080}
+      - SAFETY_TGI_URL=http://tgi-safety:${TGI_SAFETY_PORT:-8081}
+      - INFERENCE_MODEL=${INFERENCE_MODEL:-meta-llama/Llama-3.2-3B-Instruct}
+      - SAFETY_MODEL=${SAFETY_MODEL:-meta-llama/Llama-Guard-3-1B}
+
+volumes:
+  tgi-inference:
+  tgi-safety:
+  llamastack:
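
Reviewer note on the `sleep 60` hack in the compose file above: a fixed sleep either waits too long or not long enough. One possible alternative, sketched here under assumptions, is to poll a probe command with a bounded number of retries, mirroring the healthcheck settings already present (`interval: 5s`, `retries: 30`). The `wait_for` function is hypothetical and not part of this commit. The commented usage line assumes `curl` is available in the Llama Stack image, which the diff does not confirm.

```shell
#!/bin/sh
# Hypothetical helper (not part of this commit): run a probe command until it
# succeeds or the attempt budget is exhausted.
wait_for() {
  attempts=$1   # maximum number of probe attempts
  delay=$2      # seconds to sleep between attempts
  shift 2       # remaining arguments form the probe command
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0  # probe succeeded
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1      # probe never succeeded
}

# Possible usage in an entrypoint (assumes curl exists in the image):
# wait_for 30 5 curl -sf "http://tgi-inference:8080/health" || exit 1
```

Since `depends_on` with `condition: service_healthy` already gates startup on the TGI healthcheck, a loop like this would mainly matter when the server is launched outside Compose.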
--- a/distributions/tgi/cpu/compose.yaml
+++ b/distributions/tgi/cpu/compose.yaml
@ -1,33 +0,0 @@
-services:
-  text-generation-inference:
-    image: ghcr.io/huggingface/text-generation-inference:latest
-    network_mode: "host"
-    volumes:
-      - $HOME/.cache/huggingface:/data
-    ports:
-      - "5009:5009"
-    command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.1-8B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
-    runtime: nvidia
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://text-generation-inference:5009/health"]
-      interval: 5s
-      timeout: 5s
-      retries: 30
-  llamastack:
-    depends_on:
-      text-generation-inference:
-        condition: service_healthy
-    image: llamastack/llamastack-local-cpu
-    network_mode: "host"
-    volumes:
-      - ~/.llama:/root/.llama
-      # Link to run.yaml file
-      - ./run.yaml:/root/my-run.yaml
-    ports:
-      - "5000:5000"
-    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
-    restart_policy:
-      condition: on-failure
-      delay: 3s
-      max_attempts: 5
-      window: 60s
--- a/distributions/tgi/cpu/run.yaml
+++ b/distributions/tgi/cpu/run.yaml
@ -1,46 +0,0 @@
-version: '2'
-built_at: '2024-10-08T17:40:45.325529'
-image_name: local
-docker_image: null
-conda_env: local
-apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
-providers:
-  inference:
-  - provider_id: tgi0
-    provider_type: remote::tgi
-    config:
-      url: <ENTER_YOUR_TGI_HOSTED_ENDPOINT>
-  safety:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
-  memory:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
-  agents:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      persistence_store:
-        namespace: null
-        type: sqlite
-        db_path: ~/.llama/runtime/kvstore.db
-  telemetry:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
--- a/distributions/tgi/gpu/compose.yaml
+++ b/distributions/tgi/gpu/compose.yaml
@ -1,55 +0,0 @@
-services:
-  text-generation-inference:
-    image: ghcr.io/huggingface/text-generation-inference:latest
-    network_mode: "host"
-    volumes:
-      - $HOME/.cache/huggingface:/data
-    ports:
-      - "5009:5009"
-    devices:
-      - nvidia.com/gpu=all
-    environment:
-      - CUDA_VISIBLE_DEVICES=0
-      - HF_HOME=/data
-      - HF_DATASETS_CACHE=/data
-      - HF_MODULES_CACHE=/data
-      - HF_HUB_CACHE=/data
-    command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.1-8B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
-    deploy:
-      resources:
-        reservations:
-          devices:
-          - driver: nvidia
-            # that's the closest analogue to --gpus; provide
-            # an integer amount of devices or 'all'
-            count: 1
-            # Devices are reserved using a list of capabilities, making
-            # capabilities the only required field. A device MUST
-            # satisfy all the requested capabilities for a successful
-            # reservation.
-            capabilities: [gpu]
-    runtime: nvidia
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://text-generation-inference:5009/health"]
-      interval: 5s
-      timeout: 5s
-      retries: 30
-  llamastack:
-    depends_on:
-      text-generation-inference:
-        condition: service_healthy
-    image: llamastack/distribution-tgi
-    network_mode: "host"
-    volumes:
-      - ~/.llama:/root/.llama
-      # Link to TGI run.yaml file
-      - ./run.yaml:/root/my-run.yaml
-    ports:
-      - "5000:5000"
-    # Hack: wait for TGI server to start before starting docker
-    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
-    restart_policy:
-      condition: on-failure
-      delay: 3s
-      max_attempts: 5
-      window: 60s
--- a/distributions/tgi/gpu/run.yaml
+++ b/distributions/tgi/gpu/run.yaml
@ -1,46 +0,0 @@
-version: '2'
-built_at: '2024-10-08T17:40:45.325529'
-image_name: local
-docker_image: null
-conda_env: local
-apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
-providers:
-  inference:
-  - provider_id: tgi0
-    provider_type: remote::tgi
-    config:
-      url: http://127.0.0.1:5009
-  safety:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
-  memory:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
-  agents:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      persistence_store:
-        namespace: null
-        type: sqlite
-        db_path: ~/.llama/runtime/kvstore.db
-  telemetry:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
--- a/distributions/tgi/run-with-safety.yaml
+++ b/distributions/tgi/run-with-safety.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/tgi/run-with-safety.yaml
--- a/distributions/tgi/run.yaml
+++ b/distributions/tgi/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/tgi/run.yaml
--- a/distributions/together/README.md
+++ b/distributions/together/README.md
@ -11,7 +11,7 @@ The `llamastack/distribution-together` distribution consists of the following pr
 | **Provider(s)** 	| remote::together   	| meta-reference 	| meta-reference, remote::weaviate 	| meta-reference 	| meta-reference 	|


-### Start the Distribution (Single Node CPU)
+### Docker: Start the Distribution (Single Node CPU)

 > [!NOTE]
 > This assumes you have an hosted endpoint at Together with API Key.
@ -33,23 +33,7 @@ inference:
      api_key: <optional api key>
 ```

-### (Alternative) llama stack run (Single Node CPU)
-
-```
-docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-together --yaml_config /root/my-run.yaml
-```
-
-Make sure in you `run.yaml` file, you inference provider is pointing to the correct Together URL server endpoint. E.g.
-```
-inference:
-  - provider_id: together
-    provider_type: remote::together
-    config:
-      url: https://api.together.xyz/v1
-      api_key: <optional api key>
-```
-
-**Via Conda**
+### Conda llama stack run (Single Node CPU)

 ```bash
 llama stack build --template together --image-type conda
@ -57,7 +41,7 @@ llama stack build --template together --image-type conda
 llama stack run ./run.yaml
 ```

-### Model Serving
+### (Optional) Update Model Serving Configuration

 Use `llama-stack-client models list` to check the available models served by together.

--- a/distributions/together/compose.yaml
+++ b/distributions/together/compose.yaml
@ -1,13 +1,11 @@
 services:
  llamastack:
    image: llamastack/distribution-together
-    network_mode: "host"
-    volumes:
-      - ~/.llama:/root/.llama
-      - ./run.yaml:/root/llamastack-run-together.yaml
    ports:
-      - "5000:5000"
-    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-together.yaml"
+      - "8321:8321"
+    environment:
+      - TOGETHER_API_KEY=${TOGETHER_API_KEY}
+    entrypoint: bash -c "python -m llama_stack.distribution.server.server --template together"
    deploy:
      restart_policy:
        condition: on-failure
--- a/distributions/together/run.yaml
+++ b/distributions/together/run.yaml
@ -1,47 +0,0 @@
-version: '2'
-built_at: '2024-10-08T17:40:45.325529'
-image_name: local
-docker_image: null
-conda_env: local
-apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
-providers:
-  inference:
-  - provider_id: together0
-    provider_type: remote::together
-    config:
-      url: https://api.together.xyz/v1
-      # api_key: <ENTER_YOUR_API_KEY>
-  safety:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      llama_guard_shield:
-        model: Llama-Guard-3-1B
-        excluded_categories: []
-        disable_input_check: false
-        disable_output_check: false
-      prompt_guard_shield:
-        model: Prompt-Guard-86M
-  memory:
-  - provider_id: meta0
-    provider_type: remote::weaviate
-    config: {}
-  agents:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config:
-      persistence_store:
-        namespace: null
-        type: sqlite
-        db_path: ~/.llama/runtime/kvstore.db
-  telemetry:
-  - provider_id: meta0
-    provider_type: meta-reference
-    config: {}
--- a/distributions/together/run.yaml
+++ b/distributions/together/run.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/together/run.yaml
--- a/distributions/vllm-gpu/build.yaml
+++ b/distributions/vllm-gpu/build.yaml
@ -0,0 +1 @@
+../../llama_stack/templates/inline-vllm/build.yaml
--- a/distributions/ollama/gpu/compose.yaml
+++ b/distributions/ollama/gpu/compose.yaml
@ -1,11 +1,12 @@
 services:
-  ollama:
-    image: ollama/ollama:latest
+  llamastack:
+    image: llamastack/distribution-inline-vllm
    network_mode: "host"
    volumes:
-      - ollama:/root/.ollama # this solution synchronizes with the docker volume and loads the model rocket fast
+      - ~/.llama:/root/.llama
+      - ./run.yaml:/root/my-run.yaml
    ports:
-      - "11434:11434"
+      - "8321:8321"
    devices:
      - nvidia.com/gpu=all
    environment:
@ -25,24 +26,10 @@ services:
            # reservation.
            capabilities: [gpu]
    runtime: nvidia
-  llamastack:
-    depends_on:
-    - ollama
-    image: llamastack/distribution-ollama
-    network_mode: "host"
-    volumes:
-      - ~/.llama:/root/.llama
-      # Link to ollama run.yaml file
-      - ./run.yaml:/root/llamastack-run-ollama.yaml
-    ports:
-      - "5000:5000"
-    # Hack: wait for ollama server to start before starting docker
-    entrypoint: bash -c "sleep 60; python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-ollama.yaml"
+    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/my-run.yaml"
    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s
-volumes:
-  ollama:
--- a/distributions/vllm-gpu/run.yaml
+++ b/distributions/vllm-gpu/run.yaml
@ -0,0 +1,66 @@
+version: '2'
+image_name: local
+container_image: null
+conda_env: local
+apis:
+- shields
+- agents
+- models
+- memory
+- memory_banks
+- inference
+- safety
+providers:
+  inference:
+  - provider_id: vllm-inference
+    provider_type: inline::vllm
+    config:
+      model: Llama3.2-3B-Instruct
+      tensor_parallel_size: 1
+      gpu_memory_utilization: 0.4
+      enforce_eager: true
+      max_tokens: 4096
+  - provider_id: vllm-inference-safety
+    provider_type: inline::vllm
+    config:
+      model: Llama-Guard-3-1B
+      tensor_parallel_size: 1
+      gpu_memory_utilization: 0.2
+      enforce_eager: true
+      max_tokens: 4096
+  safety:
+  - provider_id: meta0
+    provider_type: inline::llama-guard
+    config:
+      model: Llama-Guard-3-1B
+      excluded_categories: []
+  # Uncomment to use prompt guard
+  # - provider_id: meta1
+  #   provider_type: inline::prompt-guard
+  #   config:
+  #     model: Prompt-Guard-86M
+  memory:
+  - provider_id: meta0
+    provider_type: inline::meta-reference
+    config: {}
+  # Uncomment to use pgvector
+  # - provider_id: pgvector
+  #   provider_type: remote::pgvector
+  #   config:
+  #     host: 127.0.0.1
+  #     port: 5432
+  #     db: postgres
+  #     user: postgres
+  #     password: mysecretpassword
+  agents:
+  - provider_id: meta0
+    provider_type: inline::meta-reference
+    config:
+      persistence_store:
+        namespace: null
+        type: sqlite
+        db_path: ~/.llama/runtime/agents_store.db
+  telemetry:
+  - provider_id: meta0
+    provider_type: inline::meta-reference
+    config: {}
--- a/distributions/vllm/build.yaml
+++ b/distributions/vllm/build.yaml
@ -1 +0,0 @@
-../../llama_stack/templates/vllm/build.yaml
--- a/docs/_static/css/my_theme.css
+++ b/docs/_static/css/my_theme.css
@ -0,0 +1,14 @@
+@import url("theme.css");
+
+.wy-nav-content {
+    max-width: 90%;
+}
+
+.wy-nav-side {
+    /* background: linear-gradient(45deg, #2980B9, #16A085); */
+    background: linear-gradient(90deg, #332735, #1b263c);
+}
+
+.wy-side-nav-search {
+    background-color: transparent !important;
+}
--- a/docs/_static/llama-stack.png
+++ b/docs/_static/llama-stack.png
--- a/docs/_static/remote_or_local.gif
+++ b/docs/_static/remote_or_local.gif
--- a/docs/_static/safety_system.webp
+++ b/docs/_static/safety_system.webp
--- a/docs/building_distro.md
+++ b/docs/building_distro.md
@ -1,270 +0,0 @@
-# Building a Llama Stack Distribution
-
-This guide will walk you through the steps to get started with building a Llama Stack distribution from scratch with your choice of API providers. Please see the [Getting Started Guide](./getting_started.md) if you just want the basic steps to start a Llama Stack distribution.
-
-## Step 1. Build
-In the following steps, imagine we'll be working with a `Meta-Llama3.1-8B-Instruct` model. We will name our build `8b-instruct` to help us remember the config. We will start building our distribution (in the form of a Conda environment or Docker image). In this step, we will specify:
- `name`: the name for our distribution (e.g. `8b-instruct`)
- `image_type`: our build image type (`conda | docker`)
- `distribution_spec`: our distribution specs for specifying API providers
-  - `description`: a short description of the configurations for the distribution
-  - `providers`: specifies the underlying implementation for serving each API endpoint
-  - `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of Docker image or Conda environment.
-
-
-At the end of build command, we will generate `<name>-build.yaml` file storing the build configurations.
-
-After this step is complete, a file named `<name>-build.yaml` will be generated and saved at the output file path specified at the end of the command.
-
-#### Building from scratch
- New users can start by running `llama stack build`, which launches an interactive wizard that prompts you to enter the build configuration.
-```
-llama stack build
-```
-
-Running the command above lets you fill in the configuration for building your Llama Stack distribution. You will see output like the following.
-
-```
-> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): 8b-instruct
-> Enter the image type you want your distribution to be built with (docker or conda): conda
-
- Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
-> Enter the API provider for the inference API: (default=meta-reference): meta-reference
-> Enter the API provider for the safety API: (default=meta-reference): meta-reference
-> Enter the API provider for the agents API: (default=meta-reference): meta-reference
-> Enter the API provider for the memory API: (default=meta-reference): meta-reference
-> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
-
- > (Optional) Enter a short description for your Llama Stack distribution:
-
-Build spec configuration saved at ~/.conda/envs/llamastack-my-local-llama-stack/8b-instruct-build.yaml
-```
-
-**Ollama (optional)**
-
-If you plan to use Ollama for inference, you'll need to install the server [via these instructions](https://ollama.com/download).
-
-
-#### Building from templates
- To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.
-
-The following command will allow you to see the available templates and their corresponding providers.
-```
-llama stack build --list-templates
-```
-
-![alt text](resources/list-templates.png)
-
-You may then pick a template to build your distribution with providers fitted to your liking.
-
-```
-llama stack build --template tgi
-```
-
-```
-$ llama stack build --template tgi
-...
-...
-Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml
-You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/envs/llamastack-tgi/tgi-build.yaml`
-```
-
-#### Building from config file
- In addition to templates, you may customize the build by editing a config file and then building from it with the command shown below.
-
- The config file will have contents like the ones in `llama_stack/templates/`.
-
-```
-$ cat llama_stack/templates/ollama/build.yaml
-
-name: ollama
-distribution_spec:
-  description: Like local, but use ollama for running LLM inference
-  providers:
-    inference: remote::ollama
-    memory: meta-reference
-    safety: meta-reference
-    agents: meta-reference
-    telemetry: meta-reference
-image_type: conda
-```
-
-```
-llama stack build --config llama_stack/templates/ollama/build.yaml
-```
-
-#### How to build distribution with Docker image
-
-> [!TIP]
-> Podman is supported as an alternative to Docker. Set `DOCKER_BINARY` to `podman` in your environment to use Podman.
-
-To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type.
-
-```
-llama stack build --template local --image-type docker
-```
-
-Alternatively, you may use a config file: set `image_type` to `docker` in your `<name>-build.yaml` file and run `llama stack build --config <name>-build.yaml`. The `<name>-build.yaml` file will have contents like:
-
-```
-name: local-docker-example
-distribution_spec:
-  description: Use code from `llama_stack` itself to serve all llama stack APIs
-  docker_image: null
-  providers:
-    inference: meta-reference
-    memory: meta-reference-faiss
-    safety: meta-reference
-    agentic_system: meta-reference
-    telemetry: console
-image_type: docker
-```
-
-The following command allows you to build a Docker image with the name `<name>`
-```
-llama stack build --config <name>-build.yaml
-
-Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
-FROM python:3.10-slim
-WORKDIR /app
-...
-...
-You can run it with: podman run -p 8000:8000 llamastack-docker-local
-Build spec configuration saved at ~/.llama/distributions/docker/docker-local-build.yaml
-```
-
-
-## Step 2. Configure
-After our distribution is built (either as a Docker image or a Conda environment), we will run the following command to configure it:
-```
-llama stack configure [ <docker-image-name> | <path/to/name.build.yaml>]
-```
- For `conda` environments: <path/to/name.build.yaml> would be the generated build spec saved from Step 1.
- For `docker` images downloaded from Dockerhub, you could also use <docker-image-name> as the argument.
-   - Run `docker images` to check the list of available images on your machine.
-
-```
-$ llama stack configure tgi
-
-Configuring API: inference (meta-reference)
-Enter value for model (existing: Meta-Llama3.1-8B-Instruct) (required):
-Enter value for quantization (optional):
-Enter value for torch_seed (optional):
-Enter value for max_seq_len (existing: 4096) (required):
-Enter value for max_batch_size (existing: 1) (required):
-
-Configuring API: memory (meta-reference-faiss)
-
-Configuring API: safety (meta-reference)
-Do you want to configure llama_guard_shield? (y/n): y
-Entering sub-configuration for llama_guard_shield:
-Enter value for model (default: Llama-Guard-3-1B) (required):
-Enter value for excluded_categories (default: []) (required):
-Enter value for disable_input_check (default: False) (required):
-Enter value for disable_output_check (default: False) (required):
-Do you want to configure prompt_guard_shield? (y/n): y
-Entering sub-configuration for prompt_guard_shield:
-Enter value for model (default: Prompt-Guard-86M) (required):
-
-Configuring API: agentic_system (meta-reference)
-Enter value for brave_search_api_key (optional):
-Enter value for bing_search_api_key (optional):
-Enter value for wolfram_api_key (optional):
-
-Configuring API: telemetry (console)
-
-YAML configuration has been written to ~/.llama/builds/conda/tgi-run.yaml
-```
-
-After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/tgi-run.yaml` with the following contents. You may edit this file to change the settings.
-
-As you can see, we did basic configuration above and configured:
- inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`)
- Llama Guard safety shield with model `Llama-Guard-3-1B`
- Prompt Guard safety shield with model `Prompt-Guard-86M`
-
-To see how these configurations are stored as YAML, check out the file printed at the end of the configuration.
-
-Note that all configurations as well as models are stored in `~/.llama`
-
-
-## Step 3. Run
-Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack configure` step.
-
-```
-llama stack run 8b-instruct
-```
-
-You should see the Llama Stack server start and print the APIs that it is supporting
-
-```
-$ llama stack run 8b-instruct
-
-> initializing model parallel with size 1
-> initializing ddp with size 1
-> initializing pipeline with size 1
-Loaded in 19.28 seconds
-NCCL version 2.20.5+cuda12.4
-Finished model load YES READY
-Serving POST /inference/batch_chat_completion
-Serving POST /inference/batch_completion
-Serving POST /inference/chat_completion
-Serving POST /inference/completion
-Serving POST /safety/run_shield
-Serving POST /agentic_system/memory_bank/attach
-Serving POST /agentic_system/create
-Serving POST /agentic_system/session/create
-Serving POST /agentic_system/turn/create
-Serving POST /agentic_system/delete
-Serving POST /agentic_system/session/delete
-Serving POST /agentic_system/memory_bank/detach
-Serving POST /agentic_system/session/get
-Serving POST /agentic_system/step/get
-Serving POST /agentic_system/turn/get
-Listening on :::5000
-INFO:     Started server process [453333]
-INFO:     Waiting for application startup.
-INFO:     Application startup complete.
-INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-```
-
-> [!NOTE]
-> Configuration is in `~/.llama/builds/local/conda/tgi-run.yaml`. Feel free to increase `max_seq_len`.
-
-> [!IMPORTANT]
-> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
-
-> [!TIP]
-> You might need to use the flag `--disable-ipv6` to disable IPv6 support.
-
-This server is running a Llama model locally.
-
-## Step 4. Test with Client
-Once the server is set up, we can test it with a client to see the example outputs.
-```
-cd /path/to/llama-stack
-conda activate <env>  # any environment containing the llama-stack pip package will work
-
-python -m llama_stack.apis.inference.client localhost 5000
-```
-
-This will run the chat completion client and query the distribution’s /inference/chat_completion API.
-
-Here is an example output:
-```
-User>hello world, write me a 2 sentence poem about the moon
-Assistant> Here's a 2-sentence poem about the moon:
-
-The moon glows softly in the midnight sky,
-A beacon of wonder, as it passes by.
-```
-
-Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by:
-
-```
-python -m llama_stack.apis.safety.client localhost 5000
-```
-
-
-Check out our client SDKs for connecting to a Llama Stack server in your preferred language. You can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
-
-You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
--- a/docs/cli_reference.md
+++ b/docs/cli_reference.md
@ -1,485 +0,0 @@
-# Llama CLI Reference
-
-The `llama` CLI tool helps you set up and use the Llama Stack and agentic systems. It should be available on your path after installing the `llama-stack` package.
-
-### Subcommands
-1. `download`: Downloads models from Meta or Hugging Face.
-2. `model`: Lists available models and their properties.
-3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](cli_reference.md#step-3-building-and-configuring-llama-stack-distributions).
-
-### Sample Usage
-
-```
-llama --help
-```
-<pre style="font-family: monospace;">
-usage: llama [-h] {download,model,stack} ...
-
-Welcome to the Llama CLI
-
-options:
-  -h, --help            show this help message and exit
-
-subcommands:
-  {download,model,stack}
-</pre>
-
-## Step 1. Get the models
-
-You first need to have models downloaded locally.
-
-To download any model you need the **Model Descriptor**.
-This can be obtained by running the command
-```
-llama model list
-```
-
-You should see a table like this:
-
-<pre style="font-family: monospace;">
-+----------------------------------+------------------------------------------+----------------+
-| Model Descriptor                 | Hugging Face Repo                        | Context Length |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-8B                      | meta-llama/Llama-3.1-8B                  | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-70B                     | meta-llama/Llama-3.1-70B                 | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B:bf16-mp8           | meta-llama/Llama-3.1-405B                | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B                    | meta-llama/Llama-3.1-405B-FP8            | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B:bf16-mp16          | meta-llama/Llama-3.1-405B                | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-8B-Instruct             | meta-llama/Llama-3.1-8B-Instruct         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-70B-Instruct            | meta-llama/Llama-3.1-70B-Instruct        | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B-Instruct:bf16-mp8  | meta-llama/Llama-3.1-405B-Instruct       | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B-Instruct           | meta-llama/Llama-3.1-405B-Instruct-FP8   | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B-Instruct:bf16-mp16 | meta-llama/Llama-3.1-405B-Instruct       | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-1B                      | meta-llama/Llama-3.2-1B                  | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-3B                      | meta-llama/Llama-3.2-3B                  | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-11B-Vision              | meta-llama/Llama-3.2-11B-Vision          | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-90B-Vision              | meta-llama/Llama-3.2-90B-Vision          | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-1B-Instruct             | meta-llama/Llama-3.2-1B-Instruct         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-3B-Instruct             | meta-llama/Llama-3.2-3B-Instruct         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-11B-Vision-Instruct     | meta-llama/Llama-3.2-11B-Vision-Instruct | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-90B-Vision-Instruct     | meta-llama/Llama-3.2-90B-Vision-Instruct | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-11B-Vision         | meta-llama/Llama-Guard-3-11B-Vision      | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-1B:int4-mp1        | meta-llama/Llama-Guard-3-1B-INT4         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-1B                 | meta-llama/Llama-Guard-3-1B              | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-8B                 | meta-llama/Llama-Guard-3-8B              | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-8B:int8-mp1        | meta-llama/Llama-Guard-3-8B-INT8         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Prompt-Guard-86M                 | meta-llama/Prompt-Guard-86M              | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-2-8B                 | meta-llama/Llama-Guard-2-8B              | 4K             |
-+----------------------------------+------------------------------------------+----------------+
-</pre>
-
-To download models, you can use the llama download command.
-
-#### Downloading from [Meta](https://llama.meta.com/llama-downloads/)
-
-Here is an example download command to get the 3B-Instruct/11B-Vision-Instruct model. You will need META_URL which can be obtained from [here](https://llama.meta.com/docs/getting_the_models/meta/)
-
-Download the required checkpoints using the following commands:
-```bash
-# download the 3B Instruct model; this can be run on a single GPU
-llama download --source meta --model-id Llama3.2-3B-Instruct --meta-url META_URL
-
-# you can also get the 11B Vision Instruct model
-llama download --source meta --model-id Llama3.2-11B-Vision-Instruct --meta-url META_URL
-
-# llama-agents have safety enabled by default. For this, you will need
-# safety models -- Llama-Guard and Prompt-Guard
-llama download --source meta --model-id Prompt-Guard-86M --meta-url META_URL
-llama download --source meta --model-id Llama-Guard-3-1B --meta-url META_URL
-```
-
-#### Downloading from [Hugging Face](https://huggingface.co/meta-llama)
-
-Essentially, the same commands above work, just replace `--source meta` with `--source huggingface`.
-
-```bash
-llama download --source huggingface --model-id  Llama3.1-8B-Instruct --hf-token <HF_TOKEN>
-
-llama download --source huggingface --model-id Llama3.1-70B-Instruct --hf-token <HF_TOKEN>
-
-llama download --source huggingface --model-id Llama-Guard-3-1B --ignore-patterns *original*
-llama download --source huggingface --model-id Prompt-Guard-86M --ignore-patterns *original*
-```
-
-**Important:** Set your environment variable `HF_TOKEN` or pass in `--hf-token` to the command to validate your access. You can find your token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
-
-> **Tip:** Default for `llama download` is to run with `--ignore-patterns *.safetensors` since we use the `.pth` files in the `original` folder. For Llama Guard and Prompt Guard, however, we need safetensors. Hence, please run with `--ignore-patterns original` so that safetensors are downloaded and `.pth` files are ignored.
-
-#### Downloading via Ollama
-
-If you're already using ollama, we also have a supported Llama Stack distribution `local-ollama` and you can continue to use ollama for managing model downloads.
-
-```
-ollama pull llama3.1:8b-instruct-fp16
-ollama pull llama3.1:70b-instruct-fp16
-```
-
-> [!NOTE]
-> Only the above two models are currently supported by Ollama.
-
-
-## Step 2: Understand the models
-The `llama model` command helps you explore the model’s interface.
-
-### 2.1 Subcommands
-1. `download`: Download the model from different sources. (meta, huggingface)
-2. `list`: Lists all the models available for download with hardware requirements to deploy the models.
-3. `prompt-format`: Show llama model message formats.
-4. `describe`: Describes all the properties of the model.
-
-### 2.2 Sample Usage
-
-`llama model <subcommand> <options>`
-
-```
-llama model --help
-```
-<pre style="font-family: monospace;">
-usage: llama model [-h] {download,list,prompt-format,describe} ...
-
-Work with llama models
-
-options:
-  -h, --help            show this help message and exit
-
-model_subcommands:
-  {download,list,prompt-format,describe}
-</pre>
-
-You can use the describe command to know more about a model:
-```
-llama model describe -m Llama3.2-3B-Instruct
-```
-### 2.3 Describe
-
-<pre style="font-family: monospace;">
-+-----------------------------+----------------------------------+
-| Model                       | Llama3.2-3B-Instruct             |
-+-----------------------------+----------------------------------+
-| Hugging Face ID             | meta-llama/Llama-3.2-3B-Instruct |
-+-----------------------------+----------------------------------+
-| Description                 | Llama 3.2 3b instruct model      |
-+-----------------------------+----------------------------------+
-| Context Length              | 128K tokens                      |
-+-----------------------------+----------------------------------+
-| Weights format              | bf16                             |
-+-----------------------------+----------------------------------+
-| Model params.json           | {                                |
-|                             |     "dim": 3072,                 |
-|                             |     "n_layers": 28,              |
-|                             |     "n_heads": 24,               |
-|                             |     "n_kv_heads": 8,             |
-|                             |     "vocab_size": 128256,        |
-|                             |     "ffn_dim_multiplier": 1.0,   |
-|                             |     "multiple_of": 256,          |
-|                             |     "norm_eps": 1e-05,           |
-|                             |     "rope_theta": 500000.0,      |
-|                             |     "use_scaled_rope": true      |
-|                             | }                                |
-+-----------------------------+----------------------------------+
-| Recommended sampling params | {                                |
-|                             |     "strategy": "top_p",         |
-|                             |     "temperature": 1.0,          |
-|                             |     "top_p": 0.9,                |
-|                             |     "top_k": 0                   |
-|                             | }                                |
-+-----------------------------+----------------------------------+
-</pre>
-### 2.4 Prompt Format
-You can also run `llama model prompt-format` to see all of the templates and their tokens:
-
-```
-llama model prompt-format -m Llama3.2-3B-Instruct
-```
-![alt text](resources/prompt-format.png)
-
-
-
-You will be shown a Markdown formatted description of the model interface and how prompts / messages are formatted for various scenarios.
-
-**NOTE**: Outputs in terminal are color printed to show special tokens.
-
-
-## Step 3: Building, and Configuring Llama Stack Distributions
-
- Please see our [Getting Started](getting_started.md) guide for more details on how to build and start a Llama Stack distribution.
-
-### Step 3.1 Build
-In the following steps, imagine we'll be working with a `Llama3.1-8B-Instruct` model. We will name our build `tgi` to help us remember the config. We will start building our distribution (in the form of a Conda environment or Docker image). In this step, we will specify:
- `name`: the name for our distribution (e.g. `tgi`)
- `image_type`: our build image type (`conda | docker`)
- `distribution_spec`: our distribution specs for specifying API providers
-  - `description`: a short description of the configurations for the distribution
-  - `providers`: specifies the underlying implementation for serving each API endpoint
-  - `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of Docker image or Conda environment.
-
-
-At the end of build command, we will generate `<name>-build.yaml` file storing the build configurations.
-
-After this step is complete, a file named `<name>-build.yaml` will be generated and saved at the output file path specified at the end of the command.
-
-#### Building from scratch
- New users can start by running `llama stack build`, which launches an interactive wizard that prompts you to enter the build configuration.
-```
-llama stack build
-```
-
-Running the command above lets you fill in the configuration for building your Llama Stack distribution. You will see output like the following.
-
-```
-> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-llama-stack
-> Enter the image type you want your distribution to be built with (docker or conda): conda
-
- Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
-> Enter the API provider for the inference API: (default=meta-reference): meta-reference
-> Enter the API provider for the safety API: (default=meta-reference): meta-reference
-> Enter the API provider for the agents API: (default=meta-reference): meta-reference
-> Enter the API provider for the memory API: (default=meta-reference): meta-reference
-> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
-
- > (Optional) Enter a short description for your Llama Stack distribution:
-
-Build spec configuration saved at ~/.conda/envs/llamastack-my-local-llama-stack/my-local-llama-stack-build.yaml
-```
-
-#### Building from templates
- To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.
-
-The following command will allow you to see the available templates and their corresponding providers.
-```
-llama stack build --list-templates
-```
-
-![alt text](resources/list-templates.png)
-
-You may then pick a template to build your distribution with providers fitted to your liking.
-
-```
-llama stack build --template tgi --image-type conda
-```
-
-```
-$ llama stack build --template tgi --image-type conda
-...
-...
-Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml
-You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/envs/llamastack-tgi/tgi-build.yaml`
-```
-
-#### Building from config file
- In addition to templates, you may customize the build by editing a config file and then building from it with the command shown below.
-
- The config file will have contents like the ones in `llama_stack/templates/`.
-
-```
-$ cat build.yaml
-
-name: ollama
-distribution_spec:
-  description: Like local, but use ollama for running LLM inference
-  providers:
-    inference: remote::ollama
-    memory: meta-reference
-    safety: meta-reference
-    agents: meta-reference
-    telemetry: meta-reference
-image_type: conda
-```
-
-```
-llama stack build --config build.yaml
-```
-
-#### How to build distribution with Docker image
-
-To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type.
-
-```
-llama stack build --template tgi --image-type docker
-```
-
-Alternatively, you may use a config file: set `image_type` to `docker` in your `<name>-build.yaml` file and run `llama stack build --config <name>-build.yaml`. The `<name>-build.yaml` file will have contents like:
-
-```
-name: local-docker-example
-distribution_spec:
-  description: Use code from `llama_stack` itself to serve all llama stack APIs
-  docker_image: null
-  providers:
-    inference: meta-reference
-    memory: meta-reference-faiss
-    safety: meta-reference
-    agentic_system: meta-reference
-    telemetry: console
-image_type: docker
-```
-
-The following command allows you to build a Docker image with the name `<name>`
-```
-llama stack build --config <name>-build.yaml
-
-Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
-FROM python:3.10-slim
-WORKDIR /app
-...
-...
-You can run it with: podman run -p 8000:8000 llamastack-docker-local
-Build spec configuration saved at ~/.llama/distributions/docker/docker-local-build.yaml
-```
-
-
-### Step 3.2 Configure
-After our distribution is built (either as a Docker image or a Conda environment), we will run the following command to configure it:
-```
-llama stack configure [ <docker-image-name> | <path/to/name-build.yaml>]
-```
- For `conda` environments: <path/to/name.build.yaml> would be the generated build spec saved from Step 1.
- For `docker` images downloaded from Dockerhub, you could also use <docker-image-name> as the argument.
-   - Run `docker images` to check the list of available images on your machine.
-
-```
-$ llama stack configure ~/.llama/distributions/conda/tgi-build.yaml
-
-Configuring API: inference (meta-reference)
-Enter value for model (existing: Llama3.1-8B-Instruct) (required):
-Enter value for quantization (optional):
-Enter value for torch_seed (optional):
-Enter value for max_seq_len (existing: 4096) (required):
-Enter value for max_batch_size (existing: 1) (required):
-
-Configuring API: memory (meta-reference-faiss)
-
-Configuring API: safety (meta-reference)
-Do you want to configure llama_guard_shield? (y/n): y
-Entering sub-configuration for llama_guard_shield:
-Enter value for model (default: Llama-Guard-3-1B) (required):
-Enter value for excluded_categories (default: []) (required):
-Enter value for disable_input_check (default: False) (required):
-Enter value for disable_output_check (default: False) (required):
-Do you want to configure prompt_guard_shield? (y/n): y
-Entering sub-configuration for prompt_guard_shield:
-Enter value for model (default: Prompt-Guard-86M) (required):
-
-Configuring API: agentic_system (meta-reference)
-Enter value for brave_search_api_key (optional):
-Enter value for bing_search_api_key (optional):
-Enter value for wolfram_api_key (optional):
-
-Configuring API: telemetry (console)
-
-YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml
-```
-
-After this step succeeds, you should find a run configuration spec at `~/.llama/builds/conda/8b-instruct-run.yaml`. You may edit this file to change the settings.
-
-As you can see, we did basic configuration above and configured:
- inference to run on model `Llama3.1-8B-Instruct` (obtained from `llama model list`)
- Llama Guard safety shield with model `Llama-Guard-3-1B`
- Prompt Guard safety shield with model `Prompt-Guard-86M`
-
-To see how these settings are stored as YAML, check out the file printed at the end of the configuration step.
-
-Note that all configurations, as well as models, are stored in `~/.llama`.
-
-
-### Step 3.3 Run
-Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file written at the end of the `llama stack configure` step.
-
-```
-llama stack run ~/.llama/builds/conda/tgi-run.yaml
-```
-
-You should see the Llama Stack server start and print the APIs it supports:
-
-```
-$ llama stack run ~/.llama/builds/local/conda/tgi-run.yaml
-
-> initializing model parallel with size 1
-> initializing ddp with size 1
-> initializing pipeline with size 1
-Loaded in 19.28 seconds
-NCCL version 2.20.5+cuda12.4
-Finished model load YES READY
-Serving POST /inference/batch_chat_completion
-Serving POST /inference/batch_completion
-Serving POST /inference/chat_completion
-Serving POST /inference/completion
-Serving POST /safety/run_shield
-Serving POST /agentic_system/memory_bank/attach
-Serving POST /agentic_system/create
-Serving POST /agentic_system/session/create
-Serving POST /agentic_system/turn/create
-Serving POST /agentic_system/delete
-Serving POST /agentic_system/session/delete
-Serving POST /agentic_system/memory_bank/detach
-Serving POST /agentic_system/session/get
-Serving POST /agentic_system/step/get
-Serving POST /agentic_system/turn/get
-Listening on :::5000
-INFO:     Started server process [453333]
-INFO:     Waiting for application startup.
-INFO:     Application startup complete.
-INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-```
-
-> [!NOTE]
-> Configuration is in `~/.llama/builds/local/conda/tgi-run.yaml`. Feel free to increase `max_seq_len`.
-
-> [!IMPORTANT]
-> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
-
-> [!TIP]
-> You might need to pass the `--disable-ipv6` flag to disable IPv6 support.
-
-This server is running a Llama model locally.
-
-### Step 3.4 Test with Client
-Once the server is set up, we can test it with a client to see example output.
-```
-cd /path/to/llama-stack
-conda activate <env>  # any environment containing the llama-stack pip package will work
-
-python -m llama_stack.apis.inference.client localhost 5000
-```
-
-This will run the chat completion client and query the distribution’s /inference/chat_completion API.
-
-Here is an example output:
-```
-User>hello world, write me a 2 sentence poem about the moon
-Assistant> Here's a 2-sentence poem about the moon:
-
-The moon glows softly in the midnight sky,
-A beacon of wonder, as it passes by.
-```
-
-Similarly, you can test safety (if you configured the llama-guard and/or prompt-guard shields) by running:
-
-```
-python -m llama_stack.apis.safety.client localhost 5000
-```
-
-You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
--- a/docs/contbuild.sh
+++ b/docs/contbuild.sh
@ -0,0 +1,7 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the terms described in the LICENSE file in
+# the root directory of this source tree.
+
+sphinx-autobuild --write-all source build/html --watch source/
--- a/docs/developer_cookbook.md
+++ b/docs/developer_cookbook.md
@ -1,41 +0,0 @@
-# Llama Stack Developer Cookbook
-
-Based on your developer needs, below are references to guides to help you get started.
-
-### Hosted Llama Stack Endpoint
-* Developer Need: I want to connect to a Llama Stack endpoint to build my applications.
-* Effort: 1min
-* Guide:
-  - Check out our [DeepLearning course](https://www.deeplearning.ai/short-courses/introducing-multimodal-llama-3-2) on building Llama Stack apps on a pre-hosted Llama Stack endpoint.
-
-
-### Local meta-reference Llama Stack Server
-* Developer Need: I want to start a local Llama Stack server with my GPU using meta-reference implementations.
-* Effort: 5min
-* Guide:
-  - Please see our [Getting Started Guide](./getting_started.md) on starting up a meta-reference Llama Stack server.
-
-### Llama Stack Server with Remote Providers
-* Developer Need: I want a Llama Stack distribution with a remote provider.
-* Effort: 10min
-* Guide:
-  - Please see our [Distributions Guide](../distributions/) on starting up distributions with remote providers.
-
-
-### On-Device (iOS) Llama Stack
-* Developer Need: I want to use Llama Stack on-device.
-* Effort: 1.5hr
-* Guide:
-  - Please see our [iOS Llama Stack SDK](../llama_stack/providers/impls/ios/inference) implementations
-
-### Assemble your own Llama Stack Distribution
-* Developer Need: I want to assemble my own distribution with API providers of my choosing.
-* Effort: 30min
-* Guide:
-  - Please see our [Building Distribution](./building_distro.md) guide for assembling your own Llama Stack distribution with your choice of API providers.
-
-### Adding a New API Provider
-* Developer Need: I want to add a new API provider to Llama Stack.
-* Effort: 3hr
-* Guide:
-  - Please see our [Adding a New API Provider](./new_api_provider.md) guide for adding a new API provider.
--- a/docs/getting_started.ipynb
+++ b/docs/getting_started.ipynb
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@ -1,230 +0,0 @@
-# Getting Started with Llama Stack
-
-This guide walks you through the end-to-end flow for Llama Stack. It focuses on building a Llama Stack distribution and starting up a Llama Stack server. Please see our [documentation](../README.md) for what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps built with Llama Stack.
-
-## Installation
-The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package.
-
-You have two ways to install this repository:
-
-1. **Install as a package**:
-   You can install the repository directly from [PyPI](https://pypi.org/project/llama-stack/) by running the following command:
-   ```bash
-   pip install llama-stack
-   ```
-
-2. **Install from source**:
-   If you prefer to install from the source code, follow these steps:
-   ```bash
-    mkdir -p ~/local
-    cd ~/local
-    git clone git@github.com:meta-llama/llama-stack.git
-
-    conda create -n stack python=3.10
-    conda activate stack
-
-    cd llama-stack
-    $CONDA_PREFIX/bin/pip install -e .
-   ```
-
-For what you can do with the Llama CLI, please refer to [CLI Reference](./cli_reference.md).
-
-## Starting Up Llama Stack Server
-
-You have two ways to start up the Llama Stack server:
-
-1. **Starting up the server via Docker**:
-
-We provide pre-built Docker images of Llama Stack distributions, which are linked from the [distributions](../distributions/) folder.
-
-> [!NOTE]
-> For GPU inference, set the following environment variable to the local directory containing your model checkpoints, and enable GPU access when starting the Docker container.
-```
-export LLAMA_CHECKPOINT_DIR=~/.llama
-```
-
-> [!NOTE]
-> `~/.llama` should be the path containing downloaded weights of Llama models.
-
-To download Llama models, use:
-```
-llama download --model-id Llama3.1-8B-Instruct
-```
-
-To download and start running a pre-built docker container, you may use the following commands:
-
-```
-cd llama-stack/distributions/meta-reference-gpu
-docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
-```
-
-> [!TIP]
-> Pro Tip: You can use `docker compose up` to start a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). Check out [these scripts](../distributions/) to help you get started.
-
-
-2. **Build->Configure->Run Llama Stack server via conda**:
-
-	You may also build a LlamaStack distribution from scratch, configure it, and start running the distribution. This is useful for developing on LlamaStack.
-
-	**`llama stack build`**
-	- You'll be prompted to enter build information interactively.
-	```
-	llama stack build
-
-	> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack
-	> Enter the image type you want your distribution to be built with (docker or conda): conda
-
-	Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
-	> Enter the API provider for the inference API: (default=meta-reference): meta-reference
-	> Enter the API provider for the safety API: (default=meta-reference): meta-reference
-	> Enter the API provider for the agents API: (default=meta-reference): meta-reference
-	> Enter the API provider for the memory API: (default=meta-reference): meta-reference
-	> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
-
-	> (Optional) Enter a short description for your Llama Stack distribution:
-
-	Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml
-	You can now run `llama stack configure my-local-stack`
-	```
-
-	**`llama stack configure`**
-	- Run `llama stack configure <name>` with the name you have previously defined in `build` step.
-	```
-	llama stack configure <name>
-	```
-	- You will be prompted to enter configurations for your Llama Stack
-
-	```
-	$ llama stack configure my-local-stack
-
-	Configuring API `inference`...
-	=== Configuring provider `meta-reference` for API inference...
-	Enter value for model (default: Llama3.1-8B-Instruct) (required):
-	Do you want to configure quantization? (y/n): n
-	Enter value for torch_seed (optional):
-	Enter value for max_seq_len (default: 4096) (required):
-	Enter value for max_batch_size (default: 1) (required):
-
-	Configuring API `safety`...
-	=== Configuring provider `meta-reference` for API safety...
-	Do you want to configure llama_guard_shield? (y/n): n
-	Do you want to configure prompt_guard_shield? (y/n): n
-
-	Configuring API `agents`...
-	=== Configuring provider `meta-reference` for API agents...
-	Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite):
-
-	Configuring SqliteKVStoreConfig:
-	Enter value for namespace (optional):
-	Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required):
-
-	Configuring API `memory`...
-	=== Configuring provider `meta-reference` for API memory...
-	> Please enter the supported memory bank type your provider has for memory: vector
-
-	Configuring API `telemetry`...
-	=== Configuring provider `meta-reference` for API telemetry...
-
-	> YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml.
-	You can now run `llama stack run my-local-stack --port PORT`
-	```
-
-	**`llama stack run`**
-	- Run `llama stack run <name>` with the name you have previously defined.
-	```
-	llama stack run my-local-stack
-
-	...
-	> initializing model parallel with size 1
-	> initializing ddp with size 1
-	> initializing pipeline with size 1
-	...
-	Finished model load YES READY
-	Serving POST /inference/chat_completion
-	Serving POST /inference/completion
-	Serving POST /inference/embeddings
-	Serving POST /memory_banks/create
-	Serving DELETE /memory_bank/documents/delete
-	Serving DELETE /memory_banks/drop
-	Serving GET /memory_bank/documents/get
-	Serving GET /memory_banks/get
-	Serving POST /memory_bank/insert
-	Serving GET /memory_banks/list
-	Serving POST /memory_bank/query
-	Serving POST /memory_bank/update
-	Serving POST /safety/run_shield
-	Serving POST /agentic_system/create
-	Serving POST /agentic_system/session/create
-	Serving POST /agentic_system/turn/create
-	Serving POST /agentic_system/delete
-	Serving POST /agentic_system/session/delete
-	Serving POST /agentic_system/session/get
-	Serving POST /agentic_system/step/get
-	Serving POST /agentic_system/turn/get
-	Serving GET /telemetry/get_trace
-	Serving POST /telemetry/log_event
-	Listening on :::5000
-	INFO:     Started server process [587053]
-	INFO:     Waiting for application startup.
-	INFO:     Application startup complete.
-	INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-	```
-
-
-## Testing with client
-Once the server is set up, we can test it with a client to see example output.
-```
-cd /path/to/llama-stack
-conda activate <env>  # any environment containing the llama-stack pip package will work
-
-python -m llama_stack.apis.inference.client localhost 5000
-```
-
-This will run the chat completion client and query the distribution’s `/inference/chat_completion` API.
-
-Here is an example output:
-```
-User>hello world, write me a 2 sentence poem about the moon
-Assistant> Here's a 2-sentence poem about the moon:
-
-The moon glows softly in the midnight sky,
-A beacon of wonder, as it passes by.
-```
-
-You may also send a POST request to the server:
-```
-curl http://localhost:5000/inference/chat_completion \
--H "Content-Type: application/json" \
--d '{
-	"model": "Llama3.1-8B-Instruct",
-	"messages": [
-		{"role": "system", "content": "You are a helpful assistant."},
-		{"role": "user", "content": "Write me a 2 sentence poem about the moon"}
-	],
-	"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
-}'
-
-Output:
-{'completion_message': {'role': 'assistant',
-  'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.',
-  'stop_reason': 'out_of_tokens',
-  'tool_calls': []},
- 'logprobs': null}
-
-```
-
-
-Similarly, you can test safety (if you configured the llama-guard and/or prompt-guard shields) by running:
-
-```
-python -m llama_stack.apis.safety.client localhost 5000
-```
-
-
-Check out our client SDKs for connecting to a Llama Stack server in your preferred language. You can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
-
-You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
-
-
-## Advanced Guides
-Please see our [Building a Llama Stack Distribution](./building_distro.md) guide for more details on how to assemble your own Llama Stack Distribution.
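Reviewer note: the curl request in the removed `getting_started.md` can also be expressed in Python. The sketch below only builds the request and leaves sending to a separate helper. The helper names `build_chat_request` and `send` are made up for illustration, and the endpoint path is the one from the removed doc; since this commit prefixes generated routes with an API version, the path on a current server may differ.

```python
import json
import urllib.request


def build_chat_request(host: str = "localhost", port: int = 5000) -> urllib.request.Request:
    """Build (but do not send) the POST request from the removed curl example."""
    payload = {
        "model": "Llama3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write me a 2 sentence poem about the moon"},
        ],
        "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/inference/chat_completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request; needs a running server, otherwise urlopen raises URLError."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `send(build_chat_request())` against a running server would return the raw response body as text.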
--- a/docs/new_api_provider.md
+++ b/docs/new_api_provider.md
@ -1,26 +0,0 @@
-# Developer Guide: Adding a New API Provider
-
-This guide contains references to walk you through adding a new API provider.
-
-### Adding a new API provider
-1. First, decide which API your provider falls into (e.g. Inference, Safety, Agents, Memory).
-2. Decide whether your provider is a remote provider or an inline implementation. A remote provider makes a remote request to a service. An inline provider executes its implementation locally. Check out the examples and follow their structure to add your own API provider. Please find the following code pointers:
-
-    - [Inference Remote Adapter](../llama_stack/providers/adapters/inference/)
-    - [Inference Inline Provider](../llama_stack/providers/impls/)
-
-3. [Build a Llama Stack distribution](./building_distro.md) with your API provider.
-4. Test your code!
-
-### Testing your newly added API providers
-
-1. Start with an _integration test_ for your provider. That means we will instantiate the real provider, pass it real configuration and if it is a remote service, we will actually hit the remote service. We **strongly** discourage mocking for these tests at the provider level. Llama Stack is first and foremost about integration so we need to make sure stuff works end-to-end. See [llama_stack/providers/tests/inference/test_inference.py](../llama_stack/providers/tests/inference/test_inference.py) for an example.
-
-2. In addition, if you want to unit test functionality within your provider, feel free to do so. You can find some tests in `tests/` but they aren't well supported so far.
-
-3. Test with a client-server Llama Stack setup. (a) Start a Llama Stack server with your own distribution which includes the new provider. (b) Send a client request to the server. See `llama_stack/apis/<api>/client.py` for how this is done. These client scripts can serve as lightweight tests.
-
-You can find more complex client scripts in the [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) repo. Note down which scripts work and which do not work with your distribution.
-
-### Submit your PR
-After you have fully tested your newly added API provider, submit a PR with the attached test plan. You must have a Test Plan in the summary section of your PR.
--- a/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
+++ b/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
--- a/docs/openapi_generator/generate.py
+++ b/docs/openapi_generator/generate.py
@ -18,73 +18,22 @@ import yaml

 from llama_models import schema_utils

-from .pyopenapi.options import Options
-from .pyopenapi.specification import Info, Server
-from .pyopenapi.utility import Specification
-
 # We do some monkey-patching to ensure our definitions only use the minimal
 # (json_schema_type, webmethod) definitions from the llama_models package. For
 # generation though, we need the full definitions and implementations from the
 #  (json-strong-typing) package.

-from .strong_typing.schema import json_schema_type
+from .strong_typing.schema import json_schema_type, register_schema

 schema_utils.json_schema_type = json_schema_type
+schema_utils.register_schema = register_schema

-from llama_models.llama3.api.datatypes import *  # noqa: F403
-from llama_stack.apis.agents import *  # noqa: F403
-from llama_stack.apis.datasets import *  # noqa: F403
-from llama_stack.apis.datasetio import *  # noqa: F403
-from llama_stack.apis.scoring import *  # noqa: F403
-from llama_stack.apis.scoring_functions import *  # noqa: F403
-from llama_stack.apis.eval import *  # noqa: F403
-from llama_stack.apis.inference import *  # noqa: F403
-from llama_stack.apis.batch_inference import *  # noqa: F403
-from llama_stack.apis.memory import *  # noqa: F403
-from llama_stack.apis.telemetry import *  # noqa: F403
-from llama_stack.apis.post_training import *  # noqa: F403
-from llama_stack.apis.synthetic_data_generation import *  # noqa: F403
-from llama_stack.apis.safety import *  # noqa: F403
-from llama_stack.apis.models import *  # noqa: F403
-from llama_stack.apis.memory_banks import *  # noqa: F403
-from llama_stack.apis.shields import *  # noqa: F403
-from llama_stack.apis.inspect import *  # noqa: F403
+from llama_stack.apis.version import LLAMA_STACK_API_VERSION  # noqa: E402
+from llama_stack.distribution.stack import LlamaStack  # noqa: E402

-
-class LlamaStack(
-    MemoryBanks,
-    Inference,
-    BatchInference,
-    Agents,
-    Safety,
-    SyntheticDataGeneration,
-    Datasets,
-    Telemetry,
-    PostTraining,
-    Memory,
-    Eval,
-    Scoring,
-    ScoringFunctions,
-    DatasetIO,
-    Models,
-    Shields,
-    Inspect,
-):
-    pass
-
-
-# TODO: this should be fixed in the generator itself so it reads appropriate annotations
-STREAMING_ENDPOINTS = [
-    "/agents/turn/create",
-    "/inference/chat_completion",
-]
-
-
-def patch_sse_stream_responses(spec: Specification):
-    for path, path_item in spec.document.paths.items():
-        if path in STREAMING_ENDPOINTS:
-            content = path_item.post.responses["200"].content.pop("application/json")
-            path_item.post.responses["200"].content["text/event-stream"] = content
+from .pyopenapi.options import Options  # noqa: E402
+from .pyopenapi.specification import Info, Server  # noqa: E402
+from .pyopenapi.utility import Specification  # noqa: E402


 def main(output_dir: str):
@ -102,19 +51,15 @@ def main(output_dir: str):
        Options(
            server=Server(url="http://any-hosted-llama-stack.com"),
            info=Info(
-                title="[DRAFT] Llama Stack Specification",
-                version="0.0.1",
-                description="""This is the specification of the llama stack that provides
+                title="Llama Stack Specification",
+                version=LLAMA_STACK_API_VERSION,
+                description="""This is the specification of the Llama Stack that provides
                a set of endpoints and their corresponding interfaces that are tailored to
-                best leverage Llama Models. The specification is still in draft and subject to change.
-                Generated at """
-                + now,
+                best leverage Llama Models.""",
            ),
        ),
    )

-    patch_sse_stream_responses(spec)
-
    with open(output_dir / "llama-stack-spec.yaml", "w", encoding="utf-8") as fp:
        yaml.dump(spec.get_json(), fp, allow_unicode=True)

--- a/docs/openapi_generator/pyopenapi/generator.py
+++ b/docs/openapi_generator/pyopenapi/generator.py
@ -4,6 +4,7 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.

+import collections
 import hashlib
 import ipaddress
 import typing
@ -176,9 +177,20 @@ class ContentBuilder:
    ) -> Dict[str, MediaType]:
        "Creates the content subtree for a request or response."

+        def has_iterator_type(t):
+            if typing.get_origin(t) is typing.Union:
+                return any(has_iterator_type(a) for a in typing.get_args(t))
+            else:
+                # TODO: needs a proper fix where we let all types correctly flow upwards
+                # and then test against AsyncIterator
+                return "StreamChunk" in str(t)
+
        if is_generic_list(payload_type):
            media_type = "application/jsonl"
            item_type = unwrap_generic_list(payload_type)
+        elif has_iterator_type(payload_type):
+            item_type = payload_type
+            media_type = "text/event-stream"
        else:
            media_type = "application/json"
            item_type = payload_type
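Reviewer note: the media-type selection added in this hunk can be exercised in isolation. The sketch below is a simplified stand-in, not the generator's code: `pick_media_type` and the two response classes are hypothetical, and the `List[...]` check only approximates `is_generic_list`.

```python
import typing
from typing import AsyncIterator, List, Union


class ChatCompletionResponse:  # hypothetical non-streaming response type
    pass


class ChatCompletionResponseStreamChunk:  # hypothetical streaming chunk type
    pass


def has_iterator_type(t: object) -> bool:
    # Same heuristic as the hunk: recurse into Union members, otherwise
    # look for "StreamChunk" in the string form of the type.
    if typing.get_origin(t) is Union:
        return any(has_iterator_type(a) for a in typing.get_args(t))
    return "StreamChunk" in str(t)


def pick_media_type(payload_type: object) -> str:
    # Approximation of is_generic_list(): treat List[...] as a JSON Lines payload.
    if typing.get_origin(payload_type) is list:
        return "application/jsonl"
    if has_iterator_type(payload_type):
        return "text/event-stream"
    return "application/json"
```

Because the check is string-based, any type whose name contains `StreamChunk` is treated as streaming, which is the limitation the TODO in the hunk points at.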
@ -190,7 +202,9 @@ class ContentBuilder:
    ) -> MediaType:
        schema = self.schema_builder.classdef_to_ref(item_type)
        if self.schema_transformer:
-            schema_transformer: Callable[[SchemaOrRef], SchemaOrRef] = self.schema_transformer  # type: ignore
+            schema_transformer: Callable[[SchemaOrRef], SchemaOrRef] = (
+                self.schema_transformer
+            )
            schema = schema_transformer(schema)

        if not examples:
@ -424,6 +438,14 @@ class Generator:
        return extra_tags

    def _build_operation(self, op: EndpointOperation) -> Operation:
+        if op.defining_class.__name__ in [
+            "SyntheticDataGeneration",
+            "PostTraining",
+            "BatchInference",
+        ]:
+            op.defining_class.__name__ = f"{op.defining_class.__name__} (Coming Soon)"
+            print(op.defining_class.__name__)
+
        doc_string = parse_type(op.func_ref)
        doc_params = dict(
            (param.name, param.description) for param in doc_string.params.values()
@ -464,13 +486,22 @@ class Generator:
        parameters = path_parameters + query_parameters
        parameters += [
            Parameter(
-                name="X-LlamaStack-ProviderData",
+                name="X-LlamaStack-Provider-Data",
                in_=ParameterLocation.Header,
                description="JSON-encoded provider data which will be made available to the adapter servicing the API",
                required=False,
                schema=self.schema_builder.classdef_to_ref(str),
            )
        ]
+        parameters += [
+            Parameter(
+                name="X-LlamaStack-Client-Version",
+                in_=ParameterLocation.Header,
+                description="Version of the client making the request. This is used to ensure that the client and server are compatible.",
+                required=False,
+                schema=self.schema_builder.classdef_to_ref(str),
+            )
+        ]

        # data passed in payload
        if op.request_params:
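Reviewer note: both headers declared in this hunk are optional strings, and the provider-data header is described as JSON-encoded. A client might assemble them roughly as follows. This is a sketch only: `build_llama_stack_headers` is a hypothetical helper, and the provider-data key and version string used in the usage note are placeholders, not documented values.

```python
import json
from typing import Dict, Optional


def build_llama_stack_headers(
    provider_data: Optional[dict] = None,
    client_version: Optional[str] = None,
) -> Dict[str, str]:
    """Return only the headers the caller actually supplied values for."""
    headers: Dict[str, str] = {}
    if provider_data is not None:
        # The spec describes this header as JSON-encoded provider data.
        headers["X-LlamaStack-Provider-Data"] = json.dumps(provider_data)
    if client_version is not None:
        headers["X-LlamaStack-Client-Version"] = client_version
    return headers
```

For example, `build_llama_stack_headers({"example_api_key": "secret"}, "0.1.0")` yields both headers, while calling it with no arguments yields an empty dict, matching `required=False` in the spec.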
@ -506,7 +537,6 @@ class Generator:
            success_type_descriptions = {
                item: doc_string.short_description
                for item, doc_string in success_type_docstring.items()
-                if doc_string.short_description
            }
        else:
            # use return type as a single response type
@ -565,6 +595,7 @@ class Generator:
            )
            responses.update(response_builder.build_response(response_options))

+        assert len(responses.keys()) > 0, f"No responses found for {op.name}"
        if op.event_type is not None:
            builder = ContentBuilder(self.schema_builder)
            callbacks = {
@ -618,6 +649,7 @@ class Generator:
                raise NotImplementedError(f"unknown HTTP method: {op.http_method}")

            route = op.get_route()
+            print(f"route: {route}")
            if route in paths:
                paths[route].update(pathItem)
            else:
@ -671,6 +703,8 @@ class Generator:
        for extra_tag_group in extra_tag_groups.values():
            tags.extend(extra_tag_group)

+        tags = sorted(tags, key=lambda t: t.name)
+
        tag_groups = []
        if operation_tags:
            tag_groups.append(
--- a/docs/openapi_generator/pyopenapi/operations.py
+++ b/docs/openapi_generator/pyopenapi/operations.py
@ -8,18 +8,14 @@ import collections.abc
 import enum
 import inspect
 import typing
-import uuid
 from dataclasses import dataclass
 from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Tuple, Union

+from llama_stack.apis.version import LLAMA_STACK_API_VERSION
+
 from termcolor import colored

-from ..strong_typing.inspection import (
-    get_signature,
-    is_type_enum,
-    is_type_optional,
-    unwrap_optional_type,
-)
+from ..strong_typing.inspection import get_signature


 def split_prefix(
@ -111,9 +107,9 @@ class EndpointOperation:

    def get_route(self) -> str:
        if self.route is not None:
-            return self.route
+            return "/".join(["", LLAMA_STACK_API_VERSION, self.route.lstrip("/")])

-        route_parts = ["", self.name]
+        route_parts = ["", LLAMA_STACK_API_VERSION, self.name]
        for param_name, _ in self.path_params:
            route_parts.append("{" + param_name + "}")
        return "/".join(route_parts)
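Reviewer note: the effect of the new version prefix is easiest to see with concrete inputs. The sketch below reproduces the `get_route` logic as a free function. `API_VERSION = "v1"` is an assumption standing in for `LLAMA_STACK_API_VERSION`, whose value is not shown in this diff, and the route names in the usage note are illustrative.

```python
from typing import List, Optional, Tuple

# Assumption: placeholder for LLAMA_STACK_API_VERSION (actual value not shown in the diff).
API_VERSION = "v1"


def get_route(name: str, route: Optional[str], path_params: List[Tuple[str, type]]) -> str:
    if route is not None:
        # User-defined route: strip leading slashes so the join does not produce "//".
        return "/".join(["", API_VERSION, route.lstrip("/")])

    # Derived route: /<version>/<name>/{param}/...
    route_parts = ["", API_VERSION, name]
    for param_name, _ in path_params:
        route_parts.append("{" + param_name + "}")
    return "/".join(route_parts)
```

With these assumptions, a user-defined route `/inference/chat-completion` becomes `/v1/inference/chat-completion`, and a derived route for `models` with one path parameter `model_id` becomes `/v1/models/{model_id}`.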
@ -176,10 +172,16 @@ def _get_endpoint_functions(
 def _get_defining_class(member_fn: str, derived_cls: type) -> type:
    "Find the class in which a member function is first defined in a class inheritance hierarchy."

+    # This import must be dynamic here
+    from llama_stack.apis.tools import RAGToolRuntime, ToolRuntime
+
    # iterate in reverse member resolution order to find most specific class first
    for cls in reversed(inspect.getmro(derived_cls)):
        for name, _ in inspect.getmembers(cls, inspect.isfunction):
            if name == member_fn:
+                # HACK ALERT
+                if cls == RAGToolRuntime:
+                    return ToolRuntime
                return cls

    raise ValidationError(
@ -260,42 +262,16 @@ def get_endpoint_operations(
                    f"parameter '{param_name}' in function '{func_name}' has no type annotation"
                )

-            if is_type_optional(param_type):
-                inner_type: type = unwrap_optional_type(param_type)
-            else:
-                inner_type = param_type
-
-            if prefix == "get" and (
-                inner_type is bool
-                or inner_type is int
-                or inner_type is float
-                or inner_type is str
-                or inner_type is uuid.UUID
-                or is_type_enum(inner_type)
-            ):
-                if parameter.kind == inspect.Parameter.POSITIONAL_ONLY:
-                    if route_params is not None and param_name not in route_params:
-                        raise ValidationError(
-                            f"positional parameter '{param_name}' absent from user-defined route '{route}' for function '{func_name}'"
-                        )
-
-                    # simple type maps to route path element, e.g. /study/{uuid}/{version}
+            if prefix in ["get", "delete"]:
+                if route_params is not None and param_name in route_params:
                    path_params.append((param_name, param_type))
                else:
-                    if route_params is not None and param_name in route_params:
-                        raise ValidationError(
-                            f"query parameter '{param_name}' found in user-defined route '{route}' for function '{func_name}'"
-                        )
-
-                    # simple type maps to key=value pair in query string
                    query_params.append((param_name, param_type))
            else:
                if route_params is not None and param_name in route_params:
-                    raise ValidationError(
-                        f"user-defined route '{route}' for function '{func_name}' has parameter '{param_name}' of composite type: {param_type}"
-                    )
-
-                request_params.append((param_name, param_type))
+                    path_params.append((param_name, param_type))
+                else:
+                    request_params.append((param_name, param_type))

        # check if function has explicit return type
        if signature.return_annotation is inspect.Signature.empty:
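Reviewer note: after this hunk, parameter placement no longer depends on the parameter's type. A parameter named in the route always becomes a path parameter; everything else goes to the query string for `get`/`delete` and to the request body otherwise. A minimal sketch of that rule, using a hypothetical `classify_params` helper that tracks names only:

```python
from typing import Dict, List, Optional, Set


def classify_params(
    prefix: str,
    param_names: List[str],
    route_params: Optional[Set[str]],
) -> Dict[str, List[str]]:
    """Sort parameter names into path, query, and body buckets."""
    buckets: Dict[str, List[str]] = {"path": [], "query": [], "body": []}
    for name in param_names:
        if route_params is not None and name in route_params:
            # Named in the route template, regardless of HTTP prefix.
            buckets["path"].append(name)
        elif prefix in ("get", "delete"):
            buckets["query"].append(name)
        else:
            buckets["body"].append(name)
    return buckets
```

One consequence worth checking in review: the validation errors removed by this hunk (composite types in routes, positional parameters missing from the route) are no longer raised, so a mistyped route parameter name silently falls through to the query string or body.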
@ -315,21 +291,33 @@ def get_endpoint_operations(
                )
        else:
            event_type = None
-            response_type = return_type

-        # set HTTP request method based on type of request and presence of payload
-        if not request_params:
+            def process_type(t):
+                if typing.get_origin(t) is collections.abc.AsyncIterator:
+                    # NOTE(ashwin): this is SSE and there is no way to represent it. either we make it a List
+                    # or the item type. I am choosing it to be the latter
+                    args = typing.get_args(t)
+                    return args[0]
+                elif typing.get_origin(t) is typing.Union:
+                    types = [process_type(a) for a in typing.get_args(t)]
+                    return typing._UnionGenericAlias(typing.Union, tuple(types))
+                else:
+                    return t
+
+            response_type = process_type(return_type)
+
            if prefix in ["delete", "remove"]:
                http_method = HTTPMethod.DELETE
-            else:
+            elif prefix == "post":
+                http_method = HTTPMethod.POST
+            elif prefix == "get":
                http_method = HTTPMethod.GET
-        else:
-            if prefix == "set":
+            elif prefix == "set":
                http_method = HTTPMethod.PUT
            elif prefix == "update":
                http_method = HTTPMethod.PATCH
            else:
-                http_method = HTTPMethod.POST
+                raise ValidationError(f"unknown prefix {prefix}")

        result.append(
            EndpointOperation(
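
The `process_type` helper added in the hunk above documents a streamed response by its item type rather than by the iterator. The standalone sketch below reproduces that idea under one assumption: it rebuilds the union with the public `Union[...]` subscript instead of the private `typing._UnionGenericAlias` used in the diff.

```python
import collections.abc
import typing
from typing import AsyncIterator, Union


def process_type(t):
    """Replace AsyncIterator[T] with T, recursing into Union members."""
    origin = typing.get_origin(t)
    if origin is collections.abc.AsyncIterator:
        # A streamed (SSE) response is represented by its item type.
        return typing.get_args(t)[0]
    if origin is Union:
        members = tuple(process_type(a) for a in typing.get_args(t))
        # Subscripting Union with a tuple rebuilds the union via the public API.
        return Union[members]
    return t


print(process_type(AsyncIterator[int]))
print(process_type(Union[str, AsyncIterator[bytes]]))
```

Non-generic types pass through unchanged, so the helper can be applied to any return annotation.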
--- a/docs/openapi_generator/strong_typing/classdef.py
+++ b/docs/openapi_generator/strong_typing/classdef.py
@ -125,6 +125,7 @@ class JsonSchemaAnyOf(JsonSchemaNode):
@dataclass
 class JsonSchemaOneOf(JsonSchemaNode):
    oneOf: List["JsonSchemaAny"]
+    discriminator: Optional[str]


 JsonSchemaAny = Union[
--- a/docs/openapi_generator/strong_typing/inspection.py
+++ b/docs/openapi_generator/strong_typing/inspection.py
@ -342,7 +342,6 @@ def is_type_union(typ: object) -> bool:
    "True if the type annotation corresponds to a union type (e.g. `Union[T1,T2,T3]`)."

    typ = unwrap_annotated_type(typ)
-
    if _is_union_like(typ):
        args = typing.get_args(typ)
        return len(args) > 2 or type(None) not in args
@ -358,6 +357,7 @@ def unwrap_union_types(typ: object) -> Tuple[object, ...]:
    :returns: The inner types `T1`, `T2`, etc.
    """

+    typ = unwrap_annotated_type(typ)
    return _unwrap_union_types(typ)


--- a/docs/openapi_generator/strong_typing/schema.py
+++ b/docs/openapi_generator/strong_typing/schema.py
@ -36,6 +36,7 @@ from typing import (
 )

 import jsonschema
+from typing_extensions import Annotated

 from . import docstring
 from .auxiliary import (
@ -329,7 +330,6 @@ class JsonSchemaGenerator:
        if metadata is not None:
            # type is Annotated[T, ...]
            typ = typing.get_args(data_type)[0]
-
            schema = self._simple_type_to_schema(typ)
            if schema is not None:
                # recognize well-known auxiliary types
@ -446,12 +446,20 @@ class JsonSchemaGenerator:
                ],
            }
        elif origin_type is Union:
-            return {
+            discriminator = None
+            if typing.get_origin(data_type) is Annotated:
+                discriminator = typing.get_args(data_type)[1].discriminator
+            ret = {
                "oneOf": [
                    self.type_to_schema(union_type)
                    for union_type in typing.get_args(typ)
                ]
            }
+            if discriminator:
+                ret["discriminator"] = {
+                    "propertyName": discriminator,
+                }
+            return ret
        elif origin_type is Literal:
            (literal_value,) = typing.get_args(typ)  # unpack value of literal type
            schema = self.type_to_schema(type(literal_value))
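
The `Union` branch above attaches an OpenAPI-style `discriminator` to a `oneOf` schema when the union is wrapped in `Annotated`. A minimal sketch of that behaviour follows. `UnionTag`, `union_schema`, `ref`, `Cat`, and `Dog` are hypothetical names chosen for illustration; the diff only shows that the metadata object exposes a `.discriminator` attribute.

```python
import typing
from dataclasses import dataclass
from typing import Annotated, Literal, Union


@dataclass
class UnionTag:
    """Hypothetical metadata object; stands in for whatever carries `.discriminator`."""
    discriminator: str


def union_schema(data_type, member_schema):
    """Build a oneOf schema, adding a discriminator when the union is Annotated."""
    discriminator = None
    typ = data_type
    if typing.get_origin(data_type) is Annotated:
        # get_args(Annotated[T, meta]) returns (T, meta)
        typ, metadata = typing.get_args(data_type)[:2]
        discriminator = getattr(metadata, "discriminator", None)
    schema = {"oneOf": [member_schema(t) for t in typing.get_args(typ)]}
    if discriminator:
        schema["discriminator"] = {"propertyName": discriminator}
    return schema


@dataclass
class Cat:
    type: Literal["cat"]


@dataclass
class Dog:
    type: Literal["dog"]


def ref(t):
    # Illustrative member schema: a reference by class name.
    return {"$ref": f"#/components/schemas/{t.__name__}"}


Pet = Annotated[Union[Cat, Dog], UnionTag(discriminator="type")]
schema = union_schema(Pet, ref)
print(schema)
```

A plain `Union[Cat, Dog]` without the `Annotated` wrapper yields the same `oneOf` list with no `discriminator` key.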
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@ -1,3 +1,13 @@
 sphinx
 myst-parser
 linkify
+-e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
+sphinx-rtd-theme>=1.0.0
+sphinx-pdj-theme
+sphinx-copybutton
+sphinx-tabs
+sphinx-design
+sphinxcontrib-openapi
+sphinxcontrib-redoc
+sphinxcontrib-mermaid
+sphinxcontrib-video
--- a/docs/resources/llama-stack-spec.html
+++ b/docs/resources/llama-stack-spec.html
--- a/docs/resources/llama-stack-spec.yaml
+++ b/docs/resources/llama-stack-spec.yaml
--- a/docs/source/building_applications/agent_execution_loop.md
+++ b/docs/source/building_applications/agent_execution_loop.md
@ -0,0 +1,133 @@
+## Agent Execution Loop
+
+Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.
+
+Each agent turn follows these key steps:
+
+1. **Initial Safety Check**: The user's input is first screened through configured safety shields
+
+2. **Context Retrieval**:
+   - If RAG is enabled, the agent queries relevant documents from memory banks
+   - New documents are first inserted into the memory bank
+   - Retrieved context is added to the user's prompt
+
+3. **Inference Loop**: The agent enters its main execution loop:
+   - The LLM receives the augmented prompt (with context and/or previous tool outputs)
+   - The LLM generates a response, potentially with tool calls
+   - If tool calls are present:
+     - Tool inputs are safety-checked
+     - Tools are executed (e.g., web search, code execution)
+     - Tool responses are fed back to the LLM for synthesis
+   - The loop continues until:
+     - The LLM provides a final response without tool calls
+     - Maximum iterations are reached
+     - Token limit is exceeded
+
+4. **Final Safety Check**: The agent's final response is screened through safety shields
+
+```{mermaid}
+sequenceDiagram
+    participant U as User
+    participant E as Executor
+    participant M as Memory Bank
+    participant L as LLM
+    participant T as Tools
+    participant S as Safety Shield
+
+    Note over U,S: Agent Turn Start
+    U->>S: 1. Submit Prompt
+    activate S
+    S->>E: Input Safety Check
+    deactivate S
+
+    E->>M: 2.1 Query Context
+    M-->>E: 2.2 Retrieved Documents
+
+    loop Inference Loop
+        E->>L: 3.1 Augment with Context
+        L-->>E: 3.2 Response (with/without tool calls)
+
+        alt Has Tool Calls
+            E->>S: Check Tool Input
+            S->>T: 4.1 Execute Tool
+            T-->>E: 4.2 Tool Response
+            E->>L: 5.1 Tool Response
+            L-->>E: 5.2 Synthesized Response
+        end
+
+        opt Stop Conditions
+            Note over E: Break if:
+            Note over E: - No tool calls
+            Note over E: - Max iterations reached
+            Note over E: - Token limit exceeded
+        end
+    end
+
+    E->>S: Output Safety Check
+    S->>U: 6. Final Response
+```
+
+Each step in this process can be monitored and controlled through configurations. Here's an example that demonstrates monitoring the agent's execution:
+
+```python
+from llama_stack_client.lib.agents.event_logger import EventLogger
+
+agent_config = AgentConfig(
+    model="Llama3.2-3B-Instruct",
+    instructions="You are a helpful assistant",
+    # Enable both RAG and tool usage
+    tools=[
+        {
+            "type": "memory",
+            "memory_bank_configs": [{
+                "type": "vector",
+                "bank_id": "my_docs"
+            }],
+            "max_tokens_in_context": 4096
+        },
+        {
+            "type": "code_interpreter",
+            "enable_inline_code_execution": True
+        }
+    ],
+    # Configure safety
+    input_shields=["content_safety"],
+    output_shields=["content_safety"],
+    # Control the inference loop
+    max_infer_iters=5,
+    sampling_params={
+        "strategy": {
+            "type": "top_p",
+            "temperature": 0.7,
+            "top_p": 0.95
+        },
+        "max_tokens": 2048
+    }
+)
+
+agent = Agent(client, agent_config)
+session_id = agent.create_session("monitored_session")
+
+# Stream the agent's execution steps
+response = agent.create_turn(
+    messages=[{"role": "user", "content": "Analyze this code and run it"}],
+    attachments=[{
+        "content": "https://raw.githubusercontent.com/example/code.py",
+        "mime_type": "text/plain"
+    }],
+    session_id=session_id
+)
+
+# Monitor each step of execution
+for log in EventLogger().log(response):
+    if log.event.step_type == "memory_retrieval":
+        print("Retrieved context:", log.event.retrieved_context)
+    elif log.event.step_type == "inference":
+        print("LLM output:", log.event.model_response)
+    elif log.event.step_type == "tool_execution":
+        print("Tool call:", log.event.tool_call)
+        print("Tool response:", log.event.tool_response)
+    elif log.event.step_type == "shield_call":
+        if log.event.violation:
+            print("Safety violation:", log.event.violation)
+```
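
The four steps of an agent turn described in `agent_execution_loop.md` can be sketched in plain Python. This is a minimal illustration, not the Llama Stack implementation: `run_turn`, `LLMReply`, `fake_llm`, and the callables passed in are hypothetical stand-ins, and the token-limit stop condition is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMReply:
    text: str
    tool_call: Optional[str] = None  # name of the requested tool, if any
    tool_input: str = ""


def run_turn(prompt, llm, tools, shield, retrieve, max_infer_iters=5):
    """One agent turn: input shield, retrieval, inference loop, output shield."""
    if not shield(prompt):                       # 1. initial safety check
        return "[blocked by input shield]"
    context = [prompt] + retrieve(prompt)        # 2. context retrieval
    reply = LLMReply(text="")
    for _ in range(max_infer_iters):             # 3. inference loop
        reply = llm(context)
        if reply.tool_call is None:
            break                                # final response, no tool calls
        if not shield(reply.tool_input):         # tool inputs are safety-checked
            return "[blocked tool input]"
        context.append(tools[reply.tool_call](reply.tool_input))
    if not shield(reply.text):                   # 4. final safety check
        return "[blocked by output shield]"
    return reply.text


def fake_llm(context):
    # Ask for the calculator once, then answer using its output.
    if len(context) < 3:
        return LLMReply(text="", tool_call="calc", tool_input="2+3")
    return LLMReply(text=f"The result is {context[-1]}")


answer = run_turn(
    "What is 2+3?",
    llm=fake_llm,
    tools={"calc": lambda expr: str(sum(int(x) for x in expr.split("+")))},
    shield=lambda text: "forbidden" not in text,
    retrieve=lambda query: ["(retrieved) Addition combines two numbers."],
)
print(answer)
```

If the iteration cap is reached while the model is still requesting tools, this sketch returns whatever text the last reply carried, which may be empty.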
--- a/docs/source/building_applications/evals.md
+++ b/docs/source/building_applications/evals.md
@ -0,0 +1,169 @@
+# Evals
+
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)
+
+Llama Stack provides the building blocks needed to run benchmark and application evaluations. This guide will walk you through how to use these components to run open benchmark evaluations. Visit our [Evaluation Concepts](../concepts/evaluation_concepts.md) guide for more details on how evaluations work in Llama Stack, and our [Evaluation Reference](../references/evals_reference/index.md) guide for a comprehensive reference on the APIs.
+
+### 1. Open Benchmark Model Evaluation
+
+This first example walks you through how to evaluate a model candidate served by Llama Stack on open benchmarks. We will use the following benchmarks:
+- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.
+- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to assess the ability of models to answer short, fact-seeking questions.
+
+#### 1.1 Running MMMU
+- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in this [GitHub Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into the format accepted by the `inference/chat-completion` API.
+
+```python
+import datasets
+ds = datasets.load_dataset(path="llamastack/mmmu", name="Agriculture", split="dev")
+ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
+eval_rows = ds.to_pandas().to_dict(orient="records")
+```
+
+- Next, we will run an evaluation on a model candidate. To do so, we need to:
+  - Define a system prompt
+  - Define an EvalCandidate
+  - Run the evaluation on the dataset
+
+```python
+SYSTEM_PROMPT_TEMPLATE = """
+You are an expert in Agriculture whose job is to answer questions from the user using images.
+First, reason about the correct answer.
+Then write the answer in the following format where X is exactly one of A,B,C,D:
+Answer: X
+Make sure X is one of A,B,C,D.
+If you are uncertain of the correct answer, guess the most likely one.
+"""
+
+system_message = {
+    "role": "system",
+    "content": SYSTEM_PROMPT_TEMPLATE,
+}
+
+client.eval_tasks.register(
+    eval_task_id="meta-reference::mmmu",
+    dataset_id="mmmu-Agriculture-dev",  # subset and split loaded above
+    scoring_functions=["basic::regex_parser_multiple_choice_answer"]
+)
+
+response = client.eval.evaluate_rows(
+    task_id="meta-reference::mmmu",
+    input_rows=eval_rows,
+    scoring_functions=["basic::regex_parser_multiple_choice_answer"],
+    task_config={
+        "type": "benchmark",
+        "eval_candidate": {
+            "type": "model",
+            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
+            "sampling_params": {
+                "strategy": {
+                    "type": "greedy",
+                },
+                "max_tokens": 4096,
+                "repeat_penalty": 1.0,
+            },
+            "system_message": system_message
+        }
+    }
+)
+```
+
+#### 1.2. Running SimpleQA
+- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa), which is obtained by transforming the input query into the format accepted by the `inference/chat-completion` API.
+- Since we will use this same dataset in our next example for agentic evaluation, we will register it using the `/datasets` API and interact with it through the `/datasetio` API.
+
+```python
+simpleqa_dataset_id = "huggingface::simpleqa"
+
+_ = client.datasets.register(
+    dataset_id=simpleqa_dataset_id,
+    provider_id="huggingface",
+    url={"uri": "https://huggingface.co/datasets/llamastack/evals"},
+    metadata={
+        "path": "llamastack/evals",
+        "name": "evals__simpleqa",
+        "split": "train",
+    },
+    dataset_schema={
+        "input_query": {"type": "string"},
+        "expected_answer": {"type": "string"},
+        "chat_completion_input": {"type": "chat_completion_input"},
+    }
+)
+
+eval_rows = client.datasetio.get_rows_paginated(
+    dataset_id=simpleqa_dataset_id,
+    rows_in_page=5,
+)
+```
+
+```python
+client.eval_tasks.register(
+    eval_task_id="meta-reference::simpleqa",
+    dataset_id=simpleqa_dataset_id,
+    scoring_functions=["llm-as-judge::405b-simpleqa"]
+)
+
+response = client.eval.evaluate_rows(
+    task_id="meta-reference::simpleqa",
+    input_rows=eval_rows.rows,
+    scoring_functions=["llm-as-judge::405b-simpleqa"],
+    task_config={
+        "type": "benchmark",
+        "eval_candidate": {
+            "type": "model",
+            "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
+            "sampling_params": {
+                "strategy": {
+                    "type": "greedy",
+                },
+                "max_tokens": 4096,
+                "repeat_penalty": 1.0,
+            },
+        }
+    }
+)
+```
+
+
+### 2. Agentic Evaluation
+- In this example, we will demonstrate how to evaluate an agent candidate served by Llama Stack via the `/agent` API.
+- We will continue to use the SimpleQA dataset from the previous example.
+- Instead of running the evaluation on a model, we will run it on a search agent with access to a search tool. We will define our agent evaluation candidate through `AgentConfig`.
+
+```python
+agent_config = {
+    "model": "meta-llama/Llama-3.1-405B-Instruct",
+    "instructions": "You are a helpful assistant",
+    "sampling_params": {
+        "strategy": {
+            "type": "greedy",
+        },
+    },
+    "tools": [
+        {
+            "type": "brave_search",
+            "engine": "tavily",
+            "api_key": userdata.get("TAVILY_SEARCH_API_KEY")
+        }
+    ],
+    "tool_choice": "auto",
+    "tool_prompt_format": "json",
+    "input_shields": [],
+    "output_shields": [],
+    "enable_session_persistence": False
+}
+
+response = client.eval.evaluate_rows(
+    task_id="meta-reference::simpleqa",
+    input_rows=eval_rows.rows,
+    scoring_functions=["llm-as-judge::405b-simpleqa"],
+    task_config={
+        "type": "benchmark",
+        "eval_candidate": {
+            "type": "agent",
+            "config": agent_config,
+        }
+    }
+)
+```
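
The MMMU system prompt above asks the model to finish with `Answer: X`, and the task is scored with `basic::regex_parser_multiple_choice_answer`. The sketch below shows how a scorer of that kind could work. It is an assumption for illustration, not the actual scoring function: `ANSWER_RE`, `parse_choice`, `accuracy`, and the `generated_answer` key are hypothetical, while `expected_answer` is the column name used in the dataset above.

```python
import re

# Matches the "Answer: X" line requested by the system prompt, X in A-D.
ANSWER_RE = re.compile(r"Answer:\s*([A-D])\b")


def parse_choice(generated):
    """Return the last A-D choice found in the model output, or None."""
    matches = ANSWER_RE.findall(generated)
    return matches[-1] if matches else None


def accuracy(rows):
    """Fraction of rows whose parsed choice equals the expected answer."""
    correct = sum(
        parse_choice(row["generated_answer"]) == row["expected_answer"]
        for row in rows
    )
    return correct / len(rows)


rows = [
    {"generated_answer": "The leaf shows rust spots.\nAnswer: B", "expected_answer": "B"},
    {"generated_answer": "I am not sure about this one.", "expected_answer": "C"},
]
print(accuracy(rows))
```

Taking the last match rather than the first tolerates outputs where the model mentions a candidate answer while reasoning and then commits to a different one.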
--- a/docs/source/building_applications/evaluation.md
+++ b/docs/source/building_applications/evaluation.md
@ -0,0 +1,36 @@
+## Testing & Evaluation
+
+Llama Stack provides built-in tools for evaluating your applications:
+
+1. **Benchmarking**: Test against standard datasets
+2. **Application Evaluation**: Score your application's outputs
+3. **Custom Metrics**: Define your own evaluation criteria
+
+Here's how to set up basic evaluation:
+
+```python
+# Create an evaluation task
+response = client.eval_tasks.register(
+    eval_task_id="my_eval",
+    dataset_id="my_dataset",
+    scoring_functions=["accuracy", "relevance"]
+)
+
+# Run evaluation
+job = client.eval.run_eval(
+    task_id="my_eval",
+    task_config={
+        "type": "app",
+        "eval_candidate": {
+            "type": "agent",
+            "config": agent_config
+        }
+    }
+)
+
+# Get results
+result = client.eval.job_result(
+    task_id="my_eval",
+    job_id=job.job_id
+)
+```
--- a/docs/source/building_applications/index.md
+++ b/docs/source/building_applications/index.md
@ -0,0 +1,29 @@
+# Building AI Applications
+
+Llama Stack provides all the building blocks needed to create sophisticated AI applications.
+
+The best way to get started is to look at this notebook, which walks through the various APIs (from basic inference to RAG agents) and how to use them.
+
+**Notebook**: [Building AI Applications](docs/notebooks/Llama_Stack_Building_AI_Applications.ipynb)
+
+Here are some key topics that will help you build effective agents:
+
+- **[Agent Execution Loop](agent_execution_loop)**
+- **[RAG](rag)**
+- **[Safety](safety)**
+- **[Tools](tools)**
+- **[Telemetry](telemetry)**
+- **[Evals](evals)**
+
+
+```{toctree}
+:hidden:
+:maxdepth: 1
+
+agent_execution_loop
+rag
+safety
+tools
+telemetry
+evals
+```
--- a/docs/source/building_applications/rag.md
+++ b/docs/source/building_applications/rag.md
@ -0,0 +1,92 @@
+## Memory & RAG
+
+Memory enables your applications to reference and recall information from previous interactions or external documents. Llama Stack's memory system is built around the concept of Memory Banks:
+
+1. **Vector Memory Banks**: For semantic search and retrieval
+2. **Key-Value Memory Banks**: For structured data storage
+3. **Keyword Memory Banks**: For basic text search
+4. **Graph Memory Banks**: For relationship-based retrieval
+
+Here's how to set up a vector memory bank for RAG:
+
+```python
+# Register a memory bank
+bank_id = "my_documents"
+response = client.memory_banks.register(
+    memory_bank_id=bank_id,
+    params={
+        "memory_bank_type": "vector",
+        "embedding_model": "all-MiniLM-L6-v2",
+        "chunk_size_in_tokens": 512
+    }
+)
+
+# Insert documents
+documents = [
+    {
+        "document_id": "doc1",
+        "content": "Your document text here",
+        "mime_type": "text/plain"
+    }
+]
+client.memory.insert(bank_id, documents)
+
+# Query documents
+results = client.memory.query(
+    bank_id=bank_id,
+    query="What do you know about...",
+)
+```
+
+
+### Building RAG-Enhanced Agents
+
+One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:
+
+```python
+from llama_stack_client.types import Attachment
+
+# Create attachments from documents
+attachments = [
+    Attachment(
+        content="https://raw.githubusercontent.com/example/doc.rst",
+        mime_type="text/plain"
+    )
+]
+
+# Configure agent with memory
+agent_config = AgentConfig(
+    model="Llama3.2-3B-Instruct",
+    instructions="You are a helpful assistant",
+    tools=[{
+        "type": "memory",
+        "memory_bank_configs": [],
+        "query_generator_config": {"type": "default", "sep": " "},
+        "max_tokens_in_context": 4096,
+        "max_chunks": 10
+    }],
+    enable_session_persistence=True
+)
+
+agent = Agent(client, agent_config)
+session_id = agent.create_session("rag_session")
+
+# Initial document ingestion
+response = agent.create_turn(
+    messages=[{
+        "role": "user",
+        "content": "I am providing some documents for reference."
+    }],
+    attachments=attachments,
+    session_id=session_id
+)
+
+# Query with RAG
+response = agent.create_turn(
+    messages=[{
+        "role": "user",
+        "content": "What are the key topics in the documents?"
+    }],
+    session_id=session_id
+)
+```
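
To make the insert-then-query flow of a vector memory bank in `rag.md` concrete, here is a toy retrieval sketch. It assumes nothing about how Llama Stack chunks or embeds documents: `chunk`, `embed`, `cosine`, and `query` are hypothetical helpers, and a bag-of-words count stands in for a real embedding model such as `all-MiniLM-L6-v2`.

```python
import math
from collections import Counter


def chunk(text, chunk_size=6):
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text):
    """Toy embedding: lowercase word counts with surrounding punctuation stripped."""
    return Counter(w.strip(".,?!").lower() for w in text.split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def query(chunks, question, max_chunks=2):
    """Return the chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:max_chunks]


document = (
    "Llamas are domesticated South American camelids. "
    "Vector banks store embeddings for semantic search."
)
chunks = chunk(document)
top = query(chunks, "What do vector banks store?")
print(top[0])
```

The retrieved chunks play the role of the context that the agent adds to the user's prompt, bounded here by `max_chunks`.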
--- a/docs/source/building_applications/safety.md
+++ b/docs/source/building_applications/safety.md
@ -0,0 +1,21 @@
+## Safety Guardrails
+
+Safety is a critical component of any AI application. Llama Stack provides a Shield system that can be applied at multiple touchpoints:
+
+```python
+# Register a safety shield
+shield_id = "content_safety"
+client.shields.register(
+    shield_id=shield_id,
+    provider_shield_id="llama-guard-basic"
+)
+
+# Run content through shield
+response = client.safety.run_shield(
+    shield_id=shield_id,
+    messages=[{"role": "user", "content": "User message here"}]
+)
+
+if response.violation:
+    print(f"Safety violation detected: {response.violation.user_message}")
+```
--- a/docs/source/building_applications/telemetry.md
+++ b/docs/source/building_applications/telemetry.md
@ -0,0 +1,77 @@
+## Telemetry
+
+The Llama Stack telemetry system provides comprehensive tracing, metrics, and logging capabilities. It supports multiple sink types including OpenTelemetry, SQLite, and Console output.
+
+### Events
+The telemetry system supports three main types of events:
+
+- **Unstructured Log Events**: Free-form log messages with severity levels
+```python
+unstructured_log_event = UnstructuredLogEvent(
+    message="This is a log message",
+    severity=LogSeverity.INFO
+)
+```
+- **Metric Events**: Numerical measurements with units
+```python
+metric_event = MetricEvent(
+    metric="my_metric",
+    value=10,
+    unit="count"
+)
+```
+- **Structured Log Events**: System events like span start/end. Extensible to add more structured log types.
+```python
+structured_log_event = SpanStartPayload(
+    name="my_span",
+    parent_span_id="parent_span_id"
+)
+```
+
+### Spans and Traces
+- **Spans**: Represent operations with timing and hierarchical relationships
+- **Traces**: Collection of related spans forming a complete request flow
+
+### Sinks
+- **OpenTelemetry**: Send events to an OpenTelemetry Collector. This is useful for visualizing traces in a tool like Jaeger.
+- **SQLite**: Store events in a local SQLite database. This is needed if you want to query the events later through the Llama Stack API.
+- **Console**: Print events to the console.
+
+### Providers
+
+#### Meta-Reference Provider
+Currently, only the meta-reference provider is implemented. It can be configured to send events to three sink types:
+1) OpenTelemetry Collector
+2) SQLite
+3) Console
+
+#### Configuration
+
+Here's an example that sends telemetry signals to all three sink types. Your configuration might use only one.
+```yaml
+  telemetry:
+  - provider_id: meta-reference
+    provider_type: inline::meta-reference
+    config:
+      sinks: ['console', 'sqlite', 'otel']
+      otel_endpoint: "http://localhost:4318/v1/traces"
+      sqlite_db_path: "/path/to/telemetry.db"
+```
+
+### Jaeger to visualize traces
+
+The `otel` sink works with any service compatible with the OpenTelemetry collector. Let's use Jaeger to visualize this data.
+
+Start a Jaeger instance with the OTLP HTTP endpoint at 4318 and the Jaeger UI at 16686 using the following command:
+
+```bash
+$ docker run --rm --name jaeger \
+  -p 16686:16686 -p 4318:4318 \
+  jaegertracing/jaeger:2.1.0
+```
+
+Once the Jaeger instance is running, you can visualize traces by navigating to http://localhost:16686/.
+
+### Querying Traces Stored in SQLite
+
+The `sqlite` sink allows you to query traces without an external system. Here are some example queries. Refer to the notebook at [Llama Stack Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for more examples on how to query traces and spans.
				`@ -0,0 +1 @@`
				`../../llama_stack/templates/bedrock/run.yaml`
				`@ -0,0 +1 @@`
				`../../llama_stack/templates/cerebras/build.yaml`
				`@ -0,0 +1 @@`
				`../../llama_stack/templates/cerebras/run.yaml`
				`@ -1 +0,0 @@`
				`../../llama_stack/templates/databricks/build.yaml`
				`@ -0,0 +1 @@`
				`../../llama_stack/templates/fireworks/run.yaml`
				`@ -1 +0,0 @@`
				`../../llama_stack/templates/hf-endpoint/build.yaml`
				`@ -1 +0,0 @@`
				`../../llama_stack/templates/hf-serverless/build.yaml`
				`@ -0,0 +1 @@`
				`../../llama_stack/templates/meta-reference-gpu/run-with-safety.yaml`
				`@ -0,0 +1 @@`
				`../../llama_stack/templates/ollama/run-with-safety.yaml`
				`@ -0,0 +1 @@`
				`../../llama_stack/templates/nvidia/build.yaml`