Merge branch 'main' into add-mcp-streamable-http-support

2025-07-29 15:23:51 +00:00 · 2025-07-02 10:51:42 -04:00 · e027a526c9
commit e027a526c9
parent 8e5ab564b9 4d0d2d685f
81 changed files with 811 additions and 689 deletions
--- a/.github/workflows/integration-vector-io-tests.yml
+++ b/.github/workflows/integration-vector-io-tests.yml
@ -22,7 +22,7 @@ jobs:
    runs-on: ubuntu-latest
    strategy:
      matrix:
-        vector-io-provider: ["inline::faiss", "inline::sqlite-vec", "remote::chromadb", "remote::pgvector"]
+        vector-io-provider: ["inline::faiss", "inline::sqlite-vec", "inline::milvus", "remote::chromadb", "remote::pgvector"]
        python-version: ["3.12", "3.13"]
      fail-fast: false # we want to run all tests regardless of failure

--- a/docs/source/concepts/apis.md
+++ b/docs/source/concepts/apis.md
@ -14,5 +14,5 @@ A Llama Stack API is described as a collection of REST endpoints. We currently s
 We are working on adding a few more APIs to complete the application lifecycle. These will include:
 - **Batch Inference**: run inference on a dataset of inputs
 - **Batch Agents**: run agents on a dataset of inputs
- **Post Training**: fine-tune a Llama model
+- **Post Training**: fine-tune a model
 - **Synthetic Data Generation**: generate synthetic data for model development
--- a/docs/source/distributions/configuration.md
+++ b/docs/source/distributions/configuration.md
@ -67,7 +67,7 @@ Let's break this down into the different sections. The first section specifies t
 apis:
 - agents
 - inference
- memory
+- vector_io
 - safety
 - telemetry
 ```
@ -125,7 +125,7 @@ config:
 ```

 If the environment variable is not set, the default value `http://localhost:11434` will be used.
-Empty defaults are not allowed so `url: ${env.OLLAMA_URL:=}` will raise an error if the environment variable is not set.
+Empty defaults are allowed so `url: ${env.OLLAMA_URL:=}` will be set to `None` if the environment variable is not set.

 #### Conditional Values

@ -139,8 +139,10 @@ config:

 If the environment variable is set, the value after `:+` will be used. If it's not set, the field
 will be omitted with a `None` value.
-So `${env.ENVIRONMENT:+}` is supported, it means that the field will be omitted if the environment
-variable is not set. It can be used to make a field optional and then enabled at runtime when desired.
+
+Do not use the conditional form (`${env.OLLAMA_URL:+}`) where an empty default (`${env.OLLAMA_URL:=}`) is intended.
+An empty default resolves to `None` if the environment variable is not set.
+Use the conditional form only to supply a value when the environment variable is set.

 #### Examples
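
Review note: the `:=` (default) and `:+` (conditional) rules described in this hunk can be sketched in a few lines of Python. This is a hypothetical re-implementation for illustration only, not the Llama Stack resolver; the `resolve` function, its regular expression, and the choice to map an empty result to `None` are assumptions based on the wording above.

```python
import re
from typing import Dict, Optional

# Matches ${env.NAME:=default} and ${env.NAME:+value}.
# Hypothetical pattern for illustration; the real resolver may differ.
_PATTERN = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*):([=+])([^}]*)\}")


def resolve(value: str, env: Dict[str, str]) -> Optional[str]:
    """Expand env placeholders in a single config string."""

    def expand(match: "re.Match[str]") -> str:
        name, op, arg = match.groups()
        current = env.get(name)
        if op == "=":
            # Default form: use the variable if set, else the (possibly empty) default.
            return current if current else arg
        # Conditional form: emit the value only when the variable is set.
        return arg if current else ""

    result = _PATTERN.sub(expand, value)
    # Assumption: an empty result stands for "no value", i.e. None.
    return result or None
```

With an empty environment, `resolve("${env.OLLAMA_URL:=}", {})` yields `None` under these assumptions, while `resolve("${env.SQLITE_STORE_DIR:=~/.llama/dummy}/milvus_registry.db", {})` keeps the literal suffix after the placeholder, matching the Milvus sample configuration changed later in this commit.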

--- a/docs/source/providers/datasetio/remote_nvidia.md
+++ b/docs/source/providers/datasetio/remote_nvidia.md
@ -16,7 +16,7 @@ NVIDIA's dataset I/O provider for accessing datasets from NVIDIA's data platform
 ## Sample Configuration

 ```yaml
-api_key: ${env.NVIDIA_API_KEY:+}
+api_key: ${env.NVIDIA_API_KEY:=}
 dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
 project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
 datasets_url: ${env.NVIDIA_DATASETS_URL:=http://nemo.test}
--- a/docs/source/providers/external.md
+++ b/docs/source/providers/external.md
@ -1,4 +1,4 @@
-# External Providers
+# External Providers Guide

 Llama Stack supports external providers that live outside of the main codebase. This allows you to:
 - Create and maintain your own providers independently
--- a/docs/source/providers/index.md
+++ b/docs/source/providers/index.md
@ -13,7 +13,13 @@ Importantly, Llama Stack always strives to provide at least one fully inline pro

 ## External Providers

-Llama Stack supports external providers that live outside of the main codebase. This allows you to create and maintain your own providers independently. See the [External Providers Guide](external) for details.
+Llama Stack supports external providers that live outside of the main codebase. This allows you to create and maintain your own providers independently.
+
+```{toctree}
+:maxdepth: 1
+
+external
+```

 ## Agents
 Run multi-step agentic workflows with LLMs with tool usage, memory (RAG), etc.
--- a/docs/source/providers/inference/remote_nvidia.md
+++ b/docs/source/providers/inference/remote_nvidia.md
@ -17,7 +17,7 @@ NVIDIA inference provider for accessing NVIDIA NIM models and AI services.

 ```yaml
 url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}
-api_key: ${env.NVIDIA_API_KEY:+}
+api_key: ${env.NVIDIA_API_KEY:=}
 append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True}

 ```
--- a/docs/source/providers/inference/remote_runpod.md
+++ b/docs/source/providers/inference/remote_runpod.md
@ -14,8 +14,8 @@ RunPod inference provider for running models on RunPod's cloud GPU platform.
 ## Sample Configuration

 ```yaml
-url: ${env.RUNPOD_URL:+}
-api_token: ${env.RUNPOD_API_TOKEN:+}
+url: ${env.RUNPOD_URL:=}
+api_token: ${env.RUNPOD_API_TOKEN:=}

 ```

--- a/docs/source/providers/inference/remote_together.md
+++ b/docs/source/providers/inference/remote_together.md
@ -15,7 +15,7 @@ Together AI inference provider for open-source models and collaborative AI devel

 ```yaml
 url: https://api.together.xyz/v1
-api_key: ${env.TOGETHER_API_KEY:+}
+api_key: ${env.TOGETHER_API_KEY:=}

 ```

--- a/docs/source/providers/inference/remote_watsonx.md
+++ b/docs/source/providers/inference/remote_watsonx.md
@ -17,8 +17,8 @@ IBM WatsonX inference provider for accessing AI models on IBM's WatsonX platform

 ```yaml
 url: ${env.WATSONX_BASE_URL:=https://us-south.ml.cloud.ibm.com}
-api_key: ${env.WATSONX_API_KEY:+}
-project_id: ${env.WATSONX_PROJECT_ID:+}
+api_key: ${env.WATSONX_API_KEY:=}
+project_id: ${env.WATSONX_PROJECT_ID:=}

 ```

--- a/docs/source/providers/post_training/remote_nvidia.md
+++ b/docs/source/providers/post_training/remote_nvidia.md
@ -19,7 +19,7 @@ NVIDIA's post-training provider for fine-tuning models on NVIDIA's platform.
 ## Sample Configuration

 ```yaml
-api_key: ${env.NVIDIA_API_KEY:+}
+api_key: ${env.NVIDIA_API_KEY:=}
 dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
 project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
 customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
--- a/docs/source/providers/scoring/inline_braintrust.md
+++ b/docs/source/providers/scoring/inline_braintrust.md
@ -13,7 +13,7 @@ Braintrust scoring provider for evaluation and scoring using the Braintrust plat
 ## Sample Configuration

 ```yaml
-openai_api_key: ${env.OPENAI_API_KEY:+}
+openai_api_key: ${env.OPENAI_API_KEY:=}

 ```

--- a/docs/source/providers/tool_runtime/remote_brave-search.md
+++ b/docs/source/providers/tool_runtime/remote_brave-search.md
@ -14,7 +14,7 @@ Brave Search tool for web search capabilities with privacy-focused results.
 ## Sample Configuration

 ```yaml
-api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+api_key: ${env.BRAVE_SEARCH_API_KEY:=}
 max_results: 3

 ```
--- a/docs/source/providers/tool_runtime/remote_tavily-search.md
+++ b/docs/source/providers/tool_runtime/remote_tavily-search.md
@ -14,7 +14,7 @@ Tavily Search tool for AI-optimized web search with structured results.
 ## Sample Configuration

 ```yaml
-api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+api_key: ${env.TAVILY_SEARCH_API_KEY:=}
 max_results: 3

 ```
--- a/docs/source/providers/tool_runtime/remote_wolfram-alpha.md
+++ b/docs/source/providers/tool_runtime/remote_wolfram-alpha.md
@ -13,7 +13,7 @@ Wolfram Alpha tool for computational knowledge and mathematical calculations.
 ## Sample Configuration

 ```yaml
-api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}

 ```

--- a/docs/source/providers/vector_io/inline_milvus.md
+++ b/docs/source/providers/vector_io/inline_milvus.md
@ -16,11 +16,11 @@ Please refer to the remote provider documentation.
 ## Sample Configuration

 ```yaml
-db_path: ${env.MILVUS_DB_PATH:=~/.llama/dummy/milvus.db}
+db_path: ${env.MILVUS_DB_PATH:=~/.llama/dummy}/milvus.db
 kvstore:
  type: sqlite
  namespace: null
-  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/${env.MILVUS_KVSTORE_DB_PATH:=~/.llama/dummy/milvus_registry.db}
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/milvus_registry.db

 ```

--- a/docs/zero_to_hero_guide/00_Inference101.ipynb
+++ b/docs/zero_to_hero_guide/00_Inference101.ipynb
@ -7,7 +7,7 @@
      "source": [
        "# Llama Stack Inference Guide\n",
        "\n",
-        "This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.1-8B-Instruct` model. \n",
+        "This document provides instructions on how to use Llama Stack's `chat_completion` function for generating text using the `Llama3.2-3B-Instruct` model. \n",
        "\n",
        "Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).\n",
        "\n",
--- a/docs/zero_to_hero_guide/01_Local_Cloud_Inference101.ipynb
+++ b/docs/zero_to_hero_guide/01_Local_Cloud_Inference101.ipynb
@ -26,7 +26,7 @@
  },
  {
   "cell_type": "code",
-      "execution_count": 1,
+   "execution_count": null,
   "id": "d80c0926",
   "metadata": {},
   "outputs": [],
@ -48,16 +48,16 @@
  },
  {
   "cell_type": "code",
-      "execution_count": 2,
+   "execution_count": null,
   "id": "7f868dfe",
   "metadata": {},
   "outputs": [],
   "source": [
-        "from llama_stack_client import LlamaStackClient\n",
+    "from llama_stack_client import AsyncLlamaStackClient\n",
    "\n",
    "# Configure local and cloud clients\n",
-        "local_client = LlamaStackClient(base_url=f'http://{HOST}:{LOCAL_PORT}')\n",
-        "cloud_client = LlamaStackClient(base_url=f'http://{HOST}:{CLOUD_PORT}')"
+    "local_client = AsyncLlamaStackClient(base_url=f'http://{HOST}:{LOCAL_PORT}')\n",
+    "cloud_client = AsyncLlamaStackClient(base_url=f'http://{HOST}:{CLOUD_PORT}')"
   ]
  },
  {
@ -67,23 +67,15 @@
   "source": [
    "#### 3. Client Selection with Fallback\n",
    "\n",
-        "The `select_client` function checks if the local client is available using a lightweight `/health` check. If the local client is unavailable, it automatically switches to the cloud client.\n"
+    "The `select_client` function checks if the local client is available using a lightweight `/v1/health` check. If the local client is unavailable, it automatically switches to the cloud client.\n"
   ]
  },
  {
   "cell_type": "code",
-      "execution_count": 3,
+   "execution_count": null,
   "id": "ff0c8277",
   "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "\u001b[33mUsing local client.\u001b[0m\n"
-          ]
-        }
-      ],
+   "outputs": [],
   "source": [
    "import httpx\n",
    "from termcolor import cprint\n",
@ -91,7 +83,7 @@
    "async def check_client_health(client, client_name: str) -> bool:\n",
    "    try:\n",
    "        async with httpx.AsyncClient() as http_client:\n",
-        "            response = await http_client.get(f'{client.base_url}/health')\n",
+    "            response = await http_client.get(f'{client.base_url}/v1/health')\n",
    "            if response.status_code == 200:\n",
    "                cprint(f'Using {client_name} client.', 'yellow')\n",
    "                return True\n",
@ -102,7 +94,7 @@
    "        cprint(f'Failed to connect to {client_name} client.', 'red')\n",
    "        return False\n",
    "\n",
-        "async def select_client(use_local: bool) -> LlamaStackClient:\n",
+    "async def select_client(use_local: bool) -> AsyncLlamaStackClient:\n",
    "    if use_local and await check_client_health(local_client, 'local'):\n",
    "        return local_client\n",
    "\n",
@ -127,13 +119,12 @@
  },
  {
   "cell_type": "code",
-      "execution_count": 4,
+   "execution_count": null,
   "id": "5e19cc20",
   "metadata": {},
   "outputs": [],
   "source": [
    "from termcolor import cprint\n",
-        "from llama_stack_client.lib.inference.event_logger import EventLogger\n",
    "\n",
    "async def get_llama_response(stream: bool = True, use_local: bool = True):\n",
    "    client = await select_client(use_local)  # Selects the available client\n",
@ -143,17 +134,19 @@
    "    }\n",
    "    cprint(f'User> {message[\"content\"]}', 'green')\n",
    "\n",
-        "    response = client.inference.chat_completion(\n",
+    "    response = await client.inference.chat_completion(\n",
    "        messages=[message],\n",
-        "        model='Llama3.2-11B-Vision-Instruct',\n",
+    "        model_id='meta-llama/Llama3.2-11B-Vision-Instruct',\n",
    "        stream=stream,\n",
    "    )\n",
    "\n",
+    "    cprint(f'Assistant> ', color='cyan', end='')\n",
    "    if not stream:\n",
-        "        cprint(f'> Response: {response.completion_message.content}', 'cyan')\n",
+    "        cprint(response.completion_message.content, color='yellow')\n",
    "    else:\n",
-        "        async for log in EventLogger().log(response):\n",
-        "            log.print()\n"
+    "        async for chunk in response:\n",
+    "            cprint(chunk.event.delta.text, color='yellow', end='')\n",
+    "        cprint('')"
   ]
  },
  {
@ -168,21 +161,10 @@
  },
  {
   "cell_type": "code",
-      "execution_count": 7,
+   "execution_count": null,
   "id": "c10f487e",
   "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "\u001b[33mUsing cloud client.\u001b[0m\n",
-            "\u001b[32mUser> hello world, write me a 2 sentence poem about the moon\u001b[0m\n",
-            "\u001b[36mAssistant> \u001b[0m\u001b[33mSilver\u001b[0m\u001b[33m cres\u001b[0m\u001b[33mcent\u001b[0m\u001b[33m in\u001b[0m\u001b[33m the\u001b[0m\u001b[33m midnight\u001b[0m\u001b[33m sky\u001b[0m\u001b[33m,\n",
-            "\u001b[0m\u001b[33mA\u001b[0m\u001b[33m gentle\u001b[0m\u001b[33m glow\u001b[0m\u001b[33m that\u001b[0m\u001b[33m whispers\u001b[0m\u001b[33m,\u001b[0m\u001b[33m \"\u001b[0m\u001b[33mI\u001b[0m\u001b[33m'm\u001b[0m\u001b[33m passing\u001b[0m\u001b[33m by\u001b[0m\u001b[33m.\"\u001b[0m\u001b[97m\u001b[0m\n"
-          ]
-        }
-      ],
+   "outputs": [],
   "source": [
    "import asyncio\n",
    "\n",
@ -203,21 +185,10 @@
  },
  {
   "cell_type": "code",
-      "execution_count": 8,
+   "execution_count": null,
   "id": "02eacfaf-c7f1-494b-ac28-129d2a0258e3",
   "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "\u001b[33mUsing local client.\u001b[0m\n",
-            "\u001b[32mUser> hello world, write me a 2 sentence poem about the moon\u001b[0m\n",
-            "\u001b[36mAssistant> \u001b[0m\u001b[33mSilver\u001b[0m\u001b[33m cres\u001b[0m\u001b[33mcent\u001b[0m\u001b[33m in\u001b[0m\u001b[33m the\u001b[0m\u001b[33m midnight\u001b[0m\u001b[33m sky\u001b[0m\u001b[33m,\n",
-            "\u001b[0m\u001b[33mA\u001b[0m\u001b[33m gentle\u001b[0m\u001b[33m glow\u001b[0m\u001b[33m that\u001b[0m\u001b[33m whispers\u001b[0m\u001b[33m,\u001b[0m\u001b[33m \"\u001b[0m\u001b[33mI\u001b[0m\u001b[33m'm\u001b[0m\u001b[33m passing\u001b[0m\u001b[33m by\u001b[0m\u001b[33m.\"\u001b[0m\u001b[97m\u001b[0m\n"
-          ]
-        }
-      ],
+   "outputs": [],
   "source": [
    "import asyncio\n",
    "\n",
@ -233,6 +204,12 @@
    "\n",
    "The next one will be a guide on [Prompt Engineering](./02_Prompt_Engineering101.ipynb), please continue learning!"
   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3ad6db48",
+   "metadata": {},
+   "source": []
  }
 ],
 "metadata": {
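
Review note: the fallback logic this notebook describes (probe `/v1/health` on the local endpoint, otherwise switch to the cloud endpoint) can be sketched without any client library. The names `is_healthy` and `select_base_url` are hypothetical, the sketch uses the standard library instead of `httpx`, and raising `ConnectionError` when both endpoints fail is an assumption, since the diff does not show what the notebook does in that case.

```python
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/v1/health answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/v1/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def select_base_url(
    local_url: str,
    cloud_url: str,
    use_local: bool = True,
    check: Callable[[str], bool] = is_healthy,
) -> str:
    """Prefer the local endpoint when requested and healthy, else fall back."""
    if use_local and check(local_url):
        return local_url
    if check(cloud_url):
        return cloud_url
    # Assumption: fail loudly when neither endpoint responds.
    raise ConnectionError("Neither the local nor the cloud endpoint is healthy.")
```

Passing the health check in as `check` keeps the selection logic testable without a running server; the returned URL could then be handed to `AsyncLlamaStackClient(base_url=...)` as in the notebook.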
--- a/docs/zero_to_hero_guide/03_Image_Chat101.ipynb
+++ b/docs/zero_to_hero_guide/03_Image_Chat101.ipynb
@ -23,8 +23,6 @@
        "import base64\n",
        "import mimetypes\n",
        "from llama_stack_client import LlamaStackClient\n",
-        "from llama_stack_client.lib.inference.event_logger import EventLogger\n",
-        "from llama_stack_client.types import UserMessage\n",
        "from termcolor import cprint"
      ]
    },
@ -45,8 +43,8 @@
      "outputs": [],
      "source": [
        "HOST = \"localhost\"  # Replace with your host\n",
-        "CLOUD_PORT = 8321       # Replace with your cloud distro port\n",
-        "MODEL_NAME='Llama3.2-11B-Vision-Instruct'"
+        "PORT = 8321         # Replace with your cloud distro port\n",
+        "MODEL_NAME='meta-llama/Llama3.2-11B-Vision-Instruct'"
      ]
    },
    {
@ -65,11 +63,6 @@
      "metadata": {},
      "outputs": [],
      "source": [
-        "import base64\n",
-        "import mimetypes\n",
-        "from termcolor import cprint\n",
-        "from llama_stack_client.lib.inference.event_logger import EventLogger\n",
-        "\n",
        "def encode_image_to_data_url(file_path: str) -> str:\n",
        "    \"\"\"\n",
        "    Encode an image file to a data URL.\n",
@ -103,8 +96,8 @@
        "    message = {\n",
        "        \"role\": \"user\",\n",
        "        \"content\": [\n",
-        "            {\"image\": {\"uri\": data_url}},\n",
-        "            \"Describe what is in this image.\"\n",
+        "            {\"type\": \"image\", \"image\": {\"url\": {\"uri\": data_url}}},\n",
+        "            {\"type\": \"text\", \"text\": \"Describe what is in this image.\"}\n",
        "        ]\n",
        "    }\n",
        "\n",
@ -115,11 +108,13 @@
        "        stream=stream,\n",
        "    )\n",
        "\n",
+        "    cprint(f'Assistant> ', color='cyan', end='')\n",
        "    if not stream:\n",
-        "        cprint(f\"> Response: {response}\", \"cyan\")\n",
+        "        cprint(response.completion_message.content, color='yellow')\n",
        "    else:\n",
-        "        async for log in EventLogger().log(response):\n",
-        "            log.print()\n"
+        "        for chunk in response:\n",
+        "            cprint(chunk.event.delta.text, color='yellow', end='')\n",
+        "        cprint('')\n"
      ]
    },
    {
@ -134,23 +129,10 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 6,
+      "execution_count": null,
      "id": "64d36476-95d7-49f9-a548-312cf8d8c49e",
      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "\u001b[32mUser> Sending image for analysis...\u001b[0m\n",
-            "\u001b[36mAssistant> \u001b[0m\u001b[33mThe\u001b[0m\u001b[33m image\u001b[0m\u001b[33m features\u001b[0m\u001b[33m a\u001b[0m\u001b[33m simple\u001b[0m\u001b[33m,\u001b[0m\u001b[33m mon\u001b[0m\u001b[33moch\u001b[0m\u001b[33mromatic\u001b[0m\u001b[33m line\u001b[0m\u001b[33m drawing\u001b[0m\u001b[33m of\u001b[0m\u001b[33m a\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m,\u001b[0m\u001b[33m with\u001b[0m\u001b[33m the\u001b[0m\u001b[33m words\u001b[0m\u001b[33m \"\u001b[0m\u001b[33mLL\u001b[0m\u001b[33mAMA\u001b[0m\u001b[33m STACK\u001b[0m\u001b[33m\"\u001b[0m\u001b[33m written\u001b[0m\u001b[33m above\u001b[0m\u001b[33m it\u001b[0m\u001b[33m.\u001b[0m\u001b[33m The\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m is\u001b[0m\u001b[33m depicted\u001b[0m\u001b[33m in\u001b[0m\u001b[33m a\u001b[0m\u001b[33m cartoon\u001b[0m\u001b[33mish\u001b[0m\u001b[33m style\u001b[0m\u001b[33m,\u001b[0m\u001b[33m with\u001b[0m\u001b[33m a\u001b[0m\u001b[33m large\u001b[0m\u001b[33m body\u001b[0m\u001b[33m and\u001b[0m\u001b[33m a\u001b[0m\u001b[33m long\u001b[0m\u001b[33m neck\u001b[0m\u001b[33m.\u001b[0m\u001b[33m It\u001b[0m\u001b[33m has\u001b[0m\u001b[33m a\u001b[0m\u001b[33m distinctive\u001b[0m\u001b[33m head\u001b[0m\u001b[33m shape\u001b[0m\u001b[33m,\u001b[0m\u001b[33m with\u001b[0m\u001b[33m a\u001b[0m\u001b[33m small\u001b[0m\u001b[33m circle\u001b[0m\u001b[33m for\u001b[0m\u001b[33m the\u001b[0m\u001b[33m eye\u001b[0m\u001b[33m and\u001b[0m\u001b[33m a\u001b[0m\u001b[33m curved\u001b[0m\u001b[33m line\u001b[0m\u001b[33m for\u001b[0m\u001b[33m the\u001b[0m\u001b[33m mouth\u001b[0m\u001b[33m.\u001b[0m\u001b[33m The\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m's\u001b[0m\u001b[33m body\u001b[0m\u001b[33m is\u001b[0m\u001b[33m composed\u001b[0m\u001b[33m of\u001b[0m\u001b[33m several\u001b[0m\u001b[33m rounded\u001b[0m\u001b[33m shapes\u001b[0m\u001b[33m,\u001b[0m\u001b[33m giving\u001b[0m\u001b[33m it\u001b[0m\u001b[33m a\u001b[0m\u001b[33m soft\u001b[0m\u001b[33m and\u001b[0m\u001b[33m cudd\u001b[0m\u001b[33mly\u001b[0m\u001b[33m appearance\u001b[0m\u001b[33m.\n",
-            "\n",
-            "\u001b[0m\u001b[33mThe\u001b[0m\u001b[33m words\u001b[0m\u001b[33m \"\u001b[0m\u001b[33mLL\u001b[0m\u001b[33mAMA\u001b[0m\u001b[33m STACK\u001b[0m\u001b[33m\"\u001b[0m\u001b[33m are\u001b[0m\u001b[33m written\u001b[0m\u001b[33m in\u001b[0m\u001b[33m a\u001b[0m\u001b[33m playful\u001b[0m\u001b[33m,\u001b[0m\u001b[33m handwritten\u001b[0m\u001b[33m font\u001b[0m\u001b[33m above\u001b[0m\u001b[33m the\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m's\u001b[0m\u001b[33m head\u001b[0m\u001b[33m.\u001b[0m\u001b[33m The\u001b[0m\u001b[33m text\u001b[0m\u001b[33m is\u001b[0m\u001b[33m also\u001b[0m\u001b[33m in\u001b[0m\u001b[33m a\u001b[0m\u001b[33m mon\u001b[0m\u001b[33moch\u001b[0m\u001b[33mromatic\u001b[0m\u001b[33m color\u001b[0m\u001b[33m scheme\u001b[0m\u001b[33m,\u001b[0m\u001b[33m matching\u001b[0m\u001b[33m the\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m's\u001b[0m\u001b[33m outline\u001b[0m\u001b[33m.\u001b[0m\u001b[33m The\u001b[0m\u001b[33m background\u001b[0m\u001b[33m of\u001b[0m\u001b[33m the\u001b[0m\u001b[33m image\u001b[0m\u001b[33m is\u001b[0m\u001b[33m a\u001b[0m\u001b[33m solid\u001b[0m\u001b[33m black\u001b[0m\u001b[33m color\u001b[0m\u001b[33m,\u001b[0m\u001b[33m which\u001b[0m\u001b[33m provides\u001b[0m\u001b[33m a\u001b[0m\u001b[33m clean\u001b[0m\u001b[33m and\u001b[0m\u001b[33m simple\u001b[0m\u001b[33m contrast\u001b[0m\u001b[33m to\u001b[0m\u001b[33m the\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m's\u001b[0m\u001b[33m design\u001b[0m\u001b[33m.\n",
-            "\n",
-            "\u001b[0m\u001b[33mOverall\u001b[0m\u001b[33m,\u001b[0m\u001b[33m the\u001b[0m\u001b[33m image\u001b[0m\u001b[33m appears\u001b[0m\u001b[33m to\u001b[0m\u001b[33m be\u001b[0m\u001b[33m a\u001b[0m\u001b[33m logo\u001b[0m\u001b[33m or\u001b[0m\u001b[33m icon\u001b[0m\u001b[33m for\u001b[0m\u001b[33m a\u001b[0m\u001b[33m brand\u001b[0m\u001b[33m or\u001b[0m\u001b[33m product\u001b[0m\u001b[33m called\u001b[0m\u001b[33m \"\u001b[0m\u001b[33mL\u001b[0m\u001b[33mlama\u001b[0m\u001b[33m Stack\u001b[0m\u001b[33m.\"\u001b[0m\u001b[33m The\u001b[0m\u001b[33m use\u001b[0m\u001b[33m of\u001b[0m\u001b[33m a\u001b[0m\u001b[33m cartoon\u001b[0m\u001b[33m llama\u001b[0m\u001b[33m and\u001b[0m\u001b[33m a\u001b[0m\u001b[33m playful\u001b[0m\u001b[33m font\u001b[0m\u001b[33m suggests\u001b[0m\u001b[33m a\u001b[0m\u001b[33m l\u001b[0m\u001b[33migh\u001b[0m\u001b[33mthe\u001b[0m\u001b[33mart\u001b[0m\u001b[33med\u001b[0m\u001b[33m and\u001b[0m\u001b[33m humorous\u001b[0m\u001b[33m tone\u001b[0m\u001b[33m,\u001b[0m\u001b[33m while\u001b[0m\u001b[33m the\u001b[0m\u001b[33m mon\u001b[0m\u001b[33moch\u001b[0m\u001b[33mromatic\u001b[0m\u001b[33m color\u001b[0m\u001b[33m scheme\u001b[0m\u001b[33m gives\u001b[0m\u001b[33m the\u001b[0m\u001b[33m image\u001b[0m\u001b[33m a\u001b[0m\u001b[33m clean\u001b[0m\u001b[33m and\u001b[0m\u001b[33m modern\u001b[0m\u001b[33m feel\u001b[0m\u001b[33m.\u001b[0m\u001b[97m\u001b[0m\n"
-          ]
-        }
-      ],
+      "outputs": [],
      "source": [
        "# [Cell 5] - Initialize client and process image\n",
        "async def main():\n",
@ -184,7 +166,7 @@
    "fileUid": "37bbbfda-8e42-446c-89c7-59dd49e2d339",
    "isAdHoc": false,
    "kernelspec": {
-      "display_name": "base",
+      "display_name": "llama-stack",
      "language": "python",
      "name": "python3"
    },
@ -198,7 +180,7 @@
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
-      "version": "3.12.2"
+      "version": "3.12.11"
    }
  },
  "nbformat": 4,
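
Review note: the content format this notebook switches to (typed `image` and `text` items, with the image carried as a data URL) can be sketched as a small helper. `build_image_message` is a hypothetical name; the dictionary layout is copied from the added lines in this diff, and the helper takes raw bytes and an explicit MIME type so that it does not depend on a file on disk.

```python
import base64


def build_image_message(image_bytes: bytes, mime_type: str, prompt: str) -> dict:
    """Build a user message with one image item and one text item."""
    # Encode the raw bytes as a base64 data URL, e.g. data:image/png;base64,....
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    data_url = f"data:{mime_type};base64,{encoded}"
    return {
        "role": "user",
        "content": [
            # Layout taken from the notebook change above.
            {"type": "image", "image": {"url": {"uri": data_url}}},
            {"type": "text", "text": prompt},
        ],
    }
```

In the notebook, the bytes would come from reading the image file and the MIME type from `mimetypes.guess_type`, as `encode_image_to_data_url` already does.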
--- a/docs/zero_to_hero_guide/04_Tool_Calling101.ipynb
+++ b/docs/zero_to_hero_guide/04_Tool_Calling101.ipynb
@ -36,7 +36,7 @@
        "from dotenv import load_dotenv\n",
        "from llama_stack_client import LlamaStackClient\n",
        "from llama_stack_client.lib.agents.agent import Agent\n",
-        "from llama_stack_client.lib.agents.custom_tool import CustomTool\n",
+        "from llama_stack_client.lib.agents.client_tool import ClientTool\n",
        "from llama_stack_client.lib.agents.event_logger import EventLogger\n",
        "from llama_stack_client.types import CompletionMessage\n",
        "from llama_stack_client.types.agent_create_params import AgentConfig\n",
@ -129,7 +129,7 @@
      "source": [
        "## Step 3: Create a Custom Tool Class\n",
        "\n",
-        "Here, we defines the `WebSearchTool` class, which extends `CustomTool` to integrate the Brave Search API with Llama Stack, enabling web search capabilities within AI workflows. The class handles incoming user queries, interacts with the `BraveSearch` class for data retrieval, and formats results for effective response generation."
+        "Here, we define the `WebSearchTool` class, which extends `ClientTool` to integrate the Brave Search API with Llama Stack, enabling web search capabilities within AI workflows. The class handles incoming user queries, interacts with the `BraveSearch` class for data retrieval, and formats results for effective response generation."
      ]
    },
    {
@ -139,7 +139,7 @@
      "metadata": {},
      "outputs": [],
      "source": [
-        "class WebSearchTool(CustomTool):\n",
+        "class WebSearchTool(ClientTool):\n",
        "    def __init__(self, api_key: str):\n",
        "        self.api_key = api_key\n",
        "        self.engine = BraveSearch(api_key)\n",
--- a/docs/zero_to_hero_guide/05_Memory101.ipynb
+++ b/docs/zero_to_hero_guide/05_Memory101.ipynb
@ -4,26 +4,26 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "## Memory "
+        "## Vector Database (VectorDB) and Vector I/O (VectorIO)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "Getting Started with Memory API Tutorial 🚀\n",
-        "Welcome! This interactive tutorial will guide you through using the Memory API, a powerful tool for document storage and retrieval. Whether you're new to vector databases or an experienced developer, this notebook will help you understand the basics and get up and running quickly.\n",
+        "Getting Started with VectorDB and VectorIO APIs Tutorial 🚀\n",
+        "Welcome! This interactive tutorial will guide you through using the VectorDB and VectorIO APIs, powerful tools for document storage and retrieval. Whether you're new to vector databases or an experienced developer, this notebook will help you understand the basics and get up and running quickly.\n",
        "What you'll learn:\n",
        "\n",
-        "How to set up and configure the Memory API client\n",
-        "Creating and managing memory banks (vector stores)\n",
+        "How to set up and configure the VectorDB and VectorIO client\n",
+        "Creating and managing vector databases\n",
        "Different ways to insert documents into the system\n",
        "How to perform intelligent queries on your documents\n",
        "\n",
        "Prerequisites:\n",
        "\n",
        "Basic Python knowledge\n",
-        "A running instance of the Memory API server (we'll use localhost in \n",
+        "A running instance of the Llama Stack server (we'll use localhost in \n",
        "this tutorial)\n",
        "\n",
        "Before you begin, please ensure Llama Stack is installed and set up by following the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).\n",
@ -40,19 +40,19 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 1,
+      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "HOST = \"localhost\"  # Replace with your host\n",
        "PORT = 8321        # Replace with your port\n",
        "MODEL_NAME='meta-llama/Llama-3.2-3B-Instruct'\n",
-        "MEMORY_BANK_ID=\"tutorial_bank\""
+        "VECTOR_DB_ID=\"tutorial_db\""
      ]
    },
    {
      "cell_type": "code",
-      "execution_count": 2,
+      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
@ -71,7 +71,7 @@
        "\n",
        "First, we'll import the necessary libraries and set up some helper functions. Let's break down what each import does:\n",
        "\n",
-        "llama_stack_client: Our main interface to the Memory API\n",
+        "llama_stack_client: Our main interface to the VectorDB and VectorIO APIs\n",
        "base64: Helps us encode files for transmission\n",
        "mimetypes: Determines file types automatically\n",
        "termcolor: Makes our output prettier with colors\n",
@ -82,7 +82,7 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 3,
+      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
@ -90,10 +90,12 @@
        "import json\n",
        "import mimetypes\n",
        "import os\n",
+        "import requests\n",
        "from pathlib import Path\n",
        "\n",
        "from llama_stack_client import LlamaStackClient\n",
-        "from llama_stack_client.types.memory_insert_params import Document\n",
+        "from llama_stack_client.types import Document\n",
+        "from llama_stack_client.types.vector_io_insert_params import Chunk\n",
        "from termcolor import cprint\n",
        "\n",
        "# Helper function to convert files to data URLs\n",
@ -121,16 +123,32 @@
        "    mime_type, _ = mimetypes.guess_type(file_path)\n",
        "\n",
        "    data_url = f\"data:{mime_type};base64,{base64_content}\"\n",
-        "    return data_url"
+        "    return data_url\n",
+        "\n",
+        "# Helper function to download content from URLs\n",
+        "def download_from_url(url: str) -> str:\n",
+        "    \"\"\"Download content from a URL\n",
+        "\n",
+        "    Args:\n",
+        "        url (str): URL to download content from\n",
+        "\n",
+        "    Returns:\n",
+        "        str: Content of the URL\n",
+        "    \"\"\"\n",
+        "    response = requests.get(url)\n",
+        "    if response.status_code == 200:\n",
+        "        return response.text\n",
+        "    else:\n",
+        "        raise Exception(f\"Failed to download content from {url}: {response.status_code}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "2. **Initialize Client and Create Memory Bank**\n",
+        "2. **Initialize Client and Create Vector Database**\n",
        "\n",
-        "Now we'll set up our connection to the Memory API and create our first memory bank. A memory bank is like a specialized database that stores document embeddings for semantic search.\n",
+        "Now we'll set up our connection to the VectorDB API and create our first vector database. A vector database is a specialized database that stores document embeddings for semantic search.\n",
        "❓ Key Concepts:\n",
        "\n",
        "embedding_model: The model used to convert text into vector representations\n",
@ -142,18 +160,9 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 4,
+      "execution_count": null,
      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Available providers:\n",
-            "{'inference': [ProviderInfo(provider_id='ollama', provider_type='remote::ollama')], 'memory': [ProviderInfo(provider_id='faiss', provider_type='inline::faiss')], 'safety': [ProviderInfo(provider_id='llama-guard', provider_type='inline::llama-guard')], 'agents': [ProviderInfo(provider_id='meta-reference', provider_type='inline::meta-reference')], 'telemetry': [ProviderInfo(provider_id='meta-reference', provider_type='inline::meta-reference')]}\n"
-          ]
-        }
-      ],
+      "outputs": [],
      "source": [
        "# Initialize client\n",
        "client = LlamaStackClient(\n",
@ -163,18 +172,16 @@
        "# Let's see what providers are available\n",
        "# Providers determine where and how your data is stored\n",
        "providers = client.providers.list()\n",
-        "provider_id = providers[\"memory\"][0].provider_id\n",
+        "vector_io_providers = [p for p in providers if p.api == \"vector_io\"]\n",
+        "provider_id = vector_io_providers[0].provider_id if vector_io_providers else None\n",
        "print(\"Available providers:\")\n",
-        "#print(json.dumps(providers, indent=2))\n",
        "print(providers)\n",
-        "# Create a memory bank with optimized settings for general use\n",
-        "client.memory_banks.register(\n",
-        "    memory_bank_id=MEMORY_BANK_ID,\n",
-        "    params={\n",
-        "        \"embedding_model\": \"all-MiniLM-L6-v2\",\n",
-        "        \"chunk_size_in_tokens\": 512,\n",
-        "        \"overlap_size_in_tokens\": 64,\n",
-        "    },\n",
+        "\n",
+        "# Create a vector database with optimized settings for general use\n",
+        "client.vector_dbs.register(\n",
+        "    vector_db_id=VECTOR_DB_ID,\n",
+        "    embedding_model=\"all-MiniLM-L6-v2\",\n",
+        "    embedding_dimension=384,  # This is the dimension for all-MiniLM-L6-v2\n",
        "    provider_id=provider_id,\n",
        ")"
      ]
@ -185,7 +192,7 @@
      "source": [
        "3. **Insert Documents**\n",
        "   \n",
-        "The Memory API supports multiple ways to add documents. We'll demonstrate two common approaches:\n",
+        "The VectorIO API supports multiple ways to add documents. We'll demonstrate two common approaches:\n",
        "\n",
        "Loading documents from URLs\n",
        "Loading documents from local files\n",
@ -199,17 +206,9 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 5,
+      "execution_count": null,
      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Documents inserted successfully!\n"
-          ]
-        }
-      ],
+      "outputs": [],
      "source": [
        "# Example URLs to documentation\n",
        "# 💡 Replace these with your own URLs or use the examples\n",
@ -221,48 +220,86 @@
        "\n",
        "# Create documents from URLs\n",
        "# We add metadata to help organize our documents\n",
-        "url_documents = [\n",
-        "    Document(\n",
+        "url_documents = []\n",
+        "for i, url in enumerate(urls):\n",
+        "    full_url = f\"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}\"\n",
+        "    try:\n",
+        "        # Download content from URL\n",
+        "        content = download_from_url(full_url)\n",
+        "        # Create document with the downloaded content\n",
+        "        document = Document(\n",
        "            document_id=f\"url-doc-{i}\",  # Unique ID for each document\n",
-        "        content=f\"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}\",\n",
+        "            content=content,  # Use the actual content instead of the URL\n",
        "            mime_type=\"text/plain\",\n",
-        "        metadata={\"source\": \"url\", \"filename\": url},  # Metadata helps with organization\n",
+        "            metadata={\"source\": \"url\", \"filename\": url, \"original_url\": full_url},  # Store original URL in metadata\n",
        "        )\n",
-        "    for i, url in enumerate(urls)\n",
-        "]\n",
+        "        url_documents.append(document)\n",
+        "        print(f\"Successfully downloaded content from {url}\")\n",
+        "    except Exception as e:\n",
+        "        print(f\"Failed to download content from {url}: {e}\")\n",
        "\n",
        "# Example with local files\n",
        "# 💡 Replace these with your actual files\n",
        "local_files = [\"example.txt\", \"readme.md\"]\n",
-        "file_documents = [\n",
-        "    Document(\n",
+        "file_documents = []\n",
+        "for i, path in enumerate(local_files):\n",
+        "    if os.path.exists(path):\n",
+        "        try:\n",
+        "            # Read content from file directly instead of using data URL\n",
+        "            with open(path, \"r\", encoding=\"utf-8\") as file:\n",
+        "                content = file.read()\n",
+        "            document = Document(\n",
        "                document_id=f\"file-doc-{i}\",\n",
-        "        content=data_url_from_file(path),\n",
+        "                content=content,  # Use the actual content directly\n",
+        "                mime_type=\"text/plain\",\n",
        "                metadata={\"source\": \"local\", \"filename\": path},\n",
        "            )\n",
-        "    for i, path in enumerate(local_files)\n",
-        "    if os.path.exists(path)\n",
-        "]\n",
+        "            file_documents.append(document)\n",
+        "            print(f\"Successfully read content from {path}\")\n",
+        "        except Exception as e:\n",
+        "            print(f\"Failed to read content from {path}: {e}\")\n",
        "\n",
        "# Combine all documents\n",
        "all_documents = url_documents + file_documents\n",
        "\n",
-        "# Insert documents into memory bank\n",
-        "response = client.memory.insert(\n",
-        "    bank_id= MEMORY_BANK_ID,\n",
-        "    documents=all_documents,\n",
-        ")\n",
+        "# Create chunks from the documents\n",
+        "chunks = []\n",
+        "for doc in all_documents:\n",
+        "    # Split document content into chunks of 512 characters\n",
+        "    content = doc.content\n",
+        "    chunk_size = 512\n",
+        "    \n",
+        "    # Create chunks of the specified size\n",
+        "    for i in range(0, len(content), chunk_size):\n",
+        "        chunk_content = content[i:i+chunk_size]\n",
+        "        if chunk_content.strip():  # Only add non-empty chunks\n",
+        "            chunks.append(Chunk(\n",
+        "                content=chunk_content,\n",
+        "                metadata={\n",
+        "                    \"document_id\": doc.document_id,\n",
+        "                    \"chunk_index\": i // chunk_size,\n",
+        "                    **doc.metadata\n",
+        "                }\n",
+        "            ))\n",
        "\n",
-        "print(\"Documents inserted successfully!\")"
+        "# Insert chunks into vector database\n",
+        "if chunks:  # Only proceed if we have valid chunks\n",
+        "    client.vector_io.insert(\n",
+        "        vector_db_id=VECTOR_DB_ID,\n",
+        "        chunks=chunks,\n",
+        "    )\n",
+        "    print(f\"Documents inserted successfully! ({len(chunks)} chunks)\")\n",
+        "else:\n",
+        "    print(\"No valid documents to insert.\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "4. **Query the Memory Bank**\n",
+        "4. **Query the Vector Database**\n",
        "   \n",
-        "Now for the exciting part - querying our documents! The Memory API uses semantic search to find relevant content based on meaning, not just keywords.\n",
+        "Now for the exciting part - querying our documents! The VectorIO API uses semantic search to find relevant content based on meaning, not just keywords.\n",
        "❓ Understanding Scores:\n",
        "\n",
        "Generally, scores above 0.7 indicate strong relevance\n",
@ -271,70 +308,9 @@
    },
    {
      "cell_type": "code",
-      "execution_count": 6,
+      "execution_count": null,
      "metadata": {},
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "\n",
-            "Query: How do I use LoRA?\n",
-            "--------------------------------------------------\n",
-            "\n",
-            "Result 1 (Score: 1.166)\n",
-            "========================================\n",
-            "Chunk(content=\".md>`_ to see how they differ.\\n\\n\\n.. _glossary_peft:\\n\\nParameter Efficient Fine-Tuning (PEFT)\\n--------------------------------------\\n\\n.. _glossary_lora:\\n\\nLow Rank Adaptation (LoRA)\\n^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n\\n*What's going on here?*\\n\\nYou can read our tutorial on :ref:`finetuning Llama2 with LoRA<lora_finetune_label>` to understand how LoRA works, and how to use it.\\nSimply stated, LoRA greatly reduces the number of trainable parameters, thus saving significant gradient and optimizer\\nmemory during training.\\n\\n*Sounds great! How do I use it?*\\n\\nYou can finetune using any of our recipes with the ``lora_`` prefix, e.g. :ref:`lora_finetune_single_device<lora_finetune_recipe_label>`. These recipes utilize\\nLoRA-enabled model builders, which we support for all our models, and also use the ``lora_`` prefix, e.g.\\nthe :func:`torchtune.models.llama3.llama3` model has a corresponding :func:`torchtune.models.llama3.lora_llama3`.\\nWe aim to provide a comprehensive set of configurations to allow you to get started with training with LoRA quickly,\\njust specify any config with ``_lora`` in its name, e.g:\\n\\n.. code-block:: bash\\n\\n  tune run lora_finetune_single_device --config llama3/8B_lora_single_device\\n\\n\\nThere are two sets of parameters to customize LoRA to suit your needs. 
Firstly, the parameters which control\\nwhich linear layers LoRA should be applied to in the model:\\n\\n* ``lora_attn_modules: List[str]`` accepts a list of strings specifying which layers of the model to apply\\n  LoRA to:\\n\\n  * ``q_proj`` applies LoRA to the query projection layer.\\n  * ``k_proj`` applies LoRA to the key projection layer.\\n  * ``v_proj`` applies LoRA to the value projection layer.\\n  * ``output_proj`` applies LoRA to the attention output projection layer.\\n\\n  Whilst adding more layers to be fine-tuned may improve model accuracy,\\n  this will come at the cost of increased memory usage and reduced training speed.\\n\\n* ``apply_lora_to_mlp: Bool`` applies LoRA to the MLP in each transformer layer.\\n* ``apply_lora_to_output: Bool`` applies LoRA to the model's final output projection.\\n  This is\", document_id='url-doc-0', token_count=512)\n",
-            "========================================\n",
-            "\n",
-            "Result 2 (Score: 1.049)\n",
-            "========================================\n",
-            "Chunk(content='ora_finetune_single_device --config llama3/8B_qlora_single_device \\\\\\n  model.apply_lora_to_mlp=True \\\\\\n  model.lora_attn_modules=[\"q_proj\",\"k_proj\",\"v_proj\"] \\\\\\n  model.lora_rank=32 \\\\\\n  model.lora_alpha=64\\n\\n\\nor, by modifying a config:\\n\\n.. code-block:: yaml\\n\\n  model:\\n    _component_: torchtune.models.qlora_llama3_8b\\n    apply_lora_to_mlp: True\\n    lora_attn_modules: [\"q_proj\", \"k_proj\", \"v_proj\"]\\n    lora_rank: 32\\n    lora_alpha: 64\\n\\n.. _glossary_dora:\\n\\nWeight-Decomposed Low-Rank Adaptation (DoRA)\\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n*What\\'s going on here?*\\n\\n`DoRA <https://arxiv.org/abs/2402.09353>`_ is another PEFT technique which builds on-top of LoRA by\\nfurther decomposing the pre-trained weights into two components: magnitude and direction. The magnitude component\\nis a scalar vector that adjusts the scale, while the direction component corresponds to the original LoRA decomposition and\\nupdates the orientation of weights.\\n\\nDoRA adds a small overhead to LoRA training due to the addition of the magnitude parameter, but it has been shown to\\nimprove the performance of LoRA, particularly at low ranks.\\n\\n*Sounds great! How do I use it?*\\n\\nMuch like LoRA and QLoRA, you can finetune using DoRA with any of our LoRA recipes. We use the same model builders for LoRA\\nas we do for DoRA, so you can use the ``lora_`` version of any model builder with ``use_dora=True``. For example, to finetune\\n:func:`torchtune.models.llama3.llama3_8b` with DoRA, you would use :func:`torchtune.models.llama3.lora_llama3_8b` with ``use_dora=True``:\\n\\n.. code-block:: bash\\n\\n  tune run lora_finetune_single_device --config llama3/8B_lora_single_device \\\\\\n  model.use_dora=True\\n\\n.. code-block:: yaml\\n\\n  model:\\n    _component_: torchtune.models.lora_llama3_8b\\n    use_dora: True\\n\\nSince DoRA extends LoRA', document_id='url-doc-0', token_count=512)\n",
-            "========================================\n",
-            "\n",
-            "Result 3 (Score: 1.045)\n",
-            "========================================\n",
-            "Chunk(content='ora_finetune_single_device --config llama3/8B_lora_single_device \\\\\\n  model.use_dora=True\\n\\n.. code-block:: yaml\\n\\n  model:\\n    _component_: torchtune.models.lora_llama3_8b\\n    use_dora: True\\n\\nSince DoRA extends LoRA, the parameters for :ref:`customizing LoRA <glossary_lora>` are identical. You can also quantize the base model weights like in :ref:`glossary_qlora` by using ``quantize=True`` to reap\\neven more memory savings!\\n\\n.. code-block:: bash\\n\\n  tune run lora_finetune_single_device --config llama3/8B_lora_single_device \\\\\\n  model.apply_lora_to_mlp=True \\\\\\n  model.lora_attn_modules=[\"q_proj\",\"k_proj\",\"v_proj\"] \\\\\\n  model.lora_rank=16 \\\\\\n  model.lora_alpha=32 \\\\\\n  model.use_dora=True \\\\\\n  model.quantize_base=True\\n\\n.. code-block:: yaml\\n\\n  model:\\n    _component_: torchtune.models.lora_llama3_8b\\n    apply_lora_to_mlp: True\\n    lora_attn_modules: [\"q_proj\", \"k_proj\", \"v_proj\"]\\n    lora_rank: 16\\n    lora_alpha: 32\\n    use_dora: True\\n    quantize_base: True\\n\\n\\n.. note::\\n\\n   Under the hood, we\\'ve enabled DoRA by adding the :class:`~torchtune.modules.peft.DoRALinear` module, which we swap\\n   out for :class:`~torchtune.modules.peft.LoRALinear` when ``use_dora=True``.\\n\\n.. _glossary_distrib:\\n\\n\\n.. TODO\\n\\n.. Distributed\\n.. -----------\\n\\n.. .. _glossary_fsdp:\\n\\n.. Fully Sharded Data Parallel (FSDP)\\n.. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n.. All our ``_distributed`` recipes use `FSDP <https://pytorch.org/docs/stable/fsdp.html>`.\\n.. .. _glossary_fsdp2:\\n', document_id='url-doc-0', token_count=437)\n",
-            "========================================\n",
-            "\n",
-            "Query: Tell me about memory optimizations\n",
-            "--------------------------------------------------\n",
-            "\n",
-            "Result 1 (Score: 1.260)\n",
-            "========================================\n",
-            "Chunk(content='.. _memory_optimization_overview_label:\\n\\n============================\\nMemory Optimization Overview\\n============================\\n\\n**Author**: `Salman Mohammadi <https://github.com/SalmanMohammadi>`_\\n\\ntorchtune comes with a host of plug-and-play memory optimization components which give you lots of flexibility\\nto ``tune`` our recipes to your hardware. This page provides a brief glossary of these components and how you might use them.\\nTo make things easy, we\\'ve summarized these components in the following table:\\n\\n.. csv-table:: Memory optimization components\\n   :header: \"Component\", \"When to use?\"\\n   :widths: auto\\n\\n   \":ref:`glossary_precision`\", \"You\\'ll usually want to leave this as its default ``bfloat16``. It uses 2 bytes per model parameter instead of 4 bytes when using ``float32``.\"\\n   \":ref:`glossary_act_ckpt`\", \"Use when you\\'re memory constrained and want to use a larger model, batch size or context length. Be aware that it will slow down training speed.\"\\n   \":ref:`glossary_act_off`\", \"Similar to activation checkpointing, this can be used when memory constrained, but may decrease training speed. This **should** be used alongside activation checkpointing.\"\\n   \":ref:`glossary_grad_accm`\", \"Helpful when memory-constrained to simulate larger batch sizes. Not compatible with optimizer in backward. Use it when you can already fit at least one sample without OOMing, but not enough of them.\"\\n   \":ref:`glossary_low_precision_opt`\", \"Use when you want to reduce the size of the optimizer state. This is relevant when training large models and using optimizers with momentum, like Adam. 
Note that lower precision optimizers may reduce training stability/accuracy.\"\\n   \":ref:`glossary_opt_in_bwd`\", \"Use it when you have large gradients and can fit a large enough batch size, since this is not compatible with ``gradient_accumulation_steps``.\"\\n   \":ref:`glossary_cpu_offload`\", \"Offloads optimizer states and (optionally) gradients to CPU, and performs optimizer steps on CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough.\"\\n   \":ref:`glossary_lora`\", \"When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory', document_id='url-doc-0', token_count=512)\n",
-            "========================================\n",
-            "\n",
-            "Result 2 (Score: 1.133)\n",
-            "========================================\n",
-            "Chunk(content=' CPU. This can be used to significantly reduce GPU memory usage at the cost of CPU RAM and training speed. Prioritize using it only if the other techniques are not enough.\"\\n   \":ref:`glossary_lora`\", \"When you want to significantly reduce the number of trainable parameters, saving gradient and optimizer memory during training, and significantly speeding up training. This may reduce training accuracy\"\\n   \":ref:`glossary_qlora`\", \"When you are training a large model, since quantization will save 1.5 bytes * (# of model parameters), at the potential cost of some training speed and accuracy.\"\\n   \":ref:`glossary_dora`\", \"a variant of LoRA that may improve model performance at the cost of slightly more memory.\"\\n\\n\\n.. note::\\n\\n  In its current state, this tutorial is focused on single-device optimizations. Check in soon as we update this page\\n  for the latest memory optimization features for distributed fine-tuning.\\n\\n.. _glossary_precision:\\n\\n\\nModel Precision\\n---------------\\n\\n*What\\'s going on here?*\\n\\nWe use the term \"precision\" to refer to the underlying data type used to represent the model and optimizer parameters.\\nWe support two data types in torchtune:\\n\\n.. note::\\n\\n  We recommend diving into Sebastian Raschka\\'s `blogpost on mixed-precision techniques <https://sebastianraschka.com/blog/2023/llm-mixed-precision-copy.html>`_\\n  for a deeper understanding of concepts around precision and data formats.\\n\\n* ``fp32``, commonly referred to as \"full-precision\", uses 4 bytes per model and optimizer parameter.\\n* ``bfloat16``, referred to as \"half-precision\", uses 2 bytes per model and optimizer parameter - effectively half\\n  the memory of ``fp32``, and also improves training speed. Generally, if your hardware supports training with ``bfloat16``,\\n  we recommend using it - this is the default setting for our recipes.\\n\\n.. 
note::\\n\\n  Another common paradigm is \"mixed-precision\" training: where model weights are in ``bfloat16`` (or ``fp16``), and optimizer\\n  states are in ``fp32``. Currently, we don\\'t support mixed-precision training in torchtune.\\n\\n*Sounds great! How do I use it?*\\n\\nSimply use the ``dtype`` flag or config entry in all our recipes! For example, to use half-precision training in ``bf16``,\\nset ``dtype=bf16``.\\n\\n.. _', document_id='url-doc-0', token_count=512)\n",
-            "========================================\n",
-            "\n",
-            "Result 3 (Score: 0.854)\n",
-            "========================================\n",
-            "Chunk(content=\"_steps * num_devices``\\n\\nGradient accumulation is especially useful when you can fit at least one sample in your GPU. In this case, artificially increasing the batch by\\naccumulating gradients might give you faster training speeds than using other memory optimization techniques that trade-off memory for speed, like :ref:`activation checkpointing <glossary_act_ckpt>`.\\n\\n*Sounds great! How do I use it?*\\n\\nAll of our finetuning recipes support simulating larger batch sizes by accumulating gradients. Just set the\\n``gradient_accumulation_steps`` flag or config entry.\\n\\n.. note::\\n\\n  Gradient accumulation should always be set to 1 when :ref:`fusing the optimizer step into the backward pass <glossary_opt_in_bwd>`.\\n\\nOptimizers\\n----------\\n\\n.. _glossary_low_precision_opt:\\n\\nLower Precision Optimizers\\n^^^^^^^^^^^^^^^^^^^^^^^^^^\\n\\n*What's going on here?*\\n\\nIn addition to :ref:`reducing model and optimizer precision <glossary_precision>` during training, we can further reduce precision in our optimizer states.\\nAll of our recipes support lower-precision optimizers from the `torchao <https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim>`_ library.\\nFor single device recipes, we also support `bitsandbytes <https://huggingface.co/docs/bitsandbytes/main/en/index>`_.\\n\\nA good place to start might be the :class:`torchao.prototype.low_bit_optim.AdamW8bit` and :class:`bitsandbytes.optim.PagedAdamW8bit` optimizers.\\nBoth reduce memory by quantizing the optimizer state dict. Paged optimizers will also offload to CPU if there isn't enough GPU memory available. In practice,\\nyou can expect higher memory savings from bnb's PagedAdamW8bit but higher training speed from torchao's AdamW8bit.\\n\\n*Sounds great! How do I use it?*\\n\\nTo use this in your recipes, make sure you have installed torchao (``pip install torchao``) or bitsandbytes (``pip install bitsandbytes``). 
Then, enable\\na low precision optimizer using the :ref:`cli_label`:\\n\\n\\n.. code-block:: bash\\n\\n  tune run <RECIPE> --config <CONFIG> \\\\\\n  optimizer=torchao.prototype.low_bit_optim.AdamW8bit\\n\\n.. code-block:: bash\\n\\n  tune run <RECIPE> --config <CONFIG> \\\\\\n  optimizer=bitsand\", document_id='url-doc-0', token_count=512)\n",
-            "========================================\n",
-            "\n",
-            "Query: What are the key features of Llama 3?\n",
-            "--------------------------------------------------\n",
-            "\n",
-            "Result 1 (Score: 0.964)\n",
-            "========================================\n",
-            "Chunk(content=\"8B uses a larger intermediate dimension in its MLP layers than Llama2-7B\\n- Llama3-8B uses a higher base value to calculate theta in its `rotary positional embeddings <https://arxiv.org/abs/2104.09864>`_\\n\\n|\\n\\nGetting access to Llama3-8B-Instruct\\n------------------------------------\\n\\nFor this tutorial, we will be using the instruction-tuned version of Llama3-8B. First, let's download the model from Hugging Face. You will need to follow the instructions\\non the `official Meta page <https://github.com/meta-llama/llama3/blob/main/README.md>`_ to gain access to the model.\\nNext, make sure you grab your Hugging Face token from `here <https://huggingface.co/settings/tokens>`_.\\n\\n\\n.. code-block:: bash\\n\\n    tune download meta-llama/Meta-Llama-3-8B-Instruct \\\\\\n        --output-dir <checkpoint_dir> \\\\\\n        --hf-token <ACCESS TOKEN>\\n\\n|\\n\\nFine-tuning Llama3-8B-Instruct in torchtune\\n-------------------------------------------\\n\\ntorchtune provides `LoRA <https://arxiv.org/abs/2106.09685>`_, `QLoRA <https://arxiv.org/abs/2305.14314>`_, and full fine-tuning\\nrecipes for fine-tuning Llama3-8B on one or more GPUs. For more on LoRA in torchtune, see our :ref:`LoRA Tutorial <lora_finetune_label>`.\\nFor more on QLoRA in torchtune, see our :ref:`QLoRA Tutorial <qlora_finetune_label>`.\\n\\nLet's take a look at how we can fine-tune Llama3-8B-Instruct with LoRA on a single device using torchtune. In this example, we will fine-tune\\nfor one epoch on a common instruct dataset for illustrative purposes. The basic command for a single-device LoRA fine-tune is\\n\\n.. code-block:: bash\\n\\n    tune run lora_finetune_single_device --config llama3/8B_lora_single_device\\n\\n.. note::\\n    To see a full list of recipes and their corresponding configs, simply run ``tune ls`` from the command line.\\n\\nWe can also add :ref:`command-line overrides <cli_override>` as needed, e.g.\\n\\n.. 
code-block:: bash\\n\\n    tune run lora\", document_id='url-doc-2', token_count=512)\n",
-            "========================================\n",
-            "\n",
-            "Result 2 (Score: 0.927)\n",
-            "========================================\n",
-            "Chunk(content=\".. _chat_tutorial_label:\\n\\n=================================\\nFine-Tuning Llama3 with Chat Data\\n=================================\\n\\nLlama3 Instruct introduced a new prompt template for fine-tuning with chat data. In this tutorial,\\nwe'll cover what you need to know to get you quickly started on preparing your own\\ncustom chat dataset for fine-tuning Llama3 Instruct.\\n\\n.. grid:: 2\\n\\n    .. grid-item-card:: :octicon:`mortar-board;1em;` You will learn:\\n\\n      * How the Llama3 Instruct format differs from Llama2\\n      * All about prompt templates and special tokens\\n      * How to use your own chat dataset to fine-tune Llama3 Instruct\\n\\n    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites\\n\\n      * Be familiar with :ref:`configuring datasets<chat_dataset_usage_label>`\\n      * Know how to :ref:`download Llama3 Instruct weights <llama3_label>`\\n\\n\\nTemplate changes from Llama2 to Llama3\\n--------------------------------------\\n\\nThe Llama2 chat model requires a specific template when prompting the pre-trained\\nmodel. Since the chat model was pretrained with this prompt template, if you want to run\\ninference on the model, you'll need to use the same template for optimal performance\\non chat data. Otherwise, the model will just perform standard text completion, which\\nmay or may not align with your intended use case.\\n\\nFrom the `official Llama2 prompt\\ntemplate guide <https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-2>`_\\nfor the Llama2 chat model, we can see that special tags are added:\\n\\n.. code-block:: text\\n\\n    <s>[INST] <<SYS>>\\n    You are a helpful, respectful, and honest assistant.\\n    <</SYS>>\\n\\n    Hi! I am a human. [/INST] Hello there! Nice to meet you! 
I'm Meta AI, your friendly AI assistant </s>\\n\\nLlama3 Instruct `overhauled <https://llama.meta.com/docs/model-cards-and-prompt-formats/meta-llama-3>`_\\nthe template from Llama2 to better support multiturn conversations. The same text\\nin the Llama3 Instruct format would look like this:\\n\\n.. code-block:: text\\n\\n    <|begin_of_text|><|start_header_id|>system<|end_header_id|>\\n\\n    You are a helpful,\", document_id='url-doc-1', token_count=512)\n",
-            "========================================\n",
-            "\n",
-            "Result 3 (Score: 0.858)\n",
-            "========================================\n",
-            "Chunk(content='.. _llama3_label:\\n\\n========================\\nMeta Llama3 in torchtune\\n========================\\n\\n.. grid:: 2\\n\\n    .. grid-item-card:: :octicon:`mortar-board;1em;` You will learn how to:\\n\\n      * Download the Llama3-8B-Instruct weights and tokenizer\\n      * Fine-tune Llama3-8B-Instruct with LoRA and QLoRA\\n      * Evaluate your fine-tuned Llama3-8B-Instruct model\\n      * Generate text with your fine-tuned model\\n      * Quantize your model to speed up generation\\n\\n    .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites\\n\\n      * Be familiar with :ref:`torchtune<overview_label>`\\n      * Make sure to :ref:`install torchtune<install_label>`\\n\\n\\nLlama3-8B\\n---------\\n\\n`Meta Llama 3 <https://llama.meta.com/llama3>`_ is a new family of models released by Meta AI that improves upon the performance of the Llama2 family\\nof models across a `range of different benchmarks <https://huggingface.co/meta-llama/Meta-Llama-3-8B#base-pretrained-models>`_.\\nCurrently there are two different sizes of Meta Llama 3: 8B and 70B. In this tutorial we will focus on the 8B size model.\\nThere are a few main changes between Llama2-7B and Llama3-8B models:\\n\\n- Llama3-8B uses `grouped-query attention <https://arxiv.org/abs/2305.13245>`_ instead of the standard multi-head attention from Llama2-7B\\n- Llama3-8B has a larger vocab size (128,256 instead of 32,000 from Llama2 models)\\n- Llama3-8B uses a different tokenizer than Llama2 models (`tiktoken <https://github.com/openai/tiktoken>`_ instead of `sentencepiece <https://github.com/google/sentencepiece>`_)\\n- Llama3-8B uses a larger intermediate dimension in its MLP layers than Llama2-7B\\n- Llama3-8B uses a higher base value to calculate theta in its `rotary positional embeddings <https://arxiv.org/abs/2104.09864>`_\\n\\n|\\n\\nGetting access to Llama3', document_id='url-doc-2', token_count=512)\n",
-            "========================================\n"
-          ]
-        }
-      ],
+      "outputs": [],
      "source": [
        "def print_query_results(query: str):\n",
        "    \"\"\"Helper function to print query results in a readable format\n",
@ -344,15 +320,15 @@
        "    \"\"\"\n",
        "    print(f\"\\nQuery: {query}\")\n",
        "    print(\"-\" * 50)\n",
-        "    response = client.memory.query(\n",
-        "        bank_id= MEMORY_BANK_ID,\n",
-        "        query=[query],  # The API accepts multiple queries at once!\n",
+        "    response = client.vector_io.query(\n",
+        "        vector_db_id=VECTOR_DB_ID,\n",
+        "        query=query,\n",
        "    )\n",
        "\n",
        "    for i, (chunk, score) in enumerate(zip(response.chunks, response.scores)):\n",
        "        print(f\"\\nResult {i+1} (Score: {score:.3f})\")\n",
        "        print(\"=\" * 40)\n",
-        "        print(chunk)\n",
+        "        print(chunk.content)\n",
        "        print(\"=\" * 40)\n",
        "\n",
        "# Let's try some example queries\n",
@ -371,7 +347,7 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-        "Awesome, now we can embed all our notes with Llama-stack and ask it about the meaning of life :)\n",
+        "Awesome, now we can embed all our notes with Llama Stack using VectorDB and VectorIO, and ask it about the meaning of life :)\n",
        "\n",
        "Next up, we will learn about the safety features and how to use them: [notebook link](./06_Safety101.ipynb)."
      ]
@ -382,7 +358,7 @@
    "fileUid": "73bc3357-0e5e-42ff-95b1-40b916d24c4f",
    "isAdHoc": false,
    "kernelspec": {
-      "display_name": "Python 3 (ipykernel)",
+      "display_name": "llama-stack",
      "language": "python",
      "name": "python3"
    },
@ -396,7 +372,7 @@
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
-      "version": "3.10.15"
+      "version": "3.12.11"
    }
  },
  "nbformat": 4,
--- a/docs/zero_to_hero_guide/README.md
+++ b/docs/zero_to_hero_guide/README.md
@ -2,9 +2,9 @@

 Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Providers providing their implementations. These building blocks are assembled into Distributions which are easy for developers to get from zero to production.

-This guide will walk you through an end-to-end workflow with Llama Stack with Ollama as the inference provider and ChromaDB as the memory provider. Please note the steps for configuring your provider and distribution will vary a little depending on the services you use. However, the user experience will remain universal - this is the power of Llama-Stack.
+This guide will walk you through an end-to-end workflow with Llama Stack with Ollama as the inference provider and ChromaDB as the VectorIO provider. Please note the steps for configuring your provider and distribution will vary depending on the services you use. However, the user experience will remain universal - this is the power of Llama-Stack.

-If you're looking for more specific topics, we have a [Zero to Hero Guide](#next-steps) that covers everything from Tool Calling to Agents in detail. Feel free to skip to the end to explore the advanced topics you're interested in.
+If you're looking for more specific topics, we have a [Zero to Hero Guide](#next-steps) that covers everything from 'Tool Calling' to 'Agents' in detail. Feel free to skip to the end to explore the advanced topics you're interested in.

 > If you'd prefer not to set up a local server, explore our notebook on [tool calling with the Together API](Tool_Calling101_Using_Together_Llama_Stack_Server.ipynb). This notebook will show you how to leverage together.ai's Llama Stack Server API, allowing you to get started with Llama Stack without the need for a locally built and running server.

@ -26,15 +26,15 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
   - Follow instructions based on the OS you are on. For example, if you are on a Mac, download and unzip `Ollama-darwin.zip`.
   - Run the `Ollama` application.

-1. **Download the Ollama CLI**:
+2. **Download the Ollama CLI**:
   Ensure you have the `ollama` command line tool by downloading and installing it from the same website.

-1. **Start ollama server**:
+3. **Start ollama server**:
   Open the terminal and run:
-   ```
+   ```bash
   ollama serve
   ```
-1. **Run the model**:
+4. **Run the model**:
   Open the terminal and run:
   ```bash
   ollama run llama3.2:3b-instruct-fp16 --keepalive -1m
@ -48,9 +48,9 @@ If you're looking for more specific topics, we have a [Zero to Hero Guide](#next
 ## Install Dependencies and Set Up Environment

 1. **Create a Conda Environment**:
-   Create a new Conda environment with Python 3.10:
+   Create a new Conda environment with Python 3.12:
   ```bash
-   conda create -n ollama python=3.10
+   conda create -n ollama python=3.12
   ```
   Activate the environment:
   ```bash
--- a/llama_stack/distribution/stack.py
+++ b/llama_stack/distribution/stack.py
@ -166,20 +166,31 @@ def replace_env_vars(config: Any, path: str = "") -> Any:
            env_value = os.environ.get(env_var)

            if operator == "=":  # Default value syntax: ${env.FOO:=default}
-                if not env_value:
-                    # value_expr returns empty string (not None) when not matched
-                    # This means ${env.FOO:=} is an error
-                    if value_expr == "":
-                        raise EnvVarError(env_var, path)
-                    else:
-                        value = value_expr
-                else:
-                    value = env_value
-            elif operator == "+":  # Conditional value syntax: ${env.FOO:+value_if_set}
+                # For ${env.FOO:=default}, use the env value whenever it is set and non-empty
                if env_value:
-                    value = value_expr
+                    value = env_value
                else:
-                    # If env var is not set, return empty string for the conditional case
+                    # If the env is not set, look for a default value
+                    # value_expr is an empty string (not None) when no default was matched.
+                    # That is the ${env.FOO:=} case: it is accepted and yields an empty string, just like bash
+                    if value_expr == "":
+                        return ""
+                    else:
+                        value = value_expr
+
+            elif operator == "+":  # Conditional value syntax: ${env.FOO:+value_if_set}
+                # If the env is set like ${env.FOO:+value_if_set} then use the value_if_set
+                if env_value:
+                    if value_expr:
+                        value = value_expr
+                    # This means ${env.FOO:+}
+                    else:
+                        # Just like bash, an empty value_if_set expands to the empty string
+                        # even though the env var is set
+                        return ""
+                else:
+                    # Just like bash, the conditional form expands to an empty string
+                    # when the env var is not set
                    value = ""
            else:  # No operator case: ${env.FOO}
                if not env_value:
--- a/llama_stack/providers/inline/scoring/braintrust/config.py
+++ b/llama_stack/providers/inline/scoring/braintrust/config.py
@ -17,5 +17,5 @@ class BraintrustScoringConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, **kwargs) -> dict[str, Any]:
        return {
-            "openai_api_key": "${env.OPENAI_API_KEY:+}",
+            "openai_api_key": "${env.OPENAI_API_KEY:=}",
        }
--- a/llama_stack/providers/inline/vector_io/milvus/config.py
+++ b/llama_stack/providers/inline/vector_io/milvus/config.py
@ -23,9 +23,9 @@ class MilvusVectorIOConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, __distro_dir__: str, **kwargs: Any) -> dict[str, Any]:
        return {
-            "db_path": f"${{env.MILVUS_DB_PATH:={__distro_dir__}/milvus.db}}",
+            "db_path": "${env.MILVUS_DB_PATH:=" + __distro_dir__ + "}/" + "milvus.db",
            "kvstore": SqliteKVStoreConfig.sample_run_config(
                __distro_dir__=__distro_dir__,
-                db_name=f"${{env.MILVUS_KVSTORE_DB_PATH:={__distro_dir__}/milvus_registry.db}}",
+                db_name="milvus_registry.db",
            ),
        }
--- a/llama_stack/providers/registry/vector_io.py
+++ b/llama_stack/providers/registry/vector_io.py
@ -520,7 +520,7 @@ Please refer to the inline provider documentation.
            Api.vector_io,
            AdapterSpec(
                adapter_type="milvus",
-                pip_packages=["pymilvus"],
+                pip_packages=["pymilvus[marshmallow<3.13.0]"],
                module="llama_stack.providers.remote.vector_io.milvus",
                config_class="llama_stack.providers.remote.vector_io.milvus.MilvusVectorIOConfig",
                description="""
--- a/llama_stack/providers/remote/datasetio/nvidia/config.py
+++ b/llama_stack/providers/remote/datasetio/nvidia/config.py
@ -54,7 +54,7 @@ class NvidiaDatasetIOConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, **kwargs) -> dict[str, Any]:
        return {
-            "api_key": "${env.NVIDIA_API_KEY:+}",
+            "api_key": "${env.NVIDIA_API_KEY:=}",
            "dataset_namespace": "${env.NVIDIA_DATASET_NAMESPACE:=default}",
            "project_id": "${env.NVIDIA_PROJECT_ID:=test-project}",
            "datasets_url": "${env.NVIDIA_DATASETS_URL:=http://nemo.test}",
--- a/llama_stack/providers/remote/inference/nvidia/config.py
+++ b/llama_stack/providers/remote/inference/nvidia/config.py
@ -40,7 +40,7 @@ class NVIDIAConfig(BaseModel):
        description="A base url for accessing the NVIDIA NIM",
    )
    api_key: SecretStr | None = Field(
-        default_factory=lambda: os.getenv("NVIDIA_API_KEY"),
+        default_factory=lambda: SecretStr(os.getenv("NVIDIA_API_KEY")),
         description="The NVIDIA API key, only needed if using the hosted service",
    )
    timeout: int = Field(
@ -53,9 +53,15 @@ class NVIDIAConfig(BaseModel):
    )

    @classmethod
-    def sample_run_config(cls, **kwargs) -> dict[str, Any]:
+    def sample_run_config(
+        cls,
+        url: str = "${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}",
+        api_key: str = "${env.NVIDIA_API_KEY:=}",
+        append_api_version: bool = "${env.NVIDIA_APPEND_API_VERSION:=True}",
+        **kwargs,
+    ) -> dict[str, Any]:
        return {
-            "url": "${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}",
-            "api_key": "${env.NVIDIA_API_KEY:+}",
-            "append_api_version": "${env.NVIDIA_APPEND_API_VERSION:=True}",
+            "url": url,
+            "api_key": api_key,
+            "append_api_version": append_api_version,
        }
--- a/llama_stack/providers/remote/inference/ollama/ollama.py
+++ b/llama_stack/providers/remote/inference/ollama/ollama.py
@ -5,6 +5,7 @@
 # the root directory of this source tree.


+import base64
 import uuid
 from collections.abc import AsyncGenerator, AsyncIterator
 from typing import Any
@ -77,6 +78,7 @@ from llama_stack.providers.utils.inference.prompt_adapter import (
    content_has_media,
    convert_image_content_to_url,
    interleaved_content_as_str,
+    localize_image_content,
    request_has_media,
 )

@ -496,6 +498,21 @@ class OllamaInferenceAdapter(
        user: str | None = None,
    ) -> OpenAIChatCompletion | AsyncIterator[OpenAIChatCompletionChunk]:
        model_obj = await self._get_model(model)
+
+        # Ollama does not support image urls, so we need to download the image and convert it to base64
+        async def _convert_message(m: OpenAIMessageParam) -> OpenAIMessageParam:
+            if isinstance(m.content, list):
+                for c in m.content:
+                    if c.type == "image_url" and c.image_url and c.image_url.url:
+                        localize_result = await localize_image_content(c.image_url.url)
+                        if localize_result is None:
+                            raise ValueError(f"Failed to localize image content from {c.image_url.url}")
+
+                        content, format = localize_result
+                        c.image_url.url = f"data:image/{format};base64,{base64.b64encode(content).decode('utf-8')}"
+            return m
+
+        messages = [await _convert_message(m) for m in messages]
        params = await prepare_openai_completion_params(
            model=model_obj.provider_resource_id,
            messages=messages,
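
The `_convert_message` helper above rewrites remote image URLs as base64 data URLs before the request reaches Ollama. A minimal sketch of just the encoding step, assuming the image bytes and format have already been fetched (`to_data_url` is a hypothetical name, not a function in the codebase):

```python
import base64


def to_data_url(content: bytes, image_format: str) -> str:
    """Build a base64 data URL from raw image bytes (illustrative helper)."""
    encoded = base64.b64encode(content).decode("utf-8")
    return f"data:image/{image_format};base64,{encoded}"


# Example with placeholder bytes rather than a real image.
example = to_data_url(b"hi", "png")
```

The download itself is left out here; in the diff it is handled by `localize_image_content`, which returns `None` for URIs that do not start with `http`.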
--- a/llama_stack/providers/remote/inference/runpod/config.py
+++ b/llama_stack/providers/remote/inference/runpod/config.py
@ -25,6 +25,6 @@ class RunpodImplConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, **kwargs: Any) -> dict[str, Any]:
        return {
-            "url": "${env.RUNPOD_URL:+}",
-            "api_token": "${env.RUNPOD_API_TOKEN:+}",
+            "url": "${env.RUNPOD_URL:=}",
+            "api_token": "${env.RUNPOD_API_TOKEN:=}",
        }
--- a/llama_stack/providers/remote/inference/together/config.py
+++ b/llama_stack/providers/remote/inference/together/config.py
@ -26,5 +26,5 @@ class TogetherImplConfig(BaseModel):
    def sample_run_config(cls, **kwargs) -> dict[str, Any]:
        return {
            "url": "https://api.together.xyz/v1",
-            "api_key": "${env.TOGETHER_API_KEY:+}",
+            "api_key": "${env.TOGETHER_API_KEY:=}",
        }
--- a/llama_stack/providers/remote/inference/watsonx/config.py
+++ b/llama_stack/providers/remote/inference/watsonx/config.py
@ -41,6 +41,6 @@ class WatsonXConfig(BaseModel):
    def sample_run_config(cls, **kwargs) -> dict[str, Any]:
        return {
            "url": "${env.WATSONX_BASE_URL:=https://us-south.ml.cloud.ibm.com}",
-            "api_key": "${env.WATSONX_API_KEY:+}",
-            "project_id": "${env.WATSONX_PROJECT_ID:+}",
+            "api_key": "${env.WATSONX_API_KEY:=}",
+            "project_id": "${env.WATSONX_PROJECT_ID:=}",
        }
--- a/llama_stack/providers/remote/post_training/nvidia/config.py
+++ b/llama_stack/providers/remote/post_training/nvidia/config.py
@ -55,7 +55,7 @@ class NvidiaPostTrainingConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, **kwargs) -> dict[str, Any]:
        return {
-            "api_key": "${env.NVIDIA_API_KEY:+}",
+            "api_key": "${env.NVIDIA_API_KEY:=}",
            "dataset_namespace": "${env.NVIDIA_DATASET_NAMESPACE:=default}",
            "project_id": "${env.NVIDIA_PROJECT_ID:=test-project}",
            "customizer_url": "${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}",
--- a/llama_stack/providers/remote/tool_runtime/brave_search/config.py
+++ b/llama_stack/providers/remote/tool_runtime/brave_search/config.py
@ -22,6 +22,6 @@ class BraveSearchToolConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, __distro_dir__: str) -> dict[str, Any]:
        return {
-            "api_key": "${env.BRAVE_SEARCH_API_KEY:+}",
+            "api_key": "${env.BRAVE_SEARCH_API_KEY:=}",
            "max_results": 3,
        }
--- a/llama_stack/providers/remote/tool_runtime/tavily_search/config.py
+++ b/llama_stack/providers/remote/tool_runtime/tavily_search/config.py
@ -22,6 +22,6 @@ class TavilySearchToolConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, __distro_dir__: str) -> dict[str, Any]:
        return {
-            "api_key": "${env.TAVILY_SEARCH_API_KEY:+}",
+            "api_key": "${env.TAVILY_SEARCH_API_KEY:=}",
            "max_results": 3,
        }
--- a/llama_stack/providers/remote/tool_runtime/wolfram_alpha/config.py
+++ b/llama_stack/providers/remote/tool_runtime/wolfram_alpha/config.py
@ -17,5 +17,5 @@ class WolframAlphaToolConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, __distro_dir__: str, **kwargs: Any) -> dict[str, Any]:
        return {
-            "api_key": "${env.WOLFRAM_ALPHA_API_KEY:+}",
+            "api_key": "${env.WOLFRAM_ALPHA_API_KEY:=}",
        }
--- a/llama_stack/providers/utils/inference/prompt_adapter.py
+++ b/llama_stack/providers/utils/inference/prompt_adapter.py
@ -180,11 +180,10 @@ def request_has_media(request: ChatCompletionRequest | CompletionRequest):
        return content_has_media(request.content)


-async def localize_image_content(media: ImageContentItem) -> tuple[bytes, str]:
-    image = media.image
-    if image.url and image.url.uri.startswith("http"):
+async def localize_image_content(uri: str) -> tuple[bytes, str] | None:
+    if uri.startswith("http"):
        async with httpx.AsyncClient() as client:
-            r = await client.get(image.url.uri)
+            r = await client.get(uri)
            content = r.content
            content_type = r.headers.get("content-type")
            if content_type:
@ -194,11 +193,7 @@ async def localize_image_content(media: ImageContentItem) -> tuple[bytes, str]:

        return content, format
    else:
-        # data is a base64 encoded string, decode it to bytes first
-        # TODO(mf): do this more efficiently, decode less
-        data_bytes = base64.b64decode(image.data)
-        pil_image = PIL_Image.open(io.BytesIO(data_bytes))
-        return data_bytes, pil_image.format
+        return None


 async def convert_image_content_to_url(
@ -208,7 +203,18 @@ async def convert_image_content_to_url(
    if image.url and (not download or image.url.uri.startswith("data")):
        return image.url.uri

-    content, format = await localize_image_content(media)
+    if image.data:
+        # data is a base64 encoded string, decode it to bytes first
+        # TODO(mf): do this more efficiently, decode less
+        content = base64.b64decode(image.data)
+        pil_image = PIL_Image.open(io.BytesIO(content))
+        format = pil_image.format
+    else:
+        localize_result = await localize_image_content(image.url.uri)
+        if localize_result is None:
+            raise ValueError(f"Failed to localize image content from {image.url.uri}")
+        content, format = localize_result
+
    if include_format:
        return f"data:image/{format};base64," + base64.b64encode(content).decode("utf-8")
    else:
--- a/llama_stack/providers/utils/vector_io/chunk_utils.py
+++ b/llama_stack/providers/utils/vector_io/chunk_utils.py
@ -9,6 +9,11 @@ import uuid


 def generate_chunk_id(document_id: str, chunk_text: str) -> str:
-    """Generate a unique chunk ID using a hash of document ID and chunk text."""
+    """
+    Generate a unique chunk ID using a hash of the document ID and chunk text.
+
+    Note: MD5 is used only to calculate an identifier, not for security purposes.
+    Adding usedforsecurity=False for compatibility with FIPS environments.
+    """
    hash_input = f"{document_id}:{chunk_text}".encode()
-    return str(uuid.UUID(hashlib.md5(hash_input).hexdigest()))
+    return str(uuid.UUID(hashlib.md5(hash_input, usedforsecurity=False).hexdigest()))
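
To illustrate why a non-security hash is enough here, the following sketch mirrors the patched function under a different name (`chunk_id_sketch` is used only for this example) and checks that the ID is deterministic for identical inputs. It assumes Python 3.9 or later, where `hashlib.md5` accepts `usedforsecurity`.

```python
import hashlib
import uuid


def chunk_id_sketch(document_id: str, chunk_text: str) -> str:
    """Derive a stable UUID-formatted ID from the document ID and chunk text."""
    hash_input = f"{document_id}:{chunk_text}".encode()
    # MD5 serves only as a fingerprint; usedforsecurity=False signals that intent,
    # which matters on FIPS-restricted builds.
    digest = hashlib.md5(hash_input, usedforsecurity=False).hexdigest()
    return str(uuid.UUID(digest))


first = chunk_id_sketch("doc-1", "hello world")
second = chunk_id_sketch("doc-1", "hello world")
other = chunk_id_sketch("doc-2", "hello world")
```

Here `first` and `second` are equal while `other` differs, so re-ingesting the same chunk of the same document maps to the same ID.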
--- a/llama_stack/templates/bedrock/run.yaml
+++ b/llama_stack/templates/bedrock/run.yaml
@ -78,17 +78,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/cerebras/run.yaml
+++ b/llama_stack/templates/cerebras/run.yaml
@ -77,7 +77,7 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
@ -89,12 +89,12 @@ providers:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/ci-tests/run.yaml
+++ b/llama_stack/templates/ci-tests/run.yaml
@ -81,17 +81,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/dell/run-with-safety.yaml
+++ b/llama_stack/templates/dell/run-with-safety.yaml
@ -84,17 +84,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/dell/run.yaml
+++ b/llama_stack/templates/dell/run.yaml
@ -80,17 +80,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/fireworks/run-with-safety.yaml
+++ b/llama_stack/templates/fireworks/run-with-safety.yaml
@ -90,7 +90,7 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  files:
  - provider_id: meta-reference-files
    provider_type: inline::localfs
@ -103,17 +103,17 @@ providers:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
--- a/llama_stack/templates/fireworks/run.yaml
+++ b/llama_stack/templates/fireworks/run.yaml
@ -85,7 +85,7 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  files:
  - provider_id: meta-reference-files
    provider_type: inline::localfs
@ -98,17 +98,17 @@ providers:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
--- a/llama_stack/templates/groq/run.yaml
+++ b/llama_stack/templates/groq/run.yaml
@ -84,17 +84,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/hf-endpoint/run-with-safety.yaml
+++ b/llama_stack/templates/hf-endpoint/run-with-safety.yaml
@ -89,17 +89,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/hf-endpoint/run.yaml
+++ b/llama_stack/templates/hf-endpoint/run.yaml
@ -84,17 +84,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/hf-serverless/run-with-safety.yaml
+++ b/llama_stack/templates/hf-serverless/run-with-safety.yaml
@ -89,17 +89,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/hf-serverless/run.yaml
+++ b/llama_stack/templates/hf-serverless/run.yaml
@ -84,17 +84,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/llama_api/llama_api.py
+++ b/llama_stack/templates/llama_api/llama_api.py
@ -41,7 +41,7 @@ def get_inference_providers() -> tuple[list[Provider], list[ModelInput]]:
        (
            "llama-openai-compat",
            LLLAMA_MODEL_ENTRIES,
-            LlamaCompatConfig.sample_run_config(api_key="${env.LLAMA_API_KEY:+}"),
+            LlamaCompatConfig.sample_run_config(api_key="${env.LLAMA_API_KEY:=}"),
        ),
    ]
    inference_providers = []
@ -87,15 +87,15 @@ def get_distribution_template() -> DistributionTemplate:
        Provider(
            provider_id="${env.ENABLE_CHROMADB:+chromadb}",
            provider_type="remote::chromadb",
-            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:+}"),
+            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:=}"),
        ),
        Provider(
            provider_id="${env.ENABLE_PGVECTOR:+pgvector}",
            provider_type="remote::pgvector",
            config=PGVectorVectorIOConfig.sample_run_config(
-                db="${env.PGVECTOR_DB:+}",
-                user="${env.PGVECTOR_USER:+}",
-                password="${env.PGVECTOR_PASSWORD:+}",
+                db="${env.PGVECTOR_DB:=}",
+                user="${env.PGVECTOR_USER:=}",
+                password="${env.PGVECTOR_PASSWORD:=}",
            ),
        ),
    ]
--- a/llama_stack/templates/llama_api/run.yaml
+++ b/llama_stack/templates/llama_api/run.yaml
@ -16,7 +16,7 @@ providers:
    provider_type: remote::llama-openai-compat
    config:
      openai_compat_api_base: https://api.llama.com/compat/v1/
-      api_key: ${env.LLAMA_API_KEY:+}
+      api_key: ${env.LLAMA_API_KEY:=}
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config: {}
@ -28,15 +28,15 @@ providers:
  - provider_id: ${env.ENABLE_CHROMADB:+chromadb}
    provider_type: remote::chromadb
    config:
-      url: ${env.CHROMADB_URL:+}
+      url: ${env.CHROMADB_URL:=}
  - provider_id: ${env.ENABLE_PGVECTOR:+pgvector}
    provider_type: remote::pgvector
    config:
      host: ${env.PGVECTOR_HOST:=localhost}
      port: ${env.PGVECTOR_PORT:=5432}
-      db: ${env.PGVECTOR_DB:+}
-      user: ${env.PGVECTOR_USER:+}
-      password: ${env.PGVECTOR_PASSWORD:+}
+      db: ${env.PGVECTOR_DB:=}
+      user: ${env.PGVECTOR_USER:=}
+      password: ${env.PGVECTOR_PASSWORD:=}
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
@ -93,17 +93,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/meta-reference-gpu/run-with-safety.yaml
+++ b/llama_stack/templates/meta-reference-gpu/run-with-safety.yaml
@ -99,17 +99,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/meta-reference-gpu/run.yaml
+++ b/llama_stack/templates/meta-reference-gpu/run.yaml
@ -89,17 +89,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/nvidia/run-with-safety.yaml
+++ b/llama_stack/templates/nvidia/run-with-safety.yaml
@ -17,7 +17,7 @@ providers:
    provider_type: remote::nvidia
    config:
      url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}
-      api_key: ${env.NVIDIA_API_KEY:+}
+      api_key: ${env.NVIDIA_API_KEY:=}
      append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True}
  - provider_id: nvidia
    provider_type: remote::nvidia
@ -65,7 +65,7 @@ providers:
  - provider_id: nvidia
    provider_type: remote::nvidia
    config:
-      api_key: ${env.NVIDIA_API_KEY:+}
+      api_key: ${env.NVIDIA_API_KEY:=}
      dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
      project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
      customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
@ -80,7 +80,7 @@ providers:
  - provider_id: nvidia
    provider_type: remote::nvidia
    config:
-      api_key: ${env.NVIDIA_API_KEY:+}
+      api_key: ${env.NVIDIA_API_KEY:=}
      dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
      project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
      datasets_url: ${env.NVIDIA_DATASETS_URL:=http://nemo.test}
--- a/llama_stack/templates/nvidia/run.yaml
+++ b/llama_stack/templates/nvidia/run.yaml
@ -17,7 +17,7 @@ providers:
    provider_type: remote::nvidia
    config:
      url: ${env.NVIDIA_BASE_URL:=https://integrate.api.nvidia.com}
-      api_key: ${env.NVIDIA_API_KEY:+}
+      api_key: ${env.NVIDIA_API_KEY:=}
      append_api_version: ${env.NVIDIA_APPEND_API_VERSION:=True}
  vector_io:
  - provider_id: faiss
@ -60,7 +60,7 @@ providers:
  - provider_id: nvidia
    provider_type: remote::nvidia
    config:
-      api_key: ${env.NVIDIA_API_KEY:+}
+      api_key: ${env.NVIDIA_API_KEY:=}
      dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
      project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
      customizer_url: ${env.NVIDIA_CUSTOMIZER_URL:=http://nemo.test}
@ -68,7 +68,7 @@ providers:
  - provider_id: nvidia
    provider_type: remote::nvidia
    config:
-      api_key: ${env.NVIDIA_API_KEY:+}
+      api_key: ${env.NVIDIA_API_KEY:=}
      dataset_namespace: ${env.NVIDIA_DATASET_NAMESPACE:=default}
      project_id: ${env.NVIDIA_PROJECT_ID:=test-project}
      datasets_url: ${env.NVIDIA_DATASETS_URL:=http://nemo.test}
--- a/llama_stack/templates/ollama/run-with-safety.yaml
+++ b/llama_stack/templates/ollama/run-with-safety.yaml
@ -85,7 +85,7 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  files:
  - provider_id: meta-reference-files
    provider_type: inline::localfs
@ -105,12 +105,12 @@ providers:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
@ -121,7 +121,7 @@ providers:
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
 metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/registry.db
--- a/llama_stack/templates/ollama/run.yaml
+++ b/llama_stack/templates/ollama/run.yaml
@ -83,7 +83,7 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  files:
  - provider_id: meta-reference-files
    provider_type: inline::localfs
@ -103,12 +103,12 @@ providers:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
@ -119,7 +119,7 @@ providers:
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
 metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/ollama}/registry.db
--- a/llama_stack/templates/open-benchmark/open_benchmark.py
+++ b/llama_stack/templates/open-benchmark/open_benchmark.py
@ -46,7 +46,7 @@ def get_inference_providers() -> tuple[list[Provider], dict[str, list[ProviderMo
                    model_type=ModelType.llm,
                )
            ],
-            OpenAIConfig.sample_run_config(api_key="${env.OPENAI_API_KEY:+}"),
+            OpenAIConfig.sample_run_config(api_key="${env.OPENAI_API_KEY:=}"),
        ),
        (
            "anthropic",
@ -56,7 +56,7 @@ def get_inference_providers() -> tuple[list[Provider], dict[str, list[ProviderMo
                    model_type=ModelType.llm,
                )
            ],
-            AnthropicConfig.sample_run_config(api_key="${env.ANTHROPIC_API_KEY:+}"),
+            AnthropicConfig.sample_run_config(api_key="${env.ANTHROPIC_API_KEY:=}"),
        ),
        (
            "gemini",
@ -66,17 +66,17 @@ def get_inference_providers() -> tuple[list[Provider], dict[str, list[ProviderMo
                    model_type=ModelType.llm,
                )
            ],
-            GeminiConfig.sample_run_config(api_key="${env.GEMINI_API_KEY:+}"),
+            GeminiConfig.sample_run_config(api_key="${env.GEMINI_API_KEY:=}"),
        ),
        (
            "groq",
            [],
-            GroqConfig.sample_run_config(api_key="${env.GROQ_API_KEY:+}"),
+            GroqConfig.sample_run_config(api_key="${env.GROQ_API_KEY:=}"),
        ),
        (
            "together",
            [],
-            TogetherImplConfig.sample_run_config(api_key="${env.TOGETHER_API_KEY:+}"),
+            TogetherImplConfig.sample_run_config(api_key="${env.TOGETHER_API_KEY:=}"),
        ),
    ]
    inference_providers = []
@ -122,15 +122,15 @@ def get_distribution_template() -> DistributionTemplate:
        Provider(
            provider_id="${env.ENABLE_CHROMADB:+chromadb}",
            provider_type="remote::chromadb",
-            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:+}"),
+            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:=}"),
        ),
        Provider(
            provider_id="${env.ENABLE_PGVECTOR:+pgvector}",
            provider_type="remote::pgvector",
            config=PGVectorVectorIOConfig.sample_run_config(
-                db="${env.PGVECTOR_DB:+}",
-                user="${env.PGVECTOR_USER:+}",
-                password="${env.PGVECTOR_PASSWORD:+}",
+                db="${env.PGVECTOR_DB:=}",
+                user="${env.PGVECTOR_USER:=}",
+                password="${env.PGVECTOR_PASSWORD:=}",
            ),
        ),
    ]
--- a/llama_stack/templates/open-benchmark/run.yaml
+++ b/llama_stack/templates/open-benchmark/run.yaml
@ -15,25 +15,25 @@ providers:
  - provider_id: openai
    provider_type: remote::openai
    config:
-      api_key: ${env.OPENAI_API_KEY:+}
+      api_key: ${env.OPENAI_API_KEY:=}
  - provider_id: anthropic
    provider_type: remote::anthropic
    config:
-      api_key: ${env.ANTHROPIC_API_KEY:+}
+      api_key: ${env.ANTHROPIC_API_KEY:=}
  - provider_id: gemini
    provider_type: remote::gemini
    config:
-      api_key: ${env.GEMINI_API_KEY:+}
+      api_key: ${env.GEMINI_API_KEY:=}
  - provider_id: groq
    provider_type: remote::groq
    config:
      url: https://api.groq.com
-      api_key: ${env.GROQ_API_KEY:+}
+      api_key: ${env.GROQ_API_KEY:=}
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
-      api_key: ${env.TOGETHER_API_KEY:+}
+      api_key: ${env.TOGETHER_API_KEY:=}
  vector_io:
  - provider_id: sqlite-vec
    provider_type: inline::sqlite-vec
@ -42,15 +42,15 @@ providers:
  - provider_id: ${env.ENABLE_CHROMADB:+chromadb}
    provider_type: remote::chromadb
    config:
-      url: ${env.CHROMADB_URL:+}
+      url: ${env.CHROMADB_URL:=}
  - provider_id: ${env.ENABLE_PGVECTOR:+pgvector}
    provider_type: remote::pgvector
    config:
      host: ${env.PGVECTOR_HOST:=localhost}
      port: ${env.PGVECTOR_PORT:=5432}
-      db: ${env.PGVECTOR_DB:+}
-      user: ${env.PGVECTOR_USER:+}
-      password: ${env.PGVECTOR_PASSWORD:+}
+      db: ${env.PGVECTOR_DB:=}
+      user: ${env.PGVECTOR_USER:=}
+      password: ${env.PGVECTOR_PASSWORD:=}
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
@ -107,17 +107,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/passthrough/run-with-safety.yaml
+++ b/llama_stack/templates/passthrough/run-with-safety.yaml
@ -89,22 +89,22 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
--- a/llama_stack/templates/passthrough/run.yaml
+++ b/llama_stack/templates/passthrough/run.yaml
@ -84,22 +84,22 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
--- a/llama_stack/templates/postgres-demo/postgres_demo.py
+++ b/llama_stack/templates/postgres-demo/postgres_demo.py
@ -52,7 +52,7 @@ def get_distribution_template() -> DistributionTemplate:
        Provider(
            provider_id="${env.ENABLE_CHROMADB:+chromadb}",
            provider_type="remote::chromadb",
-            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:+}"),
+            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:=}"),
        ),
    ]
    default_tool_groups = [
@ -114,7 +114,7 @@ def get_distribution_template() -> DistributionTemplate:
                            provider_id="meta-reference",
                            provider_type="inline::meta-reference",
                            config=dict(
-                                service_name="${env.OTEL_SERVICE_NAME:+}",
+                                service_name="${env.OTEL_SERVICE_NAME:=}",
                                sinks="${env.TELEMETRY_SINKS:=console,otel_trace}",
                                otel_trace_endpoint="${env.OTEL_TRACE_ENDPOINT:=http://localhost:4318/v1/traces}",
                            ),
--- a/llama_stack/templates/postgres-demo/run.yaml
+++ b/llama_stack/templates/postgres-demo/run.yaml
@ -23,7 +23,7 @@ providers:
  - provider_id: ${env.ENABLE_CHROMADB:+chromadb}
    provider_type: remote::chromadb
    config:
-      url: ${env.CHROMADB_URL:+}
+      url: ${env.CHROMADB_URL:=}
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
@ -51,19 +51,19 @@ providers:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
-      service_name: ${env.OTEL_SERVICE_NAME:+}
+      service_name: ${env.OTEL_SERVICE_NAME:=}
      sinks: ${env.TELEMETRY_SINKS:=console,otel_trace}
      otel_trace_endpoint: ${env.OTEL_TRACE_ENDPOINT:=http://localhost:4318/v1/traces}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/remote-vllm/run-with-safety.yaml
+++ b/llama_stack/templates/remote-vllm/run-with-safety.yaml
@ -86,7 +86,7 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
@ -98,12 +98,12 @@ providers:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
@ -114,7 +114,7 @@ providers:
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
 metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/remote-vllm}/registry.db
--- a/llama_stack/templates/remote-vllm/run.yaml
+++ b/llama_stack/templates/remote-vllm/run.yaml
@ -79,7 +79,7 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
@ -91,12 +91,12 @@ providers:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
@ -107,7 +107,7 @@ providers:
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
 metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/remote-vllm}/registry.db
--- a/llama_stack/templates/sambanova/run.yaml
+++ b/llama_stack/templates/sambanova/run.yaml
@ -28,15 +28,15 @@ providers:
  - provider_id: ${env.ENABLE_CHROMADB:+chromadb}
    provider_type: remote::chromadb
    config:
-      url: ${env.CHROMADB_URL:+}
+      url: ${env.CHROMADB_URL:=}
  - provider_id: ${env.ENABLE_PGVECTOR:+pgvector}
    provider_type: remote::pgvector
    config:
      host: ${env.PGVECTOR_HOST:=localhost}
      port: ${env.PGVECTOR_PORT:=5432}
-      db: ${env.PGVECTOR_DB:+}
-      user: ${env.PGVECTOR_USER:+}
-      password: ${env.PGVECTOR_PASSWORD:+}
+      db: ${env.PGVECTOR_DB:=}
+      user: ${env.PGVECTOR_USER:=}
+      password: ${env.PGVECTOR_PASSWORD:=}
  safety:
  - provider_id: sambanova
    provider_type: remote::sambanova
@ -65,12 +65,12 @@ providers:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
@ -81,7 +81,7 @@ providers:
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
 metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/sambanova}/registry.db
--- a/llama_stack/templates/sambanova/sambanova.py
+++ b/llama_stack/templates/sambanova/sambanova.py
@ -75,15 +75,15 @@ def get_distribution_template() -> DistributionTemplate:
        Provider(
            provider_id="${env.ENABLE_CHROMADB:+chromadb}",
            provider_type="remote::chromadb",
-            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:+}"),
+            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:=}"),
        ),
        Provider(
            provider_id="${env.ENABLE_PGVECTOR:+pgvector}",
            provider_type="remote::pgvector",
            config=PGVectorVectorIOConfig.sample_run_config(
-                db="${env.PGVECTOR_DB:+}",
-                user="${env.PGVECTOR_USER:+}",
-                password="${env.PGVECTOR_PASSWORD:+}",
+                db="${env.PGVECTOR_DB:=}",
+                user="${env.PGVECTOR_USER:=}",
+                password="${env.PGVECTOR_PASSWORD:=}",
            ),
        ),
    ]
--- a/llama_stack/templates/starter/run.yaml
+++ b/llama_stack/templates/starter/run.yaml
@ -16,17 +16,17 @@ providers:
  - provider_id: openai
    provider_type: remote::openai
    config:
-      api_key: ${env.OPENAI_API_KEY:+}
+      api_key: ${env.OPENAI_API_KEY:=}
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference/v1
-      api_key: ${env.FIREWORKS_API_KEY:+}
+      api_key: ${env.FIREWORKS_API_KEY:=}
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
-      api_key: ${env.TOGETHER_API_KEY:+}
+      api_key: ${env.TOGETHER_API_KEY:=}
  - provider_id: ollama
    provider_type: remote::ollama
    config:
@ -35,21 +35,21 @@ providers:
  - provider_id: anthropic
    provider_type: remote::anthropic
    config:
-      api_key: ${env.ANTHROPIC_API_KEY:+}
+      api_key: ${env.ANTHROPIC_API_KEY:=}
  - provider_id: gemini
    provider_type: remote::gemini
    config:
-      api_key: ${env.GEMINI_API_KEY:+}
+      api_key: ${env.GEMINI_API_KEY:=}
  - provider_id: groq
    provider_type: remote::groq
    config:
      url: https://api.groq.com
-      api_key: ${env.GROQ_API_KEY:+}
+      api_key: ${env.GROQ_API_KEY:=}
  - provider_id: sambanova
    provider_type: remote::sambanova
    config:
      url: https://api.sambanova.ai/v1
-      api_key: ${env.SAMBANOVA_API_KEY:+}
+      api_key: ${env.SAMBANOVA_API_KEY:=}
  - provider_id: vllm
    provider_type: remote::vllm
    config:
@ -75,23 +75,23 @@ providers:
  - provider_id: ${env.ENABLE_MILVUS:+milvus}
    provider_type: inline::milvus
    config:
-      db_path: ${env.MILVUS_DB_PATH:=~/.llama/distributions/starter/milvus.db}
+      db_path: ${env.MILVUS_DB_PATH:=~/.llama/distributions/starter}/milvus.db
      kvstore:
        type: sqlite
        namespace: null
-        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/starter}/${env.MILVUS_KVSTORE_DB_PATH:=~/.llama/distributions/starter/milvus_registry.db}
+        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/starter}/milvus_registry.db
  - provider_id: ${env.ENABLE_CHROMADB:+chromadb}
    provider_type: remote::chromadb
    config:
-      url: ${env.CHROMADB_URL:+}
+      url: ${env.CHROMADB_URL:=}
  - provider_id: ${env.ENABLE_PGVECTOR:+pgvector}
    provider_type: remote::pgvector
    config:
      host: ${env.PGVECTOR_HOST:=localhost}
      port: ${env.PGVECTOR_PORT:=5432}
-      db: ${env.PGVECTOR_DB:+}
-      user: ${env.PGVECTOR_USER:+}
-      password: ${env.PGVECTOR_PASSWORD:+}
+      db: ${env.PGVECTOR_DB:=}
+      user: ${env.PGVECTOR_USER:=}
+      password: ${env.PGVECTOR_PASSWORD:=}
  files:
  - provider_id: meta-reference-files
    provider_type: inline::localfs
@ -156,17 +156,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/starter/starter.py
+++ b/llama_stack/templates/starter/starter.py
@ -72,17 +72,17 @@ def get_inference_providers() -> tuple[list[Provider], dict[str, list[ProviderMo
        (
            "openai",
            OPENAI_MODEL_ENTRIES,
-            OpenAIConfig.sample_run_config(api_key="${env.OPENAI_API_KEY:+}"),
+            OpenAIConfig.sample_run_config(api_key="${env.OPENAI_API_KEY:=}"),
        ),
        (
            "fireworks",
            FIREWORKS_MODEL_ENTRIES,
-            FireworksImplConfig.sample_run_config(api_key="${env.FIREWORKS_API_KEY:+}"),
+            FireworksImplConfig.sample_run_config(api_key="${env.FIREWORKS_API_KEY:=}"),
        ),
        (
            "together",
            TOGETHER_MODEL_ENTRIES,
-            TogetherImplConfig.sample_run_config(api_key="${env.TOGETHER_API_KEY:+}"),
+            TogetherImplConfig.sample_run_config(api_key="${env.TOGETHER_API_KEY:=}"),
        ),
        (
            "ollama",
@ -106,22 +106,22 @@ def get_inference_providers() -> tuple[list[Provider], dict[str, list[ProviderMo
        (
            "anthropic",
            ANTHROPIC_MODEL_ENTRIES,
-            AnthropicConfig.sample_run_config(api_key="${env.ANTHROPIC_API_KEY:+}"),
+            AnthropicConfig.sample_run_config(api_key="${env.ANTHROPIC_API_KEY:=}"),
        ),
        (
            "gemini",
            GEMINI_MODEL_ENTRIES,
-            GeminiConfig.sample_run_config(api_key="${env.GEMINI_API_KEY:+}"),
+            GeminiConfig.sample_run_config(api_key="${env.GEMINI_API_KEY:=}"),
        ),
        (
            "groq",
            GROQ_MODEL_ENTRIES,
-            GroqConfig.sample_run_config(api_key="${env.GROQ_API_KEY:+}"),
+            GroqConfig.sample_run_config(api_key="${env.GROQ_API_KEY:=}"),
        ),
        (
            "sambanova",
            SAMBANOVA_MODEL_ENTRIES,
-            SambaNovaImplConfig.sample_run_config(api_key="${env.SAMBANOVA_API_KEY:+}"),
+            SambaNovaImplConfig.sample_run_config(api_key="${env.SAMBANOVA_API_KEY:=}"),
        ),
        (
            "vllm",
@ -190,15 +190,15 @@ def get_distribution_template() -> DistributionTemplate:
        Provider(
            provider_id="${env.ENABLE_CHROMADB:+chromadb}",
            provider_type="remote::chromadb",
-            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:+}"),
+            config=ChromaVectorIOConfig.sample_run_config(url="${env.CHROMADB_URL:=}"),
        ),
        Provider(
            provider_id="${env.ENABLE_PGVECTOR:+pgvector}",
            provider_type="remote::pgvector",
            config=PGVectorVectorIOConfig.sample_run_config(
-                db="${env.PGVECTOR_DB:+}",
-                user="${env.PGVECTOR_USER:+}",
-                password="${env.PGVECTOR_PASSWORD:+}",
+                db="${env.PGVECTOR_DB:=}",
+                user="${env.PGVECTOR_USER:=}",
+                password="${env.PGVECTOR_PASSWORD:=}",
            ),
        ),
    ]
--- a/llama_stack/templates/tgi/run-with-safety.yaml
+++ b/llama_stack/templates/tgi/run-with-safety.yaml
@ -84,17 +84,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/tgi/run.yaml
+++ b/llama_stack/templates/tgi/run.yaml
@ -83,17 +83,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/together/run-with-safety.yaml
+++ b/llama_stack/templates/together/run-with-safety.yaml
@ -16,7 +16,7 @@ providers:
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
-      api_key: ${env.TOGETHER_API_KEY:+}
+      api_key: ${env.TOGETHER_API_KEY:=}
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config: {}
@ -89,17 +89,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
@ -110,7 +110,7 @@ providers:
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
 metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/together}/registry.db
--- a/llama_stack/templates/together/run.yaml
+++ b/llama_stack/templates/together/run.yaml
@ -16,7 +16,7 @@ providers:
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
-      api_key: ${env.TOGETHER_API_KEY:+}
+      api_key: ${env.TOGETHER_API_KEY:=}
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config: {}
@ -84,17 +84,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
@ -105,7 +105,7 @@ providers:
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
-      api_key: ${env.WOLFRAM_ALPHA_API_KEY:+}
+      api_key: ${env.WOLFRAM_ALPHA_API_KEY:=}
 metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/together}/registry.db
--- a/llama_stack/templates/vllm-gpu/run.yaml
+++ b/llama_stack/templates/vllm-gpu/run.yaml
@ -88,17 +88,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
--- a/llama_stack/templates/watsonx/run.yaml
+++ b/llama_stack/templates/watsonx/run.yaml
@ -16,8 +16,8 @@ providers:
    provider_type: remote::watsonx
    config:
      url: ${env.WATSONX_BASE_URL:=https://us-south.ml.cloud.ibm.com}
-      api_key: ${env.WATSONX_API_KEY:+}
-      project_id: ${env.WATSONX_PROJECT_ID:+}
+      api_key: ${env.WATSONX_API_KEY:=}
+      project_id: ${env.WATSONX_PROJECT_ID:=}
  - provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
    config: {}
@ -85,17 +85,17 @@ providers:
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
-      openai_api_key: ${env.OPENAI_API_KEY:+}
+      openai_api_key: ${env.OPENAI_API_KEY:=}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
-      api_key: ${env.BRAVE_SEARCH_API_KEY:+}
+      api_key: ${env.BRAVE_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
-      api_key: ${env.TAVILY_SEARCH_API_KEY:+}
+      api_key: ${env.TAVILY_SEARCH_API_KEY:=}
      max_results: 3
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
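
The template hunks above all swap the empty conditional form `${env.VAR:+}` for the empty-default form `${env.VAR:=}`. The sketch below is one way to see why that matters. It is a hypothetical resolver written for illustration, not the Llama Stack implementation: the function name `resolve`, the regular expression, and the choice to treat an empty variable as unset are all assumptions. It follows the documented behaviour that an empty default resolves to `None` when the variable is not set.

```python
import os
import re
from typing import Mapping, Optional

# Hypothetical pattern for illustration: ${env.NAME:=default} or ${env.NAME:+value}
_PLACEHOLDER = re.compile(r"\$\{env\.([A-Z0-9_]+):([=+])([^}]*)\}")


def resolve(template: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Expand env placeholders in `template`; return None if the result is empty."""
    env = os.environ if env is None else env

    def substitute(match: re.Match) -> str:
        name, op, arg = match.groups()
        current = env.get(name)  # assumption: an empty string counts as unset
        if op == "+":
            # Conditional form: emit `arg` only when the variable is set.
            return arg if current else ""
        # Default form: use the variable when set, otherwise fall back to `arg`.
        return current if current else arg

    expanded = _PLACEHOLDER.sub(substitute, template)
    return expanded or None
```

Under these assumptions, `resolve("${env.BRAVE_SEARCH_API_KEY:+}", {"BRAVE_SEARCH_API_KEY": "abc"})` drops the key even though it is set, because the conditional value is empty, while the `:=` form passes the key through and falls back to `None` only when the variable is missing.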
--- a/tests/integration/README.md
+++ b/tests/integration/README.md
@ -9,7 +9,9 @@ pytest --help
 ```

 Here are the most important options:
-- `--stack-config`: specify the stack config to use. You have three ways to point to a stack:
+- `--stack-config`: specify the stack config to use. You have four ways to point to a stack:
+  - **`server:<config>`** - automatically start a server with the given config (e.g., `server:fireworks`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
+  - **`server:<config>:<port>`** - same as above but with a custom port (e.g., `server:together:8322`)
  - a URL which points to a Llama Stack distribution server
  - a template (e.g., `fireworks`, `together`) or a path to a `run.yaml` file
  - a comma-separated list of api=provider pairs, e.g. `inference=fireworks,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.
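
The two `server:` forms listed above can be told apart with a small parser. This is a minimal sketch, not the fixture code from this change: `ServerSpec` and `parse_server_spec` are hypothetical names, and the fallback port assumes the `DEFAULT_PORT = 8321` constant that this diff adds to `tests/integration/fixtures/common.py`.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: mirrors DEFAULT_PORT in tests/integration/fixtures/common.py
DEFAULT_PORT = 8321


@dataclass
class ServerSpec:
    """Hypothetical container for a parsed `server:<config>[:<port>]` value."""

    config: str
    port: int


def parse_server_spec(stack_config: str) -> Optional[ServerSpec]:
    """Parse `server:<config>` or `server:<config>:<port>`; return None for other forms."""
    if not stack_config.startswith("server:"):
        # URLs, template names, run.yaml paths and api=provider lists are handled elsewhere.
        return None
    parts = stack_config.split(":")
    if len(parts) not in (2, 3) or not parts[1]:
        raise ValueError(f"expected server:<config>[:<port>], got {stack_config!r}")
    port = int(parts[2]) if len(parts) == 3 else DEFAULT_PORT
    return ServerSpec(config=parts[1], port=port)
```

With this sketch, `server:fireworks` maps to the `fireworks` config on the default port and `server:together:8322` maps to `together` on port 8322. A config path that itself contains a colon is not handled.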
@ -26,12 +28,39 @@ Model parameters can be influenced by the following options:
 Each of these is a comma-separated list and can be used to generate multiple parameter combinations. Note that tests will be skipped
 if no model is specified.

-Experimental, under development, options:
-- `--record-responses`: record new API responses instead of using cached ones
-
-
 ## Examples

+### Testing against a Server
+
+Run all text inference tests by auto-starting a server with the `fireworks` config:
+
+```bash
+pytest -s -v tests/integration/inference/test_text_inference.py \
+   --stack-config=server:fireworks \
+   --text-model=meta-llama/Llama-3.1-8B-Instruct
+```
+
+Run tests with auto-server startup on a custom port:
+
+```bash
+pytest -s -v tests/integration/inference/ \
+   --stack-config=server:together:8322 \
+   --text-model=meta-llama/Llama-3.1-8B-Instruct
+```
+
+Run multiple test suites with auto-server (eliminates manual server management):
+
+```bash
+# Auto-start server and run the inference, safety, and agents integration tests
+export FIREWORKS_API_KEY=<your_key>
+
+pytest -s -v tests/integration/inference/ tests/integration/safety/ tests/integration/agents/ \
+   --stack-config=server:fireworks \
+   --text-model=meta-llama/Llama-3.1-8B-Instruct
+```
+
+### Testing with Library Client
+
 Run all text inference tests with the `together` distribution:

 ```bash
--- a/tests/integration/fixtures/common.py
+++ b/tests/integration/fixtures/common.py
@ -6,9 +6,13 @@

 import inspect
 import os
+import socket
+import subprocess
 import tempfile
+import time

 import pytest
+import requests
 import yaml
 from llama_stack_client import LlamaStackClient
 from openai import OpenAI
@ -17,6 +21,44 @@ from llama_stack import LlamaStackAsLibraryClient
 from llama_stack.distribution.stack import run_config_from_adhoc_config_spec
 from llama_stack.env import get_env_or_fail

+DEFAULT_PORT = 8321
+
+
+def is_port_available(port: int, host: str = "localhost") -> bool:
+    """Check if a port is available for binding."""
+    try:
+        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
+            sock.bind((host, port))
+            return True
+    except OSError:
+        return False
+
+
+def start_llama_stack_server(config_name: str) -> subprocess.Popen:
+    """Start a llama stack server with the given config."""
+    cmd = ["llama", "stack", "run", config_name]
+
+    # Start server in background
+    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
+    return process
+
+
+def wait_for_server_ready(base_url: str, timeout: int = 120) -> bool:
+    """Wait for the server to be ready by polling the health endpoint."""
+    health_url = f"{base_url}/v1/health"
+    start_time = time.time()
+
+    while time.time() - start_time < timeout:
+        try:
+            response = requests.get(health_url, timeout=5)
+            if response.status_code == 200:
+                return True
+        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
+            pass
+        time.sleep(0.5)
+
+    return False
+

@pytest.fixture(scope="session")
 def provider_data():
@ -122,6 +164,40 @@ def llama_stack_client(request, provider_data):
    if not config:
        raise ValueError("You must specify either --stack-config or LLAMA_STACK_CONFIG")

+    # Handle server:<config_name> format or server:<config_name>:<port>
+    if config.startswith("server:"):
+        parts = config.split(":")
+        config_name = parts[1]
+        port = int(parts[2]) if len(parts) > 2 else int(os.environ.get("LLAMA_STACK_PORT", DEFAULT_PORT))
+        base_url = f"http://localhost:{port}"
+
+        # Check if port is available
+        if is_port_available(port):
+            print(f"Starting llama stack server with config '{config_name}' on port {port}...")
+
+            # Start server
+            server_process = start_llama_stack_server(config_name)
+
+            # Wait for server to be ready
+            if not wait_for_server_ready(base_url, timeout=120):
+                print("Server failed to start within timeout")
+                server_process.terminate()
+                raise RuntimeError(
+                    f"Server failed to start within timeout. Check that config '{config_name}' exists and is valid."
+                )
+
+            print(f"Server is ready at {base_url}")
+
+            # Store the process on the session object so teardown code can terminate it at session end
+            request.session._llama_stack_server_process = server_process
+        else:
+            print(f"Port {port} is already in use, assuming server is already running...")
+
+        return LlamaStackClient(
+            base_url=base_url,
+            provider_data=provider_data,
+        )
+
    # check if this looks like a URL
    if config.startswith("http") or "//" in config:
        return LlamaStackClient(
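
The `server:<config>[:<port>]` handling added to the `llama_stack_client` fixture above can be isolated as a small helper. This is a minimal sketch, not code from this change: `parse_server_spec` and its `env` parameter are hypothetical names introduced for illustration, while `DEFAULT_PORT` and the `LLAMA_STACK_PORT` fallback mirror the fixture. Unlike the fixture, the sketch rejects a missing config name explicitly.

```python
import os

DEFAULT_PORT = 8321  # same default as the fixture in the diff


def parse_server_spec(spec: str, env=None) -> tuple[str, int]:
    """Split 'server:<config>' or 'server:<config>:<port>' into (config, port).

    Hypothetical helper for illustration; it is not part of llama_stack.
    """
    env = os.environ if env is None else env
    parts = spec.split(":")
    if parts[0] != "server" or len(parts) < 2 or not parts[1]:
        raise ValueError(f"expected 'server:<config>[:<port>]', got {spec!r}")
    config_name = parts[1]
    if len(parts) > 2:
        # An explicit port in the spec takes precedence
        port = int(parts[2])
    else:
        # Otherwise fall back to LLAMA_STACK_PORT, then to the default
        port = int(env.get("LLAMA_STACK_PORT", DEFAULT_PORT))
    return config_name, port


if __name__ == "__main__":
    print(parse_server_spec("server:fireworks", env={}))
    print(parse_server_spec("server:together:8322", env={}))
```

Keeping the parsing separate from the process start-up makes the precedence (explicit port, then environment variable, then default) easy to check without launching a server.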
--- a/tests/integration/vector_io/test_vector_io.py
+++ b/tests/integration/vector_io/test_vector_io.py
@ -123,6 +123,9 @@ def test_insert_chunks(client_with_empty_registry, embedding_model_id, embedding


 def test_insert_chunks_with_precomputed_embeddings(client_with_empty_registry, embedding_model_id, embedding_dimension):
+    vector_io_provider_params_dict = {
+        "inline::milvus": {"score_threshold": -1.0},
+    }
    vector_db_id = "test_precomputed_embeddings_db"
    client_with_empty_registry.vector_dbs.register(
        vector_db_id=vector_db_id,
@ -133,7 +136,7 @@ def test_insert_chunks_with_precomputed_embeddings(client_with_empty_registry, e
    chunks_with_embeddings = [
        Chunk(
            content="This is a test chunk with precomputed embedding.",
-            metadata={"document_id": "doc1", "source": "precomputed"},
+            metadata={"document_id": "doc1", "source": "precomputed", "chunk_id": "chunk1"},
            embedding=[0.1] * int(embedding_dimension),
        ),
    ]
@ -143,22 +146,29 @@ def test_insert_chunks_with_precomputed_embeddings(client_with_empty_registry, e
        chunks=chunks_with_embeddings,
    )

-    # Query for the first document
+    provider = [p.provider_id for p in client_with_empty_registry.providers.list() if p.api == "vector_io"][0]
    response = client_with_empty_registry.vector_io.query(
        vector_db_id=vector_db_id,
        query="precomputed embedding test",
+        params=vector_io_provider_params_dict.get(provider, None),
    )

    # Verify the top result is the expected document
    assert response is not None
-    assert len(response.chunks) > 0
+    assert len(response.chunks) > 0, (
+        f"provider params for {provider} = {vector_io_provider_params_dict.get(provider, None)}"
+    )
    assert response.chunks[0].metadata["document_id"] == "doc1"
    assert response.chunks[0].metadata["source"] == "precomputed"


+# expect this test to fail
 def test_query_returns_valid_object_when_identical_to_embedding_in_vdb(
    client_with_empty_registry, embedding_model_id, embedding_dimension
 ):
+    vector_io_provider_params_dict = {
+        "inline::milvus": {"score_threshold": 0.0},
+    }
    vector_db_id = "test_precomputed_embeddings_db"
    client_with_empty_registry.vector_dbs.register(
        vector_db_id=vector_db_id,
@ -179,9 +189,11 @@ def test_query_returns_valid_object_when_identical_to_embedding_in_vdb(
        chunks=chunks_with_embeddings,
    )

+    provider = [p.provider_id for p in client_with_empty_registry.providers.list() if p.api == "vector_io"][0]
    response = client_with_empty_registry.vector_io.query(
        vector_db_id=vector_db_id,
        query="duplicate",
+        params=vector_io_provider_params_dict.get(provider, None),
    )

    # Verify the top result is the expected document
--- a/tests/unit/server/test_replace_env_vars.py
+++ b/tests/unit/server/test_replace_env_vars.py
@ -34,6 +34,12 @@ class TestReplaceEnvVars(unittest.TestCase):
    def test_default_value_when_empty(self):
        self.assertEqual(replace_env_vars("${env.EMPTY_VAR:=default}"), "default")

+    def test_none_value_when_empty(self):
+        self.assertEqual(replace_env_vars("${env.EMPTY_VAR:=}"), None)
+
+    def test_value_when_set(self):
+        self.assertEqual(replace_env_vars("${env.TEST_VAR:=}"), "test_value")
+
    def test_empty_var_no_default(self):
        self.assertEqual(replace_env_vars("${env.EMPTY_VAR_NO_DEFAULT:+}"), None)
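
The `:=` cases exercised by these tests can be summarized with a small standalone resolver. This is a simplified sketch under stated assumptions, not the `replace_env_vars` implementation from llama_stack: `resolve_default` is a hypothetical name, it handles only a single whole-string `${env.NAME:=default}` expression, and it treats an unset variable and an empty one the same way.

```python
import re

# Matches a whole-string expression of the form ${env.NAME:=default}
_PATTERN = re.compile(r"^\$\{env\.([A-Z0-9_]+):=([^}]*)\}$")


def resolve_default(expr: str, env: dict):
    """Resolve '${env.NAME:=default}' against env (illustrative only)."""
    match = _PATTERN.match(expr)
    if match is None:
        return expr  # not an env expression; return it unchanged
    name, default = match.group(1), match.group(2)
    value = env.get(name, "")
    if value:
        return value
    # An empty default resolves to None instead of raising an error
    return default if default else None


if __name__ == "__main__":
    env = {"TEST_VAR": "test_value", "EMPTY_VAR": ""}
    print(resolve_default("${env.EMPTY_VAR:=default}", env))
    print(resolve_default("${env.EMPTY_VAR:=}", env))
    print(resolve_default("${env.TEST_VAR:=}", env))
```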

--- a/tests/verifications/openai_api/fixtures/test_cases/responses.yaml
+++ b/tests/verifications/openai_api/fixtures/test_cases/responses.yaml
@ -8,6 +8,17 @@ test_response_basic:
    - case_id: "saturn"
      input: "Which planet has rings around it with a name starting with letter S?"
      output: "saturn"
+    - case_id: "image_input"
+      input:
+      - role: user
+        content:
+        - type: input_text
+          text: "what teams are playing in this image?"
+      - role: user
+        content:
+        - type: input_image
+          image_url: "https://upload.wikimedia.org/wikipedia/commons/3/3b/LeBron_James_Layup_%28Cleveland_vs_Brooklyn_2018%29.jpg"
+      output: "brooklyn nets"

 test_response_multi_turn:
  test_name: test_response_multi_turn