feat: Podman AI Lab provider and distribution
Signed-off-by: Jeff MAURY <jmaury@redhat.com>
parent 45e08ff417
commit dd86427ce3
14 changed files with 1131 additions and 0 deletions
docs/source/distributions/self_hosted_distro/podman-ai-lab.md (new file, 141 lines)
@@ -0,0 +1,141 @@
---
orphan: true
---
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
# Podman AI Lab Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-podman-ai-lab` distribution consists of the following provider configurations.

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| inference | `remote::podman-ai-lab` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::code-interpreter`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |

You should use this distribution if you have a regular desktop machine without very powerful GPUs. Of course, if you have powerful GPUs, you can still continue using this distribution since Podman AI Lab supports GPU acceleration.

### Environment Variables

The following environment variables can be configured:

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
- `PODMAN_AI_LAB_URL`: URL of the Podman AI Lab server (default: `http://127.0.0.1:10434`)
- `SAFETY_MODEL`: Safety model loaded into the Podman AI Lab server (default: `meta-llama/Llama-Guard-3-1B`)

## Setting up Podman AI Lab server

Please check the [Podman AI Lab Documentation](https://github.com/containers/podman-desktop-extension-ai-lab) on how to install and run Podman AI Lab. After installing the extension, download a model and start an inference server for it from Podman Desktop.

If you are using Llama Stack Safety / Shield APIs, you will also need to download and serve the safety model.

```bash
export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"

# Podman AI Lab's Ollama-compatible API names this model differently, and we must use that name when loading the model
export PODMAN_AI_LAB_SAFETY_MODEL="llama-guard3:1b"
```

## Running Llama Stack

Now you are ready to run Llama Stack with Podman AI Lab as the inference provider. You can do this via Conda (build the code) or Docker (which has a pre-built image).

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
export LLAMA_STACK_PORT=5001
docker run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-podman-ai-lab \
  --port $LLAMA_STACK_PORT \
  --env PODMAN_AI_LAB_URL=http://host.docker.internal:10434
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

docker run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -v ./llama_stack/templates/podman-ai-lab/run-with-safety.yaml:/root/my-run.yaml \
  llamastack/distribution-podman-ai-lab \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env PODMAN_AI_LAB_URL=http://host.docker.internal:10434
```

### Via Conda

Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
export LLAMA_STACK_PORT=5001

llama stack build --template podman-ai-lab --image-type conda
llama stack run ./run.yaml \
  --port $LLAMA_STACK_PORT \
  --env PODMAN_AI_LAB_URL=http://localhost:10434
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
llama stack run ./run-with-safety.yaml \
  --port $LLAMA_STACK_PORT \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env PODMAN_AI_LAB_URL=http://localhost:10434
```

### (Optional) Update Model Serving Configuration

To serve a new model with `Podman AI Lab`:
- launch Podman Desktop with the Podman AI Lab extension installed
- download the model
- start an inference server for the model

To make sure that the model is being served correctly, run `curl localhost:10434/api/tags` to get the list of models being served by Podman AI Lab.

```bash
$ curl localhost:10434/api/tags
{"models":[{"model":"hf.ibm-research.granite-3.2-8b-instruct-GGUF","name":"ibm-research/granite-3.2-8b-instruct-GGUF","digest":"363f0bbc3200b9c9b0ab87efe237d77b1e05bb929d5d7e4b57c1447c911223e8","size":4942859552,"modified_at":"2025-03-17T14:48:32.417Z","details":{}}]}
```

To verify that the model served by Podman AI Lab is correctly connected to the Llama Stack server, run:

```bash
$ llama-stack-client models list

Available Models

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ model_type   ┃ identifier                                     ┃ provider_resource_id                          ┃ metadata  ┃ provider_id    ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ llm          │ ibm-research/granite-3.2-8b-instruct-GGUF      │ ibm-research/granite-3.2-8b-instruct-GGUF     │           │ podman-ai-lab  │
└──────────────┴────────────────────────────────────────────────┴───────────────────────────────────────────────┴───────────┴────────────────┘

Total models: 1
```
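
As a final smoke test, you can send a chat completion through the stack. Below is a minimal sketch using the `llama-stack-client` Python SDK; the model identifier is illustrative and should be one returned by `llama-stack-client models list`:

```python
import os

from llama_stack_client import LlamaStackClient

# Points at the Llama Stack server started above (not at Podman AI Lab directly).
client = LlamaStackClient(base_url=f"http://localhost:{os.environ.get('LLAMA_STACK_PORT', '5001')}")

response = client.inference.chat_completion(
    model_id="ibm-research/granite-3.2-8b-instruct-GGUF",  # illustrative; take it from `models list`
    messages=[{"role": "user", "content": "Write a haiku about containers."}],
)
print(response.completion_message.content)
```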

@@ -77,6 +77,16 @@ def available_providers() -> List[ProviderSpec]:
                module="llama_stack.providers.remote.inference.ollama",
            ),
        ),
        remote_provider_spec(
            api=Api.inference,
            api_dependencies=[Api.models],
            adapter=AdapterSpec(
                adapter_type="podman-ai-lab",
                pip_packages=["ollama", "aiohttp"],
                config_class="llama_stack.providers.remote.inference.podman_ai_lab.PodmanAILabImplConfig",
                module="llama_stack.providers.remote.inference.podman_ai_lab",
            ),
        ),
        remote_provider_spec(
            api=Api.inference,
            adapter=AdapterSpec(
@@ -0,0 +1,18 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from typing import Any, Dict

from llama_stack.apis.datatypes import Api

from .config import PodmanAILabImplConfig


async def get_adapter_impl(config: PodmanAILabImplConfig, deps: Dict[Api, Any]):
    from .podman_ai_lab import PodmanAILabInferenceAdapter

    impl = PodmanAILabInferenceAdapter(config.url, deps[Api.models])
    await impl.initialize()
    return impl
@@ -0,0 +1,21 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import Any, Dict

from pydantic import BaseModel

DEFAULT_PODMAN_AI_LAB_URL = "http://localhost:10434"


class PodmanAILabImplConfig(BaseModel):
    url: str = DEFAULT_PODMAN_AI_LAB_URL

    @classmethod
    def sample_run_config(
        cls, url: str = "${env.PODMAN_AI_LAB_URL:http://localhost:10434}", **kwargs
    ) -> Dict[str, Any]:
        return {"url": url}
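
For reference, `sample_run_config` is what the distribution template below uses to emit the `inference` section of the run YAML. A quick sketch (assuming the package is importable) shows the unresolved `${env.…}` placeholder it returns, which Llama Stack substitutes against the environment when `run.yaml` is loaded:

```python
from llama_stack.providers.remote.inference.podman_ai_lab import PodmanAILabImplConfig

# The placeholder is resolved against the environment at run.yaml load time.
print(PodmanAILabImplConfig.sample_run_config())
# {'url': '${env.PODMAN_AI_LAB_URL:http://localhost:10434}'}
```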

@@ -0,0 +1,294 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.


from typing import AsyncGenerator, List, Optional, Union

from ollama import AsyncClient

from llama_stack.apis.common.content_types import (
    ImageContentItem,
    InterleavedContent,
    InterleavedContentItem,
    TextContentItem,
)
from llama_stack.apis.inference import (
    ChatCompletionRequest,
    ChatCompletionResponse,
    CompletionRequest,
    EmbeddingsResponse,
    EmbeddingTaskType,
    Inference,
    LogProbConfig,
    Message,
    ResponseFormat,
    SamplingParams,
    TextTruncation,
    ToolChoice,
    ToolConfig,
    ToolDefinition,
    ToolPromptFormat,
)
from llama_stack.apis.models import Model, Models
from llama_stack.log import get_logger
from llama_stack.providers.datatypes import ModelsProtocolPrivate
from llama_stack.providers.utils.inference.openai_compat import (
    OpenAICompatCompletionChoice,
    OpenAICompatCompletionResponse,
    get_sampling_options,
    process_chat_completion_response,
    process_chat_completion_stream_response,
    process_completion_response,
    process_completion_stream_response,
)
from llama_stack.providers.utils.inference.prompt_adapter import (
    chat_completion_request_to_prompt,
    completion_request_to_prompt,
    convert_image_content_to_url,
    request_has_media,
)

logger = get_logger(name=__name__, category="inference")

class PodmanAILabInferenceAdapter(Inference, ModelsProtocolPrivate):
    def __init__(self, url: str, models: Models) -> None:
        self.url = url
        self.models = models

    @property
    def client(self) -> AsyncClient:
        return AsyncClient(host=self.url)

    async def initialize(self) -> None:
        logger.info(f"checking connectivity to Podman AI Lab at `{self.url}`...")
        try:
            await self.client.list()
            # for model in response["models"]:
            #     await self.models.register_model(model.model, model.model, 'podman-ai-lab')
        except ConnectionError as e:
            raise RuntimeError("Podman AI Lab Server is not running, start it using Podman Desktop") from e

    async def shutdown(self) -> None:
        pass

    async def unregister_model(self, model_id: str) -> None:
        pass

    async def completion(
        self,
        model_id: str,
        content: InterleavedContent,
        sampling_params: Optional[SamplingParams] = None,
        response_format: Optional[ResponseFormat] = None,
        stream: Optional[bool] = False,
        logprobs: Optional[LogProbConfig] = None,
    ) -> AsyncGenerator:
        if sampling_params is None:
            sampling_params = SamplingParams()
        # model_store is injected into ModelsProtocolPrivate impls by the stack's routing layer
        model = await self.model_store.get_model(model_id)
        request = CompletionRequest(
            model=model.provider_resource_id,
            content=content,
            sampling_params=sampling_params,
            response_format=response_format,
            stream=stream,
            logprobs=logprobs,
        )
        if stream:
            return self._stream_completion(request)
        else:
            return await self._nonstream_completion(request)

    async def _stream_completion(self, request: CompletionRequest) -> AsyncGenerator:
        params = await self._get_params(request)

        async def _generate_and_convert_to_openai_compat():
            s = await self.client.generate(**params)
            async for chunk in s:
                choice = OpenAICompatCompletionChoice(
                    finish_reason=chunk["done_reason"] if chunk["done"] else None,
                    text=chunk["response"],
                )
                yield OpenAICompatCompletionResponse(
                    choices=[choice],
                )

        stream = _generate_and_convert_to_openai_compat()
        async for chunk in process_completion_stream_response(stream):
            yield chunk

    async def _nonstream_completion(self, request: CompletionRequest) -> AsyncGenerator:
        params = await self._get_params(request)
        r = await self.client.generate(**params)

        choice = OpenAICompatCompletionChoice(
            finish_reason=r["done_reason"] if r["done"] else None,
            text=r["response"],
        )
        response = OpenAICompatCompletionResponse(
            choices=[choice],
        )

        return process_completion_response(response)

    async def chat_completion(
        self,
        model_id: str,
        messages: List[Message],
        sampling_params: Optional[SamplingParams] = None,
        response_format: Optional[ResponseFormat] = None,
        tools: Optional[List[ToolDefinition]] = None,
        tool_choice: Optional[ToolChoice] = ToolChoice.auto,
        tool_prompt_format: Optional[ToolPromptFormat] = None,
        stream: Optional[bool] = False,
        logprobs: Optional[LogProbConfig] = None,
        tool_config: Optional[ToolConfig] = None,
    ) -> AsyncGenerator:
        if sampling_params is None:
            sampling_params = SamplingParams()
        model = await self.model_store.get_model(model_id)
        request = ChatCompletionRequest(
            model=model.provider_resource_id,
            messages=messages,
            sampling_params=sampling_params,
            tools=tools or [],
            stream=stream,
            logprobs=logprobs,
            response_format=response_format,
            tool_config=tool_config,
        )
        if stream:
            return self._stream_chat_completion(request)
        else:
            return await self._nonstream_chat_completion(request)

    async def _get_params(self, request: Union[ChatCompletionRequest, CompletionRequest]) -> dict:
        sampling_options = get_sampling_options(request.sampling_params)
        # This is needed since the Ollama API expects num_predict to be set
        # for early truncation instead of max_tokens.
        if sampling_options.get("max_tokens") is not None:
            sampling_options["num_predict"] = sampling_options["max_tokens"]

        input_dict = {}
        media_present = request_has_media(request)
        # No register_helper is set up in __init__ (the Ollama adapter gets one from
        # ModelRegistryHelper), so fall back to None, which routes chat requests
        # through the Ollama-native "messages" path below.
        register_helper = getattr(self, "register_helper", None)
        llama_model = register_helper.get_llama_model(request.model) if register_helper else None
        if isinstance(request, ChatCompletionRequest):
            if media_present or not llama_model:
                contents = [await convert_message_to_openai_dict_for_podman_ai_lab(m) for m in request.messages]
                # flatten the list of lists
                input_dict["messages"] = [item for sublist in contents for item in sublist]
            else:
                input_dict["raw"] = True
                input_dict["prompt"] = await chat_completion_request_to_prompt(
                    request,
                    llama_model,
                )
        else:
            assert not media_present, "Podman AI Lab does not support media for Completion requests"
            input_dict["prompt"] = await completion_request_to_prompt(request)
            input_dict["raw"] = True

        if fmt := request.response_format:
            if fmt.type == "json_schema":
                input_dict["format"] = fmt.json_schema
            elif fmt.type == "grammar":
                raise NotImplementedError("Grammar response format is not supported")
            else:
                raise ValueError(f"Unknown response format type: {fmt.type}")

        params = {
            "model": request.model,
            **input_dict,
            "options": sampling_options,
            "stream": request.stream,
        }
        logger.debug(f"params to Podman AI Lab: {params}")

        return params

    async def _nonstream_chat_completion(self, request: ChatCompletionRequest) -> ChatCompletionResponse:
        params = await self._get_params(request)
        if "messages" in params:
            r = await self.client.chat(**params)
        else:
            r = await self.client.generate(**params)

        if "message" in r:
            choice = OpenAICompatCompletionChoice(
                finish_reason=r["done_reason"] if r["done"] else None,
                text=r["message"]["content"],
            )
        else:
            choice = OpenAICompatCompletionChoice(
                finish_reason=r["done_reason"] if r["done"] else None,
                text=r["response"],
            )
        response = OpenAICompatCompletionResponse(
            choices=[choice],
        )
        return process_chat_completion_response(response, request)

    async def _stream_chat_completion(self, request: ChatCompletionRequest) -> AsyncGenerator:
        params = await self._get_params(request)

        async def _generate_and_convert_to_openai_compat():
            if "messages" in params:
                s = await self.client.chat(**params)
            else:
                s = await self.client.generate(**params)
            async for chunk in s:
                if "message" in chunk:
                    choice = OpenAICompatCompletionChoice(
                        finish_reason=chunk["done_reason"] if chunk["done"] else None,
                        text=chunk["message"]["content"],
                    )
                else:
                    choice = OpenAICompatCompletionChoice(
                        finish_reason=chunk["done_reason"] if chunk["done"] else None,
                        text=chunk["response"],
                    )
                yield OpenAICompatCompletionResponse(
                    choices=[choice],
                )

        stream = _generate_and_convert_to_openai_compat()
        async for chunk in process_chat_completion_stream_response(stream, request):
            yield chunk

    async def embeddings(
        self,
        model_id: str,
        contents: List[str] | List[InterleavedContentItem],
        text_truncation: Optional[TextTruncation] = TextTruncation.none,
        output_dimension: Optional[int] = None,
        task_type: Optional[EmbeddingTaskType] = None,
    ) -> EmbeddingsResponse:
        raise NotImplementedError("embeddings endpoint is not implemented")

    async def register_model(self, model: Model) -> Model:
        return model


async def convert_message_to_openai_dict_for_podman_ai_lab(message: Message) -> List[dict]:
    async def _convert_content(content) -> dict:
        if isinstance(content, ImageContentItem):
            return {
                "role": message.role,
                "images": [await convert_image_content_to_url(content, download=True, include_format=False)],
            }
        else:
            text = content.text if isinstance(content, TextContentItem) else content
            assert isinstance(text, str)
            return {
                "role": message.role,
                "content": text,
            }

    if isinstance(message.content, list):
        return [await _convert_content(c) for c in message.content]
    else:
        return [await _convert_content(message.content)]
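
The adapter relies on Podman AI Lab exposing an Ollama-compatible endpoint. A minimal standalone sketch of the connectivity probe that `initialize()` performs, assuming a server on the default `localhost:10434`:

```python
import asyncio

from ollama import AsyncClient


async def main() -> None:
    # Same client and call the adapter's initialize() uses to probe the server.
    client = AsyncClient(host="http://localhost:10434")
    print(await client.list())


asyncio.run(main())
```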

@@ -536,6 +536,44 @@
        "sentence-transformers --no-deps",
        "torch torchvision --index-url https://download.pytorch.org/whl/cpu"
    ],
    "podman-ai-lab": [
        "aiohttp",
        "aiosqlite",
        "autoevals",
        "blobfile",
        "chardet",
        "chromadb-client",
        "datasets",
        "emoji",
        "faiss-cpu",
        "fastapi",
        "fire",
        "httpx",
        "langdetect",
        "matplotlib",
        "mcp",
        "nltk",
        "numpy",
        "ollama",
        "openai",
        "opentelemetry-exporter-otlp-proto-http",
        "opentelemetry-sdk",
        "pandas",
        "pillow",
        "psycopg2-binary",
        "pymongo",
        "pypdf",
        "pythainlp",
        "redis",
        "requests",
        "scikit-learn",
        "scipy",
        "sentencepiece",
        "tqdm",
        "transformers",
        "tree_sitter",
        "uvicorn"
    ],
    "remote-vllm": [
        "aiosqlite",
        "autoevals",
llama_stack/templates/podman-ai-lab/__init__.py (new file, 7 lines)

@@ -0,0 +1,7 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .podman_ai_lab import get_distribution_template  # noqa: F401
llama_stack/templates/podman-ai-lab/build.yaml (new file, 33 lines)

@@ -0,0 +1,33 @@
version: '2'
distribution_spec:
  description: Use (an external) Podman AI Lab server for running LLM inference
  providers:
    inference:
    - remote::podman-ai-lab
    vector_io:
    - inline::faiss
    - remote::chromadb
    - remote::pgvector
    safety:
    - inline::llama-guard
    agents:
    - inline::meta-reference
    telemetry:
    - inline::meta-reference
    eval:
    - inline::meta-reference
    datasetio:
    - remote::huggingface
    - inline::localfs
    scoring:
    - inline::basic
    - inline::llm-as-judge
    - inline::braintrust
    tool_runtime:
    - remote::brave-search
    - remote::tavily-search
    - inline::code-interpreter
    - inline::rag-runtime
    - remote::model-context-protocol
    - remote::wolfram-alpha
image_type: conda
llama_stack/templates/podman-ai-lab/doc_template.md (new file, 131 lines)

@@ -0,0 +1,131 @@
---
orphan: true
---
# Podman AI Lab Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations.

{{ providers_table }}

You should use this distribution if you have a regular desktop machine without very powerful GPUs. Of course, if you have powerful GPUs, you can still continue using this distribution since Podman AI Lab supports GPU acceleration.

{% if run_config_env_vars %}
### Environment Variables

The following environment variables can be configured:

{% for var, (default_value, description) in run_config_env_vars.items() %}
- `{{ var }}`: {{ description }} (default: `{{ default_value }}`)
{% endfor %}
{% endif %}


## Setting up Podman AI Lab server

Please check the [Podman AI Lab Documentation](https://github.com/containers/podman-desktop-extension-ai-lab) on how to install and run Podman AI Lab. After installing the extension, download a model and start an inference server for it from Podman Desktop.

If you are using Llama Stack Safety / Shield APIs, you will also need to download and serve the safety model.

```bash
export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"

# Podman AI Lab's Ollama-compatible API names this model differently, and we must use that name when loading the model
export PODMAN_AI_LAB_SAFETY_MODEL="llama-guard3:1b"
```

## Running Llama Stack

Now you are ready to run Llama Stack with Podman AI Lab as the inference provider. You can do this via Conda (build the code) or Docker (which has a pre-built image).

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
export LLAMA_STACK_PORT=5001
docker run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  llamastack/distribution-{{ name }} \
  --port $LLAMA_STACK_PORT \
  --env PODMAN_AI_LAB_URL=http://host.docker.internal:10434
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
# You need a local checkout of llama-stack to run this, get it using
# git clone https://github.com/meta-llama/llama-stack.git
cd /path/to/llama-stack

docker run \
  -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -v ./llama_stack/templates/{{ name }}/run-with-safety.yaml:/root/my-run.yaml \
  llamastack/distribution-{{ name }} \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env PODMAN_AI_LAB_URL=http://host.docker.internal:10434
```

### Via Conda

Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
export LLAMA_STACK_PORT=5001

llama stack build --template {{ name }} --image-type conda
llama stack run ./run.yaml \
  --port $LLAMA_STACK_PORT \
  --env PODMAN_AI_LAB_URL=http://localhost:10434
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
llama stack run ./run-with-safety.yaml \
  --port $LLAMA_STACK_PORT \
  --env SAFETY_MODEL=$SAFETY_MODEL \
  --env PODMAN_AI_LAB_URL=http://localhost:10434
```


### (Optional) Update Model Serving Configuration

To serve a new model with `Podman AI Lab`:
- launch Podman Desktop with the Podman AI Lab extension installed
- download the model
- start an inference server for the model

To make sure that the model is being served correctly, run `curl localhost:10434/api/tags` to get the list of models being served by Podman AI Lab.

```bash
$ curl localhost:10434/api/tags
{"models":[{"model":"hf.ibm-research.granite-3.2-8b-instruct-GGUF","name":"ibm-research/granite-3.2-8b-instruct-GGUF","digest":"363f0bbc3200b9c9b0ab87efe237d77b1e05bb929d5d7e4b57c1447c911223e8","size":4942859552,"modified_at":"2025-03-17T14:48:32.417Z","details":{}}]}
```

To verify that the model served by Podman AI Lab is correctly connected to the Llama Stack server, run:

```bash
$ llama-stack-client models list

Available Models

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ model_type   ┃ identifier                                     ┃ provider_resource_id                          ┃ metadata  ┃ provider_id    ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ llm          │ ibm-research/granite-3.2-8b-instruct-GGUF      │ ibm-research/granite-3.2-8b-instruct-GGUF     │           │ podman-ai-lab  │
└──────────────┴────────────────────────────────────────────────┴───────────────────────────────────────────────┴───────────┴────────────────┘

Total models: 1
```
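
For illustration, the `{{ var }}`-style placeholders above are standard Jinja syntax; a minimal sketch of how one environment-variable line renders (assuming `jinja2` is installed; the real rendering is done by `distro_codegen.py`):

```python
from jinja2 import Template

# One line of the env-var loop above, rendered with sample values.
line = Template("- `{{ var }}`: {{ description }} (default: `{{ default_value }}`)")
print(line.render(
    var="PODMAN_AI_LAB_URL",
    description="URL of the Podman AI Lab server",
    default_value="http://127.0.0.1:10434",
))
# -> - `PODMAN_AI_LAB_URL`: URL of the Podman AI Lab server (default: `http://127.0.0.1:10434`)
```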
llama_stack/templates/podman-ai-lab/podman_ai_lab.py (new file, 137 lines)

@@ -0,0 +1,137 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from pathlib import Path

from llama_stack.distribution.datatypes import (
    ModelInput,
    Provider,
    ShieldInput,
    ToolGroupInput,
)
from llama_stack.providers.inline.vector_io.faiss.config import FaissVectorIOConfig
from llama_stack.providers.remote.inference.podman_ai_lab import PodmanAILabImplConfig
from llama_stack.templates.template import DistributionTemplate, RunConfigSettings


def get_distribution_template() -> DistributionTemplate:
    providers = {
        "inference": ["remote::podman-ai-lab"],
        "vector_io": ["inline::faiss", "remote::chromadb", "remote::pgvector"],
        "safety": ["inline::llama-guard"],
        "agents": ["inline::meta-reference"],
        "telemetry": ["inline::meta-reference"],
        "eval": ["inline::meta-reference"],
        "datasetio": ["remote::huggingface", "inline::localfs"],
        "scoring": ["inline::basic", "inline::llm-as-judge", "inline::braintrust"],
        "tool_runtime": [
            "remote::brave-search",
            "remote::tavily-search",
            "inline::code-interpreter",
            "inline::rag-runtime",
            "remote::model-context-protocol",
            "remote::wolfram-alpha",
        ],
    }
    name = "podman-ai-lab"
    inference_provider = Provider(
        provider_id="podman-ai-lab",
        provider_type="remote::podman-ai-lab",
        config=PodmanAILabImplConfig.sample_run_config(),
    )
    vector_io_provider_faiss = Provider(
        provider_id="faiss",
        provider_type="inline::faiss",
        config=FaissVectorIOConfig.sample_run_config(f"~/.llama/distributions/{name}"),
    )

    safety_model = ModelInput(
        model_id="${env.SAFETY_MODEL}",
        provider_id="podman-ai-lab",  # matches the inference provider defined above
    )
    default_tool_groups = [
        ToolGroupInput(
            toolgroup_id="builtin::websearch",
            provider_id="tavily-search",
        ),
        ToolGroupInput(
            toolgroup_id="builtin::rag",
            provider_id="rag-runtime",
        ),
        ToolGroupInput(
            toolgroup_id="builtin::code_interpreter",
            provider_id="code-interpreter",
        ),
        ToolGroupInput(
            toolgroup_id="builtin::wolfram_alpha",
            provider_id="wolfram-alpha",
        ),
    ]

    return DistributionTemplate(
        name=name,
        distro_type="self_hosted",
        description="Use (an external) Podman AI Lab server for running LLM inference",
        container_image=None,
        template_path=Path(__file__).parent / "doc_template.md",
        providers=providers,
        run_configs={
            "run.yaml": RunConfigSettings(
                provider_overrides={
                    "inference": [inference_provider],
                    "vector_io": [vector_io_provider_faiss],
                },
                default_models=[],
                default_tool_groups=default_tool_groups,
            ),
            "run-with-safety.yaml": RunConfigSettings(
                provider_overrides={
                    "inference": [inference_provider],
                    "vector_io": [vector_io_provider_faiss],
                    "safety": [
                        Provider(
                            provider_id="llama-guard",
                            provider_type="inline::llama-guard",
                            config={},
                        ),
                        Provider(
                            provider_id="code-scanner",
                            provider_type="inline::code-scanner",
                            config={},
                        ),
                    ],
                },
                default_models=[
                    safety_model,
                ],
                default_shields=[
                    ShieldInput(
                        shield_id="${env.SAFETY_MODEL}",
                        provider_id="llama-guard",
                    ),
                    ShieldInput(
                        shield_id="CodeScanner",
                        provider_id="code-scanner",
                    ),
                ],
                default_tool_groups=default_tool_groups,
            ),
        },
        run_config_env_vars={
            "LLAMA_STACK_PORT": (
                "5001",
                "Port for the Llama Stack distribution server",
            ),
            "PODMAN_AI_LAB_URL": (
                "http://127.0.0.1:10434",
                "URL of the Podman AI Lab server",
            ),
            "SAFETY_MODEL": (
                "meta-llama/Llama-Guard-3-1B",
                "Safety model loaded into the Podman AI Lab server",
            ),
        },
    )
llama_stack/templates/podman-ai-lab/report.md (new file, 44 lines)

@@ -0,0 +1,44 @@
# Report for Podman AI Lab distribution

## Supported Models
| Model Descriptor | podman-ai-lab |
|:---|:---|
| Llama-3-8B-Instruct | ❌ |
| Llama-3-70B-Instruct | ❌ |
| Llama3.1-8B-Instruct | ✅ |
| Llama3.1-70B-Instruct | ✅ |
| Llama3.1-405B-Instruct | ✅ |
| Llama3.2-1B-Instruct | ✅ |
| Llama3.2-3B-Instruct | ✅ |
| Llama3.2-11B-Vision-Instruct | ✅ |
| Llama3.2-90B-Vision-Instruct | ✅ |
| Llama3.3-70B-Instruct | ✅ |
| Llama-Guard-3-11B-Vision | ❌ |
| Llama-Guard-3-1B | ✅ |
| Llama-Guard-3-8B | ✅ |
| Llama-Guard-2-8B | ❌ |

## Inference
| Model | API | Capability | Test | Status |
|:-----|:-----|:-----|:-----|:-----|
| Llama-3.1-8B-Instruct | /chat_completion | streaming | test_text_chat_completion_streaming | ✅ |
| Llama-3.2-11B-Vision-Instruct | /chat_completion | streaming | test_image_chat_completion_streaming | ❌ |
| Llama-3.2-11B-Vision-Instruct | /chat_completion | non_streaming | test_image_chat_completion_non_streaming | ❌ |
| Llama-3.1-8B-Instruct | /chat_completion | non_streaming | test_text_chat_completion_non_streaming | ✅ |
| Llama-3.1-8B-Instruct | /chat_completion | tool_calling | test_text_chat_completion_with_tool_calling_and_streaming | ✅ |
| Llama-3.1-8B-Instruct | /chat_completion | tool_calling | test_text_chat_completion_with_tool_calling_and_non_streaming | ✅ |
| Llama-3.1-8B-Instruct | /completion | streaming | test_text_completion_streaming | ✅ |
| Llama-3.1-8B-Instruct | /completion | non_streaming | test_text_completion_non_streaming | ✅ |
| Llama-3.1-8B-Instruct | /completion | structured_output | test_text_completion_structured_output | ✅ |

## Vector IO
| API | Capability | Test | Status |
|:-----|:-----|:-----|:-----|
| /retrieve | | test_vector_db_retrieve | ✅ |

## Agents
| API | Capability | Test | Status |
|:-----|:-----|:-----|:-----|
| /create_agent_turn | rag | test_rag_agent | ✅ |
| /create_agent_turn | custom_tool | test_custom_tool | ✅ |
| /create_agent_turn | code_execution | test_code_interpreter_for_attachments | ✅ |
llama_stack/templates/podman-ai-lab/run-with-safety.yaml (new file, 133 lines)

@@ -0,0 +1,133 @@
version: '2'
image_name: podman-ai-lab
apis:
- agents
- datasetio
- eval
- inference
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
providers:
  inference:
  - provider_id: podman-ai-lab
    provider_type: remote::podman-ai-lab
    config:
      url: ${env.PODMAN_AI_LAB_URL:http://localhost:10434}
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/faiss_store.db
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config: {}
  - provider_id: code-scanner
    provider_type: inline::code-scanner
    config: {}
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/agents_store.db
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      sinks: ${env.TELEMETRY_SINKS:console,sqlite}
      sqlite_db_path: ${env.SQLITE_DB_PATH:~/.llama/distributions/podman-ai-lab/trace_store.db}
  eval:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/meta_reference_eval.db
  datasetio:
  - provider_id: huggingface
    provider_type: remote::huggingface
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/huggingface_datasetio.db
  - provider_id: localfs
    provider_type: inline::localfs
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/localfs_datasetio.db
  scoring:
  - provider_id: basic
    provider_type: inline::basic
    config: {}
  - provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
    config: {}
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
      openai_api_key: ${env.OPENAI_API_KEY:}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
      api_key: ${env.BRAVE_SEARCH_API_KEY:}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
      api_key: ${env.TAVILY_SEARCH_API_KEY:}
      max_results: 3
  - provider_id: code-interpreter
    provider_type: inline::code-interpreter
    config: {}
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
  - provider_id: model-context-protocol
    provider_type: remote::model-context-protocol
    config: {}
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
      api_key: ${env.WOLFRAM_ALPHA_API_KEY:}
metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/registry.db
models:
- metadata: {}
  model_id: ${env.SAFETY_MODEL}
  provider_id: podman-ai-lab  # matches the distribution's inference provider
  model_type: llm
shields:
- shield_id: ${env.SAFETY_MODEL}
  provider_id: llama-guard
- shield_id: CodeScanner
  provider_id: code-scanner
vector_dbs: []
datasets: []
scoring_fns: []
benchmarks: []
tool_groups:
- toolgroup_id: builtin::websearch
  provider_id: tavily-search
- toolgroup_id: builtin::rag
  provider_id: rag-runtime
- toolgroup_id: builtin::code_interpreter
  provider_id: code-interpreter
- toolgroup_id: builtin::wolfram_alpha
  provider_id: wolfram-alpha
server:
  port: 8321
llama_stack/templates/podman-ai-lab/run.yaml (new file, 123 lines)

@@ -0,0 +1,123 @@
version: '2'
image_name: podman-ai-lab
apis:
- agents
- datasetio
- eval
- inference
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
providers:
  inference:
  - provider_id: podman-ai-lab
    provider_type: remote::podman-ai-lab
    config:
      url: ${env.PODMAN_AI_LAB_URL:http://localhost:10434}
  vector_io:
  - provider_id: faiss
    provider_type: inline::faiss
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/faiss_store.db
  safety:
  - provider_id: llama-guard
    provider_type: inline::llama-guard
    config:
      excluded_categories: []
  agents:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      persistence_store:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/agents_store.db
  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      sinks: ${env.TELEMETRY_SINKS:console,sqlite}
      sqlite_db_path: ${env.SQLITE_DB_PATH:~/.llama/distributions/podman-ai-lab/trace_store.db}
  eval:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/meta_reference_eval.db
  datasetio:
  - provider_id: huggingface
    provider_type: remote::huggingface
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/huggingface_datasetio.db
  - provider_id: localfs
    provider_type: inline::localfs
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/localfs_datasetio.db
  scoring:
  - provider_id: basic
    provider_type: inline::basic
    config: {}
  - provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
    config: {}
  - provider_id: braintrust
    provider_type: inline::braintrust
    config:
      openai_api_key: ${env.OPENAI_API_KEY:}
  tool_runtime:
  - provider_id: brave-search
    provider_type: remote::brave-search
    config:
      api_key: ${env.BRAVE_SEARCH_API_KEY:}
      max_results: 3
  - provider_id: tavily-search
    provider_type: remote::tavily-search
    config:
      api_key: ${env.TAVILY_SEARCH_API_KEY:}
      max_results: 3
  - provider_id: code-interpreter
    provider_type: inline::code-interpreter
    config: {}
  - provider_id: rag-runtime
    provider_type: inline::rag-runtime
    config: {}
  - provider_id: model-context-protocol
    provider_type: remote::model-context-protocol
    config: {}
  - provider_id: wolfram-alpha
    provider_type: remote::wolfram-alpha
    config:
      api_key: ${env.WOLFRAM_ALPHA_API_KEY:}
metadata_store:
  type: sqlite
  db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/podman-ai-lab}/registry.db
models: []
shields: []
vector_dbs: []
datasets: []
scoring_fns: []
benchmarks: []
tool_groups:
- toolgroup_id: builtin::websearch
  provider_id: tavily-search
- toolgroup_id: builtin::rag
  provider_id: rag-runtime
- toolgroup_id: builtin::code_interpreter
  provider_id: code-interpreter
- toolgroup_id: builtin::wolfram_alpha
  provider_id: wolfram-alpha
server:
  port: 8321

@@ -259,6 +259,7 @@ exclude = [
    "^llama_stack/providers/remote/inference/nvidia/",
    "^llama_stack/providers/remote/inference/openai/",
    "^llama_stack/providers/remote/inference/passthrough/",
    "^llama_stack/providers/remote/inference/podman_ai_lab/",
    "^llama_stack/providers/remote/inference/runpod/",
    "^llama_stack/providers/remote/inference/sambanova/",
    "^llama_stack/providers/remote/inference/sample/",