Remove "routing_table" and "routing_key" concepts for the user (#201)
This PR makes several core changes to the developer experience surrounding Llama Stack.

**Background:** PR #92 introduced the notion of "routing" to the Llama Stack. It introduced three object types: (1) models, (2) shields and (3) memory banks. Each of these objects can be associated with a distinct provider, so you can have model A inferenced locally while models B and C are inferenced remotely, for example. However, this had a few drawbacks:

- You could not address the provider instances: if you configured "meta-reference" with a given model, you could not assign an identifier to this instance which you could re-use later.
- As a result, you could not register a "routing_key" (e.g. a model) dynamically and say "please use this existing provider I have already configured" for a new model.
- The terms "routing_table" and "routing_key" were exposed directly to the user. In my view, this is way too much overhead for a new user (which almost everyone is): people come to the stack wanting to do ML and encounter a completely unexpected term.

**What this PR does:** It structures the run config with only a single prominent key:

- `providers`

Providers are instances of configured provider types. Here's an example which shows two instances of the `remote::tgi` provider, each serving a different model:

```yaml
providers:
  inference:
    - provider_id: foo
      provider_type: remote::tgi
      config: { ... }
    - provider_id: bar
      provider_type: remote::tgi
      config: { ... }
```

Secondly, the PR adds dynamic registration of `{ models | shields | memory_banks }` to the API surface. The distribution still acts like a "routing table" (as previously), except that it asks the backing providers for a listing of these objects. For example, it asks a TGI or Ollama inference adapter which models it is serving. Only the models that are actually being served can be requested by the user for inference; otherwise, the Stack server will throw an error. When dynamically registering these objects, you can use the provider IDs shown above. Info about providers can be obtained using the `Api.inspect` set of endpoints (`/providers`, `/routes`, etc.)

The above example shows the correspondence between inference providers and models registry items. Things work similarly for the safety <=> shields and memory <=> memory_banks pairs.

**Registry:** This PR also requires providers to implement additional methods for registering and listing objects. For example, each inference provider is now expected to implement the `ModelsProtocolPrivate` protocol (the naming is not great!), which consists of two methods:

- `register_model`
- `list_models`

The goal is to inform the provider that a certain model needs to be supported, so the provider can make any relevant backend changes if needed (or throw an error if the model cannot be supported).

There are many other cleanups included, some of which are detailed in a follow-up comment.
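For illustration, here is a minimal sketch of the new registration flow from the client side, using the `ModelsClient` and `ModelDefWithProvider` types introduced in the diff below. The server address, import paths, and field values are assumptions for the example; `provider_id="bar"` refers to the second `remote::tgi` instance configured above.

```python
import asyncio

# The client and datatypes appear in the diff below; import paths are assumed.
from llama_stack.apis.models.client import ModelsClient
from llama_stack.apis.models.models import ModelDefWithProvider


async def main() -> None:
    client = ModelsClient("http://localhost:5000")  # assumed server address

    # Register a new model against an already-configured provider instance.
    model = ModelDefWithProvider(
        identifier="my-served-model",        # illustrative unique name
        llama_model="Llama3.1-8B-Instruct",  # underlying core Llama model
        provider_id="bar",                   # reuse the second remote::tgi instance
    )
    await client.register_model(model)

    # Only models that are actually being served come back from the listing.
    for m in await client.list_models():
        print(m.identifier, "->", m.provider_id)


asyncio.run(main())
```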
parent 8c3010553f
commit 6bb57e72a7
93 changed files with 4697 additions and 4457 deletions
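To make the "Registry" section concrete, here is a rough sketch of the provider-side contract. `ModelsProtocolPrivate` itself is not part of this excerpt, so the protocol shape and the adapter below are assumptions based on the PR description rather than the actual definition.

```python
from typing import Dict, List, Protocol

from llama_stack.apis.models import ModelDef  # defined in the diff below


class ModelsProtocolPrivate(Protocol):
    # Assumed shape: the PR text names only these two methods.
    async def register_model(self, model: ModelDef) -> None: ...

    async def list_models(self) -> List[ModelDef]: ...


class ExampleInferenceProvider:
    """Hypothetical inference provider honoring the private registry protocol."""

    def __init__(self, supported_llama_models: List[str]) -> None:
        self.supported = supported_llama_models
        self.models: Dict[str, ModelDef] = {}

    async def register_model(self, model: ModelDef) -> None:
        # Make any backend changes needed for this model, or refuse it outright.
        if model.llama_model not in self.supported:
            raise ValueError(f"Model {model.identifier} cannot be supported")
        self.models[model.identifier] = model

    async def list_models(self) -> List[ModelDef]:
        return list(self.models.values())
```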
@@ -6,7 +6,16 @@
 
 from datetime import datetime
 from enum import Enum
-from typing import Any, Dict, List, Literal, Optional, Protocol, Union
+from typing import (
+    Any,
+    Dict,
+    List,
+    Literal,
+    Optional,
+    Protocol,
+    runtime_checkable,
+    Union,
+)
 
 from llama_models.schema_utils import json_schema_type, webmethod
 
@@ -261,7 +270,7 @@ class Session(BaseModel):
     turns: List[Turn]
     started_at: datetime
 
-    memory_bank: Optional[MemoryBank] = None
+    memory_bank: Optional[MemoryBankDef] = None
 
 
 class AgentConfigCommon(BaseModel):
@@ -404,6 +413,7 @@ class AgentStepResponse(BaseModel):
     step: Step
 
 
+@runtime_checkable
 class Agents(Protocol):
     @webmethod(route="/agents/create")
     async def create_agent(
@@ -411,8 +421,10 @@ class Agents(Protocol):
         agent_config: AgentConfig,
     ) -> AgentCreateResponse: ...
 
+    # This method is not `async def` because it can result in either an
+    # `AsyncGenerator` or a `AgentTurnCreateResponse` depending on the value of `stream`.
     @webmethod(route="/agents/turn/create")
-    async def create_agent_turn(
+    def create_agent_turn(
         self,
         agent_id: str,
         session_id: str,

@@ -7,7 +7,7 @@
 import asyncio
 import json
 import os
-from typing import AsyncGenerator
+from typing import AsyncGenerator, Optional
 
 import fire
 import httpx
@@ -67,9 +67,17 @@ class AgentsClient(Agents):
         response.raise_for_status()
         return AgentSessionCreateResponse(**response.json())
 
-    async def create_agent_turn(
+    def create_agent_turn(
         self,
         request: AgentTurnCreateRequest,
     ) -> AsyncGenerator:
+        if request.stream:
+            return self._stream_agent_turn(request)
+        else:
+            return self._nonstream_agent_turn(request)
+
+    async def _stream_agent_turn(
+        self, request: AgentTurnCreateRequest
+    ) -> AsyncGenerator:
         async with httpx.AsyncClient() as client:
             async with client.stream(
@@ -93,6 +101,9 @@ class AgentsClient(Agents):
                     print(data)
                     print(f"Error with parsing or validation: {e}")
 
+    async def _nonstream_agent_turn(self, request: AgentTurnCreateRequest):
+        raise NotImplementedError("Non-streaming not implemented yet")
+
 
 async def _run_agent(
     api, model, tool_definitions, tool_prompt_format, user_prompts, attachments=None
@@ -132,8 +143,7 @@ async def _run_agent(
     log.print()
 
 
-async def run_llama_3_1(host: str, port: int):
-    model = "Llama3.1-8B-Instruct"
+async def run_llama_3_1(host: str, port: int, model: str = "Llama3.1-8B-Instruct"):
     api = AgentsClient(f"http://{host}:{port}")
 
     tool_definitions = [
@@ -173,8 +183,7 @@ async def run_llama_3_1(host: str, port: int):
     await _run_agent(api, model, tool_definitions, ToolPromptFormat.json, user_prompts)
 
 
-async def run_llama_3_2_rag(host: str, port: int):
-    model = "Llama3.2-3B-Instruct"
+async def run_llama_3_2_rag(host: str, port: int, model: str = "Llama3.2-3B-Instruct"):
     api = AgentsClient(f"http://{host}:{port}")
 
     urls = [
@@ -215,8 +224,7 @@ async def run_llama_3_2_rag(host: str, port: int):
     )
 
 
-async def run_llama_3_2(host: str, port: int):
-    model = "Llama3.2-3B-Instruct"
+async def run_llama_3_2(host: str, port: int, model: str = "Llama3.2-3B-Instruct"):
     api = AgentsClient(f"http://{host}:{port}")
 
     # zero shot tools for llama3.2 text models
@@ -262,7 +270,7 @@ async def run_llama_3_2(host: str, port: int):
     )
 
 
-def main(host: str, port: int, run_type: str):
+def main(host: str, port: int, run_type: str, model: Optional[str] = None):
     assert run_type in [
         "tools_llama_3_1",
         "tools_llama_3_2",
@@ -274,7 +282,10 @@ def main(host: str, port: int, run_type: str):
         "tools_llama_3_2": run_llama_3_2,
         "rag_llama_3_2": run_llama_3_2_rag,
     }
-    asyncio.run(fn[run_type](host, port))
+    args = [host, port]
+    if model is not None:
+        args.append(model)
+    asyncio.run(fn[run_type](*args))
 
 
 if __name__ == "__main__":

@@ -4,7 +4,7 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from typing import List, Optional, Protocol
+from typing import List, Optional, Protocol, runtime_checkable
 
 from llama_models.schema_utils import json_schema_type, webmethod
 
@@ -47,6 +47,7 @@ class BatchChatCompletionResponse(BaseModel):
     completion_message_batch: List[CompletionMessage]
 
 
+@runtime_checkable
 class BatchInference(Protocol):
     @webmethod(route="/batch_inference/completion")
     async def batch_completion(

@@ -42,10 +42,10 @@ class InferenceClient(Inference):
     async def shutdown(self) -> None:
         pass
 
-    async def completion(self, request: CompletionRequest) -> AsyncGenerator:
+    def completion(self, request: CompletionRequest) -> AsyncGenerator:
         raise NotImplementedError()
 
-    async def chat_completion(
+    def chat_completion(
         self,
         model: str,
         messages: List[Message],
@@ -66,6 +66,29 @@ class InferenceClient(Inference):
             stream=stream,
             logprobs=logprobs,
         )
+        if stream:
+            return self._stream_chat_completion(request)
+        else:
+            return self._nonstream_chat_completion(request)
+
+    async def _nonstream_chat_completion(
+        self, request: ChatCompletionRequest
+    ) -> ChatCompletionResponse:
+        async with httpx.AsyncClient() as client:
+            response = await client.post(
+                f"{self.base_url}/inference/chat_completion",
+                json=encodable_dict(request),
+                headers={"Content-Type": "application/json"},
+                timeout=20,
+            )
+
+            response.raise_for_status()
+            j = response.json()
+            return ChatCompletionResponse(**j)
+
+    async def _stream_chat_completion(
+        self, request: ChatCompletionRequest
+    ) -> AsyncGenerator:
         async with httpx.AsyncClient() as client:
             async with client.stream(
                 "POST",
@@ -77,7 +100,8 @@ class InferenceClient(Inference):
                 if response.status_code != 200:
                     content = await response.aread()
                     cprint(
-                        f"Error: HTTP {response.status_code} {content.decode()}", "red"
+                        f"Error: HTTP {response.status_code} {content.decode()}",
+                        "red",
                     )
                     return
 
@@ -85,16 +109,11 @@ class InferenceClient(Inference):
                 if line.startswith("data:"):
                     data = line[len("data: ") :]
                     try:
-                        if request.stream:
-                            if "error" in data:
-                                cprint(data, "red")
-                                continue
+                        if "error" in data:
+                            cprint(data, "red")
+                            continue
 
-                            yield ChatCompletionResponseStreamChunk(
-                                **json.loads(data)
-                            )
-                        else:
-                            yield ChatCompletionResponse(**json.loads(data))
+                        yield ChatCompletionResponseStreamChunk(**json.loads(data))
                     except Exception as e:
                         print(data)
                         print(f"Error with parsing or validation: {e}")

@@ -6,7 +6,7 @@
 
 from enum import Enum
 
-from typing import List, Literal, Optional, Protocol, Union
+from typing import List, Literal, Optional, Protocol, runtime_checkable, Union
 
 from llama_models.schema_utils import json_schema_type, webmethod
 
@@ -14,6 +14,7 @@ from pydantic import BaseModel, Field
 from typing_extensions import Annotated
 
 from llama_models.llama3.api.datatypes import *  # noqa: F403
+from llama_stack.apis.models import *  # noqa: F403
 
 
 class LogProbConfig(BaseModel):
@@ -172,9 +173,18 @@ class EmbeddingsResponse(BaseModel):
     embeddings: List[List[float]]
 
 
+class ModelStore(Protocol):
+    def get_model(self, identifier: str) -> ModelDef: ...
+
+
+@runtime_checkable
 class Inference(Protocol):
+    model_store: ModelStore
+
+    # This method is not `async def` because it can result in either an
+    # `AsyncGenerator` or a `CompletionResponse` depending on the value of `stream`.
     @webmethod(route="/inference/completion")
-    async def completion(
+    def completion(
         self,
         model: str,
         content: InterleavedTextMedia,
@@ -183,8 +193,10 @@ class Inference(Protocol):
         logprobs: Optional[LogProbConfig] = None,
     ) -> Union[CompletionResponse, CompletionResponseStreamChunk]: ...
 
+    # This method is not `async def` because it can result in either an
+    # `AsyncGenerator` or a `ChatCompletionResponse` depending on the value of `stream`.
     @webmethod(route="/inference/chat_completion")
-    async def chat_completion(
+    def chat_completion(
         self,
         model: str,
         messages: List[Message],

@@ -4,7 +4,7 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from typing import Dict, List, Protocol
+from typing import Dict, List, Protocol, runtime_checkable
 
 from llama_models.schema_utils import json_schema_type, webmethod
 from pydantic import BaseModel
@@ -12,15 +12,15 @@ from pydantic import BaseModel
 
 @json_schema_type
 class ProviderInfo(BaseModel):
-    provider_id: str
+    provider_type: str
     description: str
 
 
 @json_schema_type
 class RouteInfo(BaseModel):
     route: str
     method: str
-    providers: List[str]
+    provider_types: List[str]
 
 
 @json_schema_type
@@ -29,6 +29,7 @@ class HealthInfo(BaseModel):
     # TODO: add a provider level status
 
 
+@runtime_checkable
 class Inspect(Protocol):
     @webmethod(route="/providers/list", method="GET")
     async def list_providers(self) -> Dict[str, ProviderInfo]: ...

@@ -5,7 +5,6 @@
 # the root directory of this source tree.
 
 import asyncio
 import json
 import os
-
 from pathlib import Path
@@ -13,11 +12,11 @@ from typing import Any, Dict, List, Optional
 
 import fire
 import httpx
 from termcolor import cprint
 
-from llama_stack.distribution.datatypes import RemoteProviderConfig
 
 from llama_stack.apis.memory import *  # noqa: F403
+from llama_stack.apis.memory_banks.client import MemoryBanksClient
 from llama_stack.providers.utils.memory.file_utils import data_url_from_file
 
 
@@ -35,45 +34,6 @@ class MemoryClient(Memory):
     async def shutdown(self) -> None:
         pass
 
-    async def get_memory_bank(self, bank_id: str) -> Optional[MemoryBank]:
-        async with httpx.AsyncClient() as client:
-            r = await client.get(
-                f"{self.base_url}/memory/get",
-                params={
-                    "bank_id": bank_id,
-                },
-                headers={"Content-Type": "application/json"},
-                timeout=20,
-            )
-            r.raise_for_status()
-            d = r.json()
-            if not d:
-                return None
-            return MemoryBank(**d)
-
-    async def create_memory_bank(
-        self,
-        name: str,
-        config: MemoryBankConfig,
-        url: Optional[URL] = None,
-    ) -> MemoryBank:
-        async with httpx.AsyncClient() as client:
-            r = await client.post(
-                f"{self.base_url}/memory/create",
-                json={
-                    "name": name,
-                    "config": config.dict(),
-                    "url": url,
-                },
-                headers={"Content-Type": "application/json"},
-                timeout=20,
-            )
-            r.raise_for_status()
-            d = r.json()
-            if not d:
-                return None
-            return MemoryBank(**d)
-
     async def insert_documents(
         self,
         bank_id: str,
@@ -113,23 +73,20 @@ class MemoryClient(Memory):
 
 
 async def run_main(host: str, port: int, stream: bool):
-    client = MemoryClient(f"http://{host}:{port}")
-
-    # create a memory bank
-    bank = await client.create_memory_bank(
-        name="test_bank",
-        config=VectorMemoryBankConfig(
-            bank_id="test_bank",
-            embedding_model="all-MiniLM-L6-v2",
-            chunk_size_in_tokens=512,
-            overlap_size_in_tokens=64,
-        ),
+    banks_client = MemoryBanksClient(f"http://{host}:{port}")
+    bank = VectorMemoryBankDef(
+        identifier="test_bank",
+        provider_id="",
+        embedding_model="all-MiniLM-L6-v2",
+        chunk_size_in_tokens=512,
+        overlap_size_in_tokens=64,
     )
     cprint(json.dumps(bank.dict(), indent=4), "green")
+    await banks_client.register_memory_bank(bank)
 
-    retrieved_bank = await client.get_memory_bank(bank.bank_id)
+    retrieved_bank = await banks_client.get_memory_bank(bank.identifier)
     assert retrieved_bank is not None
-    assert retrieved_bank.config.embedding_model == "all-MiniLM-L6-v2"
+    assert retrieved_bank.embedding_model == "all-MiniLM-L6-v2"
 
     urls = [
         "memory_optimizations.rst",
@@ -160,15 +117,17 @@ async def run_main(host: str, port: int, stream: bool):
         for i, path in enumerate(files)
     ]
 
+    client = MemoryClient(f"http://{host}:{port}")
+
     # insert some documents
     await client.insert_documents(
-        bank_id=bank.bank_id,
+        bank_id=bank.identifier,
         documents=documents,
     )
 
     # query the documents
     response = await client.query_documents(
-        bank_id=bank.bank_id,
+        bank_id=bank.identifier,
         query=[
             "How do I use Lora?",
         ],
@@ -178,7 +137,7 @@ async def run_main(host: str, port: int, stream: bool):
         print(f"Chunk:\n========\n{chunk}\n========\n")
 
     response = await client.query_documents(
-        bank_id=bank.bank_id,
+        bank_id=bank.identifier,
         query=[
             "Tell me more about llama3 and torchtune",
         ],

@@ -8,14 +8,14 @@
 #
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
-from typing import List, Optional, Protocol
+from typing import List, Optional, Protocol, runtime_checkable
 
 from llama_models.schema_utils import json_schema_type, webmethod
 
 from pydantic import BaseModel, Field
 from typing_extensions import Annotated
 
 from llama_models.llama3.api.datatypes import *  # noqa: F403
-
+from llama_stack.apis.memory_banks import *  # noqa: F403
 
 @json_schema_type
@@ -26,44 +26,6 @@ class MemoryBankDocument(BaseModel):
     metadata: Dict[str, Any] = Field(default_factory=dict)
 
 
-@json_schema_type
-class MemoryBankType(Enum):
-    vector = "vector"
-    keyvalue = "keyvalue"
-    keyword = "keyword"
-    graph = "graph"
-
-
-class VectorMemoryBankConfig(BaseModel):
-    type: Literal[MemoryBankType.vector.value] = MemoryBankType.vector.value
-    embedding_model: str
-    chunk_size_in_tokens: int
-    overlap_size_in_tokens: Optional[int] = None
-
-
-class KeyValueMemoryBankConfig(BaseModel):
-    type: Literal[MemoryBankType.keyvalue.value] = MemoryBankType.keyvalue.value
-
-
-class KeywordMemoryBankConfig(BaseModel):
-    type: Literal[MemoryBankType.keyword.value] = MemoryBankType.keyword.value
-
-
-class GraphMemoryBankConfig(BaseModel):
-    type: Literal[MemoryBankType.graph.value] = MemoryBankType.graph.value
-
-
-MemoryBankConfig = Annotated[
-    Union[
-        VectorMemoryBankConfig,
-        KeyValueMemoryBankConfig,
-        KeywordMemoryBankConfig,
-        GraphMemoryBankConfig,
-    ],
-    Field(discriminator="type"),
-]
-
-
 class Chunk(BaseModel):
     content: InterleavedTextMedia
     token_count: int
@@ -76,45 +38,13 @@ class QueryDocumentsResponse(BaseModel):
     scores: List[float]
 
 
-@json_schema_type
-class QueryAPI(Protocol):
-    @webmethod(route="/query_documents")
-    def query_documents(
-        self,
-        query: InterleavedTextMedia,
-        params: Optional[Dict[str, Any]] = None,
-    ) -> QueryDocumentsResponse: ...
-
-
-@json_schema_type
-class MemoryBank(BaseModel):
-    bank_id: str
-    name: str
-    config: MemoryBankConfig
-    # if there's a pre-existing (reachable-from-distribution) store which supports QueryAPI
-    url: Optional[URL] = None
+class MemoryBankStore(Protocol):
+    def get_memory_bank(self, bank_id: str) -> Optional[MemoryBankDef]: ...
 
 
+@runtime_checkable
 class Memory(Protocol):
-    @webmethod(route="/memory/create")
-    async def create_memory_bank(
-        self,
-        name: str,
-        config: MemoryBankConfig,
-        url: Optional[URL] = None,
-    ) -> MemoryBank: ...
-
-    @webmethod(route="/memory/list", method="GET")
-    async def list_memory_banks(self) -> List[MemoryBank]: ...
-
-    @webmethod(route="/memory/get", method="GET")
-    async def get_memory_bank(self, bank_id: str) -> Optional[MemoryBank]: ...
-
-    @webmethod(route="/memory/drop", method="DELETE")
-    async def drop_memory_bank(
-        self,
-        bank_id: str,
-    ) -> str: ...
+    memory_bank_store: MemoryBankStore
 
     # this will just block now until documents are inserted, but it should
     # probably return a Job instance which can be polled for completion
@@ -126,13 +56,6 @@ class Memory(Protocol):
         ttl_seconds: Optional[int] = None,
     ) -> None: ...
 
-    @webmethod(route="/memory/update")
-    async def update_documents(
-        self,
-        bank_id: str,
-        documents: List[MemoryBankDocument],
-    ) -> None: ...
-
     @webmethod(route="/memory/query")
     async def query_documents(
         self,
@@ -140,17 +63,3 @@ class Memory(Protocol):
         query: InterleavedTextMedia,
         params: Optional[Dict[str, Any]] = None,
     ) -> QueryDocumentsResponse: ...
-
-    @webmethod(route="/memory/documents/get", method="GET")
-    async def get_documents(
-        self,
-        bank_id: str,
-        document_ids: List[str],
-    ) -> List[MemoryBankDocument]: ...
-
-    @webmethod(route="/memory/documents/delete", method="DELETE")
-    async def delete_documents(
-        self,
-        bank_id: str,
-        document_ids: List[str],
-    ) -> None: ...

@@ -5,8 +5,9 @@
 # the root directory of this source tree.
 
 import asyncio
+import json
 
-from typing import List, Optional
+from typing import Any, Dict, List, Optional
 
 import fire
 import httpx
@@ -15,6 +16,27 @@ from termcolor import cprint
 from .memory_banks import *  # noqa: F403
 
 
+def deserialize_memory_bank_def(
+    j: Optional[Dict[str, Any]]
+) -> MemoryBankDefWithProvider:
+    if j is None:
+        return None
+
+    if "type" not in j:
+        raise ValueError("Memory bank type not specified")
+    type = j["type"]
+    if type == MemoryBankType.vector.value:
+        return VectorMemoryBankDef(**j)
+    elif type == MemoryBankType.keyvalue.value:
+        return KeyValueMemoryBankDef(**j)
+    elif type == MemoryBankType.keyword.value:
+        return KeywordMemoryBankDef(**j)
+    elif type == MemoryBankType.graph.value:
+        return GraphMemoryBankDef(**j)
+    else:
+        raise ValueError(f"Unknown memory bank type: {type}")
+
+
 class MemoryBanksClient(MemoryBanks):
     def __init__(self, base_url: str):
         self.base_url = base_url
@@ -25,37 +47,49 @@ class MemoryBanksClient(MemoryBanks):
     async def shutdown(self) -> None:
         pass
 
-    async def list_available_memory_banks(self) -> List[MemoryBankSpec]:
+    async def list_memory_banks(self) -> List[MemoryBankDefWithProvider]:
         async with httpx.AsyncClient() as client:
             response = await client.get(
                 f"{self.base_url}/memory_banks/list",
                 headers={"Content-Type": "application/json"},
             )
             response.raise_for_status()
-            return [MemoryBankSpec(**x) for x in response.json()]
+            return [deserialize_memory_bank_def(x) for x in response.json()]
 
-    async def get_serving_memory_bank(
-        self, bank_type: MemoryBankType
-    ) -> Optional[MemoryBankSpec]:
+    async def register_memory_bank(
+        self, memory_bank: MemoryBankDefWithProvider
+    ) -> None:
+        async with httpx.AsyncClient() as client:
+            response = await client.post(
+                f"{self.base_url}/memory_banks/register",
+                json={
+                    "memory_bank": json.loads(memory_bank.json()),
+                },
+                headers={"Content-Type": "application/json"},
+            )
+            response.raise_for_status()
+
+    async def get_memory_bank(
+        self,
+        identifier: str,
+    ) -> Optional[MemoryBankDefWithProvider]:
         async with httpx.AsyncClient() as client:
             response = await client.get(
                 f"{self.base_url}/memory_banks/get",
                 params={
-                    "bank_type": bank_type.value,
+                    "identifier": identifier,
                 },
                 headers={"Content-Type": "application/json"},
             )
             response.raise_for_status()
             j = response.json()
             if j is None:
                 return None
-            return MemoryBankSpec(**j)
+            return deserialize_memory_bank_def(j)
 
 
 async def run_main(host: str, port: int, stream: bool):
     client = MemoryBanksClient(f"http://{host}:{port}")
 
-    response = await client.list_available_memory_banks()
+    response = await client.list_memory_banks()
     cprint(f"list_memory_banks response={response}", "green")

@@ -4,29 +4,75 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from typing import List, Optional, Protocol
+from enum import Enum
+from typing import List, Literal, Optional, Protocol, runtime_checkable, Union
 
 from llama_models.schema_utils import json_schema_type, webmethod
 from pydantic import BaseModel, Field
-
-from llama_stack.apis.memory import MemoryBankType
-
-from llama_stack.distribution.datatypes import GenericProviderConfig
+from typing_extensions import Annotated
 
 
 @json_schema_type
-class MemoryBankSpec(BaseModel):
-    bank_type: MemoryBankType
-    provider_config: GenericProviderConfig = Field(
-        description="Provider config for the model, including provider_type, and corresponding config. ",
-    )
+class MemoryBankType(Enum):
+    vector = "vector"
+    keyvalue = "keyvalue"
+    keyword = "keyword"
+    graph = "graph"
+
+
+class CommonDef(BaseModel):
+    identifier: str
+    # Hack: move this out later
+    provider_id: str = ""
+
+
+@json_schema_type
+class VectorMemoryBankDef(CommonDef):
+    type: Literal[MemoryBankType.vector.value] = MemoryBankType.vector.value
+    embedding_model: str
+    chunk_size_in_tokens: int
+    overlap_size_in_tokens: Optional[int] = None
+
+
+@json_schema_type
+class KeyValueMemoryBankDef(CommonDef):
+    type: Literal[MemoryBankType.keyvalue.value] = MemoryBankType.keyvalue.value
+
+
+@json_schema_type
+class KeywordMemoryBankDef(CommonDef):
+    type: Literal[MemoryBankType.keyword.value] = MemoryBankType.keyword.value
+
+
+@json_schema_type
+class GraphMemoryBankDef(CommonDef):
+    type: Literal[MemoryBankType.graph.value] = MemoryBankType.graph.value
+
+
+MemoryBankDef = Annotated[
+    Union[
+        VectorMemoryBankDef,
+        KeyValueMemoryBankDef,
+        KeywordMemoryBankDef,
+        GraphMemoryBankDef,
+    ],
+    Field(discriminator="type"),
+]
+
+MemoryBankDefWithProvider = MemoryBankDef
 
 
+@runtime_checkable
 class MemoryBanks(Protocol):
     @webmethod(route="/memory_banks/list", method="GET")
-    async def list_available_memory_banks(self) -> List[MemoryBankSpec]: ...
+    async def list_memory_banks(self) -> List[MemoryBankDefWithProvider]: ...
 
     @webmethod(route="/memory_banks/get", method="GET")
-    async def get_serving_memory_bank(
-        self, bank_type: MemoryBankType
-    ) -> Optional[MemoryBankSpec]: ...
+    async def get_memory_bank(
+        self, identifier: str
+    ) -> Optional[MemoryBankDefWithProvider]: ...
+
+    @webmethod(route="/memory_banks/register", method="POST")
+    async def register_memory_bank(
+        self, memory_bank: MemoryBankDefWithProvider
+    ) -> None: ...

@@ -5,6 +5,7 @@
 # the root directory of this source tree.
 
 import asyncio
+import json
 
 from typing import List, Optional
 
@@ -25,21 +26,32 @@ class ModelsClient(Models):
     async def shutdown(self) -> None:
         pass
 
-    async def list_models(self) -> List[ModelServingSpec]:
+    async def list_models(self) -> List[ModelDefWithProvider]:
         async with httpx.AsyncClient() as client:
             response = await client.get(
                 f"{self.base_url}/models/list",
                 headers={"Content-Type": "application/json"},
             )
             response.raise_for_status()
-            return [ModelServingSpec(**x) for x in response.json()]
+            return [ModelDefWithProvider(**x) for x in response.json()]
 
-    async def get_model(self, core_model_id: str) -> Optional[ModelServingSpec]:
+    async def register_model(self, model: ModelDefWithProvider) -> None:
+        async with httpx.AsyncClient() as client:
+            response = await client.post(
+                f"{self.base_url}/models/register",
+                json={
+                    "model": json.loads(model.json()),
+                },
+                headers={"Content-Type": "application/json"},
+            )
+            response.raise_for_status()
+
+    async def get_model(self, identifier: str) -> Optional[ModelDefWithProvider]:
         async with httpx.AsyncClient() as client:
             response = await client.get(
                 f"{self.base_url}/models/get",
                 params={
-                    "core_model_id": core_model_id,
+                    "identifier": identifier,
                 },
                 headers={"Content-Type": "application/json"},
             )
@@ -47,7 +59,7 @@ class ModelsClient(Models):
             j = response.json()
             if j is None:
                 return None
-            return ModelServingSpec(**j)
+            return ModelDefWithProvider(**j)
 
 
 async def run_main(host: str, port: int, stream: bool):

@@ -4,29 +4,39 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from typing import List, Optional, Protocol
-
-from llama_models.llama3.api.datatypes import Model
+from typing import Any, Dict, List, Optional, Protocol, runtime_checkable
 
 from llama_models.schema_utils import json_schema_type, webmethod
 from pydantic import BaseModel, Field
 
-from llama_stack.distribution.datatypes import GenericProviderConfig
+
+class ModelDef(BaseModel):
+    identifier: str = Field(
+        description="A unique name for the model type",
+    )
+    llama_model: str = Field(
+        description="Pointer to the underlying core Llama family model. Each model served by Llama Stack must have a core Llama model.",
+    )
+    metadata: Dict[str, Any] = Field(
+        default_factory=dict,
+        description="Any additional metadata for this model",
+    )
 
 
 @json_schema_type
-class ModelServingSpec(BaseModel):
-    llama_model: Model = Field(
-        description="All metadatas associated with llama model (defined in llama_models.models.sku_list).",
-    )
-    provider_config: GenericProviderConfig = Field(
-        description="Provider config for the model, including provider_type, and corresponding config. ",
-    )
+class ModelDefWithProvider(ModelDef):
+    provider_id: str = Field(
+        description="The provider ID for this model",
+    )
 
 
+@runtime_checkable
 class Models(Protocol):
     @webmethod(route="/models/list", method="GET")
-    async def list_models(self) -> List[ModelServingSpec]: ...
+    async def list_models(self) -> List[ModelDefWithProvider]: ...
 
     @webmethod(route="/models/get", method="GET")
-    async def get_model(self, core_model_id: str) -> Optional[ModelServingSpec]: ...
+    async def get_model(self, identifier: str) -> Optional[ModelDefWithProvider]: ...
+
+    @webmethod(route="/models/register", method="POST")
+    async def register_model(self, model: ModelDefWithProvider) -> None: ...

@@ -96,12 +96,6 @@ async def run_main(host: str, port: int, image_path: str = None):
     )
     print(response)
 
-    response = await client.run_shield(
-        shield_type="injection_shield",
-        messages=[message],
-    )
-    print(response)
-
 
 def main(host: str, port: int, image: str = None):
     asyncio.run(run_main(host, port, image))

@@ -5,12 +5,13 @@
 # the root directory of this source tree.
 
 from enum import Enum
-from typing import Any, Dict, List, Protocol
+from typing import Any, Dict, List, Protocol, runtime_checkable
 
 from llama_models.schema_utils import json_schema_type, webmethod
 from pydantic import BaseModel
 
 from llama_models.llama3.api.datatypes import *  # noqa: F403
+from llama_stack.apis.shields import *  # noqa: F403
 
 
 @json_schema_type
@@ -37,7 +38,14 @@ class RunShieldResponse(BaseModel):
     violation: Optional[SafetyViolation] = None
 
 
+class ShieldStore(Protocol):
+    def get_shield(self, identifier: str) -> ShieldDef: ...
+
+
+@runtime_checkable
 class Safety(Protocol):
+    shield_store: ShieldStore
+
     @webmethod(route="/safety/run_shield")
     async def run_shield(
         self, shield_type: str, messages: List[Message], params: Dict[str, Any] = None

@@ -5,6 +5,7 @@
 # the root directory of this source tree.
 
 import asyncio
+import json
 
 from typing import List, Optional
 
@@ -25,16 +26,27 @@ class ShieldsClient(Shields):
     async def shutdown(self) -> None:
         pass
 
-    async def list_shields(self) -> List[ShieldSpec]:
+    async def list_shields(self) -> List[ShieldDefWithProvider]:
         async with httpx.AsyncClient() as client:
             response = await client.get(
                 f"{self.base_url}/shields/list",
                 headers={"Content-Type": "application/json"},
             )
             response.raise_for_status()
-            return [ShieldSpec(**x) for x in response.json()]
+            return [ShieldDefWithProvider(**x) for x in response.json()]
 
-    async def get_shield(self, shield_type: str) -> Optional[ShieldSpec]:
+    async def register_shield(self, shield: ShieldDefWithProvider) -> None:
+        async with httpx.AsyncClient() as client:
+            response = await client.post(
+                f"{self.base_url}/shields/register",
+                json={
+                    "shield": json.loads(shield.json()),
+                },
+                headers={"Content-Type": "application/json"},
+            )
+            response.raise_for_status()
+
+    async def get_shield(self, shield_type: str) -> Optional[ShieldDefWithProvider]:
         async with httpx.AsyncClient() as client:
             response = await client.get(
                 f"{self.base_url}/shields/get",
@@ -49,7 +61,7 @@ class ShieldsClient(Shields):
             if j is None:
                 return None
 
-            return ShieldSpec(**j)
+            return ShieldDefWithProvider(**j)
 
 
 async def run_main(host: str, port: int, stream: bool):

@@ -4,25 +4,48 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from typing import List, Optional, Protocol
+from enum import Enum
+from typing import Any, Dict, List, Optional, Protocol, runtime_checkable
 
 from llama_models.schema_utils import json_schema_type, webmethod
 from pydantic import BaseModel, Field
 
-from llama_stack.distribution.datatypes import GenericProviderConfig
-
 
 @json_schema_type
-class ShieldSpec(BaseModel):
-    shield_type: str
-    provider_config: GenericProviderConfig = Field(
-        description="Provider config for the model, including provider_type, and corresponding config. ",
-    )
+class ShieldType(Enum):
+    generic_content_shield = "generic_content_shield"
+    llama_guard = "llama_guard"
+    code_scanner = "code_scanner"
+    prompt_guard = "prompt_guard"
+
+
+class ShieldDef(BaseModel):
+    identifier: str = Field(
+        description="A unique identifier for the shield type",
+    )
+    type: str = Field(
+        description="The type of shield this is; the value is one of the ShieldType enum"
+    )
+    params: Dict[str, Any] = Field(
+        default_factory=dict,
+        description="Any additional parameters needed for this shield",
+    )
+
+
+@json_schema_type
+class ShieldDefWithProvider(ShieldDef):
+    provider_id: str = Field(
+        description="The provider ID for this shield type",
+    )
 
 
+@runtime_checkable
 class Shields(Protocol):
     @webmethod(route="/shields/list", method="GET")
-    async def list_shields(self) -> List[ShieldSpec]: ...
+    async def list_shields(self) -> List[ShieldDefWithProvider]: ...
 
     @webmethod(route="/shields/get", method="GET")
-    async def get_shield(self, shield_type: str) -> Optional[ShieldSpec]: ...
+    async def get_shield(self, shield_type: str) -> Optional[ShieldDefWithProvider]: ...
 
+    @webmethod(route="/shields/register", method="POST")
+    async def register_shield(self, shield: ShieldDefWithProvider) -> None: ...

@@ -6,7 +6,7 @@
 
 from datetime import datetime
 from enum import Enum
-from typing import Any, Dict, Literal, Optional, Protocol, Union
+from typing import Any, Dict, Literal, Optional, Protocol, runtime_checkable, Union
 
 from llama_models.schema_utils import json_schema_type, webmethod
 from pydantic import BaseModel, Field
@@ -123,6 +123,7 @@ Event = Annotated[
 ]
 
 
+@runtime_checkable
 class Telemetry(Protocol):
     @webmethod(route="/telemetry/log_event")
     async def log_event(self, event: Event) -> None: ...