Mirror of https://github.com/meta-llama/llama-stack.git
API Updates: fleshing out RAG APIs, introduce "llama stack" CLI command (#51)
* add tools to chat completion request
* use templates for generating system prompts
* Moved ToolPromptFormat and jinja templates to llama_models.llama3.api
* <WIP> memory changes
  - inlined AgenticSystemInstanceConfig so API feels more ergonomic
  - renamed it to AgentConfig, AgentInstance -> Agent
  - added a MemoryConfig and `memory` parameter
  - added `attachments` to input and `output_attachments` to the response
  - some naming changes
* InterleavedTextAttachment -> InterleavedTextMedia, introduce memory tool
* flesh out memory banks API
* agentic loop has a RAG implementation
* faiss provider implementation
* memory client works
* re-work tool definitions, fix FastAPI issues, fix tool regressions
* fix agentic_system utils
* basic RAG seems to work
* small bug fixes for inline attachments
* Refactor custom tool execution utilities
* Bug fix, show memory retrieval steps in EventLogger
* No need for api_key for Remote providers
* add special unicode character ↵ to showcase newlines in model prompt templates
* remove api.endpoints imports
* combine datatypes.py and endpoints.py into api.py
* Attachment / add TTL api
* split batch_inference from inference
* minor import fixes
* use a single impl for ChatFormat.decode_assistant_message
* use interleaved_text_media_as_str() utility
* Fix api.datatypes imports
* Add blobfile for tiktoken
* Add ToolPromptFormat to ChatFormat.encode_message so that tools are encoded properly
* templates take optional --format={json,function_tag}
* Rag Updates
* Add `api build` subcommand -- WIP
* fix
* build + run image seems to work
* <WIP> adapters
* bunch more work to make adapters work
* api build works for conda now
* ollama remote adapter works
* Several smaller fixes to make adapters work; also reorganized the pattern of __init__ inside providers so configuration can stay lightweight
* llama distribution -> llama stack + containers (WIP)
* All the new CLI for api + stack work
* Make Fireworks and Together into the Adapter format
* Some quick fixes to the CLI behavior to make it consistent
* Updated README phew
* Update cli_reference.md
* llama_toolchain/distribution -> llama_toolchain/core
* Add termcolor
* update paths
* Add a log just for consistency
* chmod +x scripts
* Fix api dependencies not getting added to configuration
* missing import lol
* Delete utils.py; move to agentic system
* Support downloading of URLs for attachments for code interpreter
* Simplify and generalize `llama api build` yay
* Update `llama stack configure` to be very simple also
* Fix stack start
* Allow building an "adhoc" distribution
* Remove `llama api []` subcommands
* Fixes to llama stack commands and update docs
* Update documentation again and add error messages to llama stack start
* llama stack start -> llama stack run
* Change name of build for less confusion
* Add pyopenapi fork to the repository, update RFC assets
* Remove conflicting annotation
* Added a "--raw" option for model template printing

---------

Co-authored-by: Hardik Shah <hjshah@fb.com>
Co-authored-by: Ashwin Bharambe <ashwin@meta.com>
Co-authored-by: Dalton Flanagan <6599399+dltn@users.noreply.github.com>
Parent: 35093c0b6f
Commit: 7bc7785b0d
141 changed files with 8252 additions and 4032 deletions
MANIFEST.in

@@ -1,4 +1,4 @@
 include requirements.txt
 include llama_toolchain/data/*.yaml
-include llama_toolchain/distribution/*.sh
+include llama_toolchain/core/*.sh
 include llama_toolchain/cli/scripts/*.sh
docs/cli_reference.md

@@ -5,7 +5,7 @@ The `llama` CLI tool helps you setup and use the Llama toolchain & agentic syste
 ### Subcommands
 1. `download`: `llama` cli tools supports downloading the model from Meta or HuggingFace.
 2. `model`: Lists available models and their properties.
-3. `distribution`: A distribution is a set of REST APIs, this command allows you to manage (list, install, create, configure, start) distributions. You can read more about this [here](https://github.com/meta-llama/llama-stack/blob/main/docs/cli_reference.md#step-3-installing-and-configuring-distributions).
+3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](https://github.com/meta-llama/llama-stack/blob/api_updates_1/docs/cli_reference.md#step-3-building-configuring-and-running-llama-stack-servers).
 
 ### Sample Usage
 
@@ -13,7 +13,7 @@ The `llama` CLI tool helps you setup and use the Llama toolchain & agentic syste
 llama --help
 ```
 <pre style="font-family: monospace;">
-usage: llama [-h] {download,model,distribution} ...
+usage: llama [-h] {download,model,stack,api} ...
 
 Welcome to the Llama CLI
 
@@ -21,7 +21,7 @@ options:
 -h, --help show this help message and exit
 
 subcommands:
-  {download,model,distribution}
+  {download,model,stack,api}
 </pre>
 
 ## Step 1. Get the models
 
@@ -236,11 +236,13 @@ These commands can help understand the model interface and how prompts / message
 **NOTE**: Outputs in terminal are color printed to show special tokens.
 
-## Step 3: Installing and Configuring Distributions
+## Step 3: Building, Configuring and Running Llama Stack servers
 
 An agentic app has several components including model inference, tool execution and system safety shields. Running all these components is made simpler (we hope!) with Llama Stack Distributions.
 
-A Distribution is simply a collection of REST API providers that are part of the Llama stack. As an example, by running a simple command `llama distribution start`, you can bring up a server serving the following endpoints, among others:
+The Llama Stack is a collection of REST APIs. An API is _implemented_ by Provider. An assembly of Providers together provides the implementation for the Stack -- this package is called a Distribution.
+
+As an example, by running a simple command `llama stack run`, you can bring up a server serving the following endpoints, among others:
 ```
 POST /inference/chat_completion
 POST /inference/completion
@@ -253,103 +255,135 @@ POST /agentic_system/delete
 
 The agentic app can now simply point to this server to execute all its needed components.
 
-A distribution’s behavior can be configured by defining a specification or “spec”. This specification lays out the different API “Providers” that constitute this distribution.
-
-Lets install, configure and start a distribution to understand more !
+Lets build, configure and start a Llama Stack server specified via a "Distribution ID" to understand more !
 
-Let’s start with listing available distributions
+Let’s start with listing available distributions:
 ```
-llama distribution list
+llama stack list-distributions
 ```
 
 <pre style="font-family: monospace;">
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
-| Spec ID      | ProviderSpecs                               | Description                                                          |
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
-| local        | {                                           | Use code from `llama_toolchain` itself to serve all llama stack APIs |
-|              |   "inference": "meta-reference",            |                                                                      |
-|              |   "safety": "meta-reference",               |                                                                      |
-|              |   "agentic_system": "meta-reference"        |                                                                      |
-|              | }                                           |                                                                      |
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
-| remote       | {                                           | Point to remote services for all llama stack APIs                    |
-|              |   "inference": "inference-remote",          |                                                                      |
-|              |   "safety": "safety-remote",                |                                                                      |
-|              |   "agentic_system": "agentic_system-remote" |                                                                      |
-|              | }                                           |                                                                      |
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
-| local-ollama | {                                           | Like local, but use ollama for running LLM inference                 |
-|              |   "inference": "meta-ollama",               |                                                                      |
-|              |   "safety": "meta-reference",               |                                                                      |
-|              |   "agentic_system": "meta-reference"        |                                                                      |
-|              | }                                           |                                                                      |
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| Distribution ID                | Providers                             | Description                                                          |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| local                          | {                                     | Use code from `llama_toolchain` itself to serve all llama stack APIs |
+|                                |   "inference": "meta-reference",      |                                                                      |
+|                                |   "memory": "meta-reference-faiss",   |                                                                      |
+|                                |   "safety": "meta-reference",         |                                                                      |
+|                                |   "agentic_system": "meta-reference"  |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| remote                         | {                                     | Point to remote services for all llama stack APIs                    |
+|                                |   "inference": "remote",              |                                                                      |
+|                                |   "safety": "remote",                 |                                                                      |
+|                                |   "agentic_system": "remote",         |                                                                      |
+|                                |   "memory": "remote"                  |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| local-ollama                   | {                                     | Like local, but use ollama for running LLM inference                 |
+|                                |   "inference": "remote::ollama",      |                                                                      |
+|                                |   "safety": "meta-reference",         |                                                                      |
+|                                |   "agentic_system": "meta-reference", |                                                                      |
+|                                |   "memory": "meta-reference-faiss"    |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| local-plus-fireworks-inference | {                                     | Use Fireworks.ai for running LLM inference                           |
+|                                |   "inference": "remote::fireworks",   |                                                                      |
+|                                |   "safety": "meta-reference",         |                                                                      |
+|                                |   "agentic_system": "meta-reference", |                                                                      |
+|                                |   "memory": "meta-reference-faiss"    |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| local-plus-together-inference  | {                                     | Use Together.ai for running LLM inference                            |
+|                                |   "inference": "remote::together",    |                                                                      |
+|                                |   "safety": "meta-reference",         |                                                                      |
+|                                |   "agentic_system": "meta-reference", |                                                                      |
+|                                |   "memory": "meta-reference-faiss"    |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
 </pre>
 
-As you can see above, each “spec” details the “providers” that make up that spec. For eg. The `local` spec uses the “meta-reference” provider for inference while the `local-ollama` spec relies on a different provider ( ollama ) for inference.
+As you can see above, each “distribution” details the “providers” it is composed of. For example, `local` uses the “meta-reference” provider for inference while local-ollama relies on a different provider (Ollama) for inference. Similarly, you can use Fireworks or Together.AI for running inference as well.
 
-Lets install the fully local implementation of the llama-stack – named `local` above.
-
-To install a distro, we run a simple command providing 2 inputs –
-- **Spec Id** of the distribution that we want to install ( as obtained from the list command )
-- A **Name** by which this installation will be known locally.
+To install a distribution, we run a simple command providing 2 inputs:
+- **Distribution Id** of the distribution that we want to install ( as obtained from the list-distributions command )
+- A **Name** for the specific build and configuration of this distribution.
+
+Let's imagine you are working with a 8B-Instruct model. The following command will build a package (in the form of a Conda environment) _and_ configure it. As part of the configuration, you will be asked for some inputs (model_id, max_seq_len, etc.) Since we are working with a 8B model, we will name our build `8b-instruct` to help us remember the config.
 
 ```
-llama distribution install --spec local --name local_llama_8b
+llama stack build local --name 8b-instruct
 ```
 
-This will create a new conda environment (name can be passed optionally) and install dependencies (via pip) as required by the distro.
-
-Once it runs successfully , you should see some outputs in the form
+Once it runs successfully , you should see some outputs in the form:
 
-```
-llama distribution install --spec local --name local_llama_8b
-```
 <pre style="font-family: monospace;">
-$ llama distribution install --spec local --name local_llama_8b
+$ llama stack build local --name 8b-instruct
 ....
 ....
 Successfully installed cfgv-3.4.0 distlib-0.3.8 identify-2.6.0 libcst-1.4.0 llama_toolchain-0.0.2 moreorless-0.4.0 nodeenv-1.9.1 pre-commit-3.8.0 stdlibs-2024.5.15 toml-0.10.2 tomlkit-0.13.0 trailrunner-1.4.0 ufmt-2.7.0 usort-1.0.8 virtualenv-20.26.3
 
-Distribution `local_llama_8b` (with spec local) has been installed successfully!
+Successfully setup conda environment. Configuring build...
+
+...
+...
+
+YAML configuration has been written to ~/.llama/builds/local/conda/8b-instruct.yaml
 </pre>
 
-Next step is to configure the distribution that you just installed. We provide a simple CLI tool to enable simple configuration.
-This command will walk you through the configuration process.
-It will ask for some details like model name, paths to models, etc.
-
 **NOTE**: You will have to download the models if not done already. Follow instructions here on how to download using the llama cli
-```
-llama distribution configure --name local_llama_8b
-```
+
+You can re-configure this distribution by running:
+```
+llama stack configure local --name 8b-instruct
+```
 
-Here is an example output of how the cli will guide you to fill the configuration:
-<pre style="font-family: monospace;">
-Configuring API surface: inference
+Here is an example run of how the CLI will guide you to fill the configuration
+```
+$ llama stack configure local --name 8b-instruct
+
+Configuring API: inference (meta-reference)
 Enter value for model (required): Meta-Llama3.1-8B-Instruct
 Enter value for quantization (optional):
 Enter value for torch_seed (optional):
 Enter value for max_seq_len (required): 4096
 Enter value for max_batch_size (default: 1): 1
-Configuring API surface: safety
-Do you want to configure llama_guard_shield? (y/n): n
-Do you want to configure prompt_guard_shield? (y/n): n
-Configuring API surface: agentic_system
+
+Configuring API: safety (meta-reference)
+Do you want to configure llama_guard_shield? (y/n): y
+Entering sub-configuration for llama_guard_shield:
+Enter value for model (required): Llama-Guard-3-8B
+Enter value for excluded_categories (required): []
+Enter value for disable_input_check (default: False):
+Enter value for disable_output_check (default: False):
+
+Do you want to configure prompt_guard_shield? (y/n): y
+Entering sub-configuration for prompt_guard_shield:
+Enter value for model (required): Prompt-Guard-86M
 ...
 ...
-YAML configuration has been written to ~/.llama/distributions/local0/config.yaml
-</pre>
+YAML configuration has been written to ~/.llama/builds/local/conda/8b-instruct.yaml
+```
 
-As you can see, we did basic configuration above and configured inference to run on model Meta-Llama3.1-8B-Instruct ( obtained from the llama model list command ).
-For this initial setup we did not set up safety.
+As you can see, we did basic configuration above and configured:
+- inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`)
+- Llama Guard safety shield with model `Llama-Guard-3-8B`
+- Prompt Guard safety shield with model `Prompt-Guard-86M`
 
 For how these configurations are stored as yaml, checkout the file printed at the end of the configuration.
 
+Note that all configurations as well as models are stored in `~/.llama`
+
-## Step 4: Starting a Distribution and Testing it
+## Step 4: Starting a Llama Stack Distribution and Testing it
 
-Now let’s start the distribution using the cli.
-```
-llama distribution start --name local_llama_8b --port 5000
-```
-You should see the distribution start and print the APIs that it is supporting:
+Now let’s start Llama Stack server.
+
+You need the YAML configuration file which was written out at the end by the `llama stack build` step.
+
+```
+llama stack run local --name 8b-instruct --port 5000
+```
+You should see the Stack server start and print the APIs that it is supporting,
+
+```
+$ llama stack run local --name 8b-instruct --port 5000
 
-<pre style="font-family: monospace;">
 > initializing model parallel with size 1
 > initializing ddp with size 1
 > initializing pipeline with size 1
@@ -376,13 +410,21 @@ INFO: Started server process [453333]
 INFO: Waiting for application startup.
 INFO: Application startup complete.
 INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-</pre>
+```
 
-Lets test with a client
+> [!NOTE]
+> Configuration is in `~/.llama/builds/local/conda/8b-instruct.yaml`. Feel free to increase `max_seq_len`.
 
-```
-cd /path/to/llama-toolchain
-conda activate <env-for-distribution> # ( Eg. local_llama_8b in above example )
+> [!IMPORTANT]
+> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
+
+This server is running a Llama model locally.
+
+Lets test with a client.
+```
+cd /path/to/llama-stack
+conda activate <env>  # any environment containing the llama-toolchain pip package will work
 
 python -m llama_toolchain.inference.client localhost 5000
 ```
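Aside for readers following along: once the Stack server is up, the endpoints can also be exercised directly over HTTP rather than through the bundled clients. A minimal sketch using `httpx` (already a dependency in this commit); the exact request body for `/inference/chat_completion` is an assumption here and may differ from the server's actual schema:

```python
import asyncio

import httpx


async def main() -> None:
    # Assumes the server from the docs above is listening on localhost:5000.
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:5000/inference/chat_completion",
            json={
                "model": "Meta-Llama3.1-8B-Instruct",
                "messages": [{"role": "user", "content": "Hello!"}],
                "stream": False,
            },
            headers={"Content-Type": "application/json"},
            timeout=30,
        )
        response.raise_for_status()
        print(response.json())


asyncio.run(main())
```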
llama_toolchain/agentic_system/api/__init__.py

@@ -4,5 +4,4 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from .datatypes import *  # noqa
-from .endpoints import *  # noqa
+from .api import *  # noqa: F401 F403
413  llama_toolchain/agentic_system/api/api.py  Normal file

@@ -0,0 +1,413 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Literal, Optional, Protocol, Union

from llama_models.schema_utils import json_schema_type, webmethod

from pydantic import BaseModel, ConfigDict, Field
from typing_extensions import Annotated

from llama_models.llama3.api.datatypes import *  # noqa: F403
from llama_toolchain.common.deployment_types import *  # noqa: F403
from llama_toolchain.inference.api import *  # noqa: F403
from llama_toolchain.safety.api import *  # noqa: F403
from llama_toolchain.memory.api import *  # noqa: F403


@json_schema_type
class Attachment(BaseModel):
    content: InterleavedTextMedia | URL
    mime_type: str


class AgenticSystemTool(Enum):
    brave_search = "brave_search"
    wolfram_alpha = "wolfram_alpha"
    photogen = "photogen"
    code_interpreter = "code_interpreter"

    function_call = "function_call"
    memory = "memory"


class ToolDefinitionCommon(BaseModel):
    input_shields: Optional[List[ShieldDefinition]] = Field(default_factory=list)
    output_shields: Optional[List[ShieldDefinition]] = Field(default_factory=list)


@json_schema_type
class BraveSearchToolDefinition(ToolDefinitionCommon):
    type: Literal[AgenticSystemTool.brave_search.value] = (
        AgenticSystemTool.brave_search.value
    )
    remote_execution: Optional[RestAPIExecutionConfig] = None


@json_schema_type
class WolframAlphaToolDefinition(ToolDefinitionCommon):
    type: Literal[AgenticSystemTool.wolfram_alpha.value] = (
        AgenticSystemTool.wolfram_alpha.value
    )
    remote_execution: Optional[RestAPIExecutionConfig] = None


@json_schema_type
class PhotogenToolDefinition(ToolDefinitionCommon):
    type: Literal[AgenticSystemTool.photogen.value] = AgenticSystemTool.photogen.value
    remote_execution: Optional[RestAPIExecutionConfig] = None


@json_schema_type
class CodeInterpreterToolDefinition(ToolDefinitionCommon):
    type: Literal[AgenticSystemTool.code_interpreter.value] = (
        AgenticSystemTool.code_interpreter.value
    )
    enable_inline_code_execution: bool = True
    remote_execution: Optional[RestAPIExecutionConfig] = None


@json_schema_type
class FunctionCallToolDefinition(ToolDefinitionCommon):
    type: Literal[AgenticSystemTool.function_call.value] = (
        AgenticSystemTool.function_call.value
    )
    function_name: str
    description: str
    parameters: Dict[str, ToolParamDefinition]
    remote_execution: Optional[RestAPIExecutionConfig] = None


class _MemoryBankConfigCommon(BaseModel):
    bank_id: str


class AgenticSystemVectorMemoryBankConfig(_MemoryBankConfigCommon):
    type: Literal[MemoryBankType.vector.value] = MemoryBankType.vector.value


class AgenticSystemKeyValueMemoryBankConfig(_MemoryBankConfigCommon):
    type: Literal[MemoryBankType.keyvalue.value] = MemoryBankType.keyvalue.value
    keys: List[str]  # what keys to focus on


class AgenticSystemKeywordMemoryBankConfig(_MemoryBankConfigCommon):
    type: Literal[MemoryBankType.keyword.value] = MemoryBankType.keyword.value


class AgenticSystemGraphMemoryBankConfig(_MemoryBankConfigCommon):
    type: Literal[MemoryBankType.graph.value] = MemoryBankType.graph.value
    entities: List[str]  # what entities to focus on


MemoryBankConfig = Annotated[
    Union[
        AgenticSystemVectorMemoryBankConfig,
        AgenticSystemKeyValueMemoryBankConfig,
        AgenticSystemKeywordMemoryBankConfig,
        AgenticSystemGraphMemoryBankConfig,
    ],
    Field(discriminator="type"),
]


@json_schema_type
class MemoryToolDefinition(ToolDefinitionCommon):
    type: Literal[AgenticSystemTool.memory.value] = AgenticSystemTool.memory.value
    memory_bank_configs: List[MemoryBankConfig] = Field(default_factory=list)
    max_tokens_in_context: int = 4096
    max_chunks: int = 10


AgenticSystemToolDefinition = Annotated[
    Union[
        BraveSearchToolDefinition,
        WolframAlphaToolDefinition,
        PhotogenToolDefinition,
        CodeInterpreterToolDefinition,
        FunctionCallToolDefinition,
        MemoryToolDefinition,
    ],
    Field(discriminator="type"),
]


class StepCommon(BaseModel):
    turn_id: str
    step_id: str
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None


class StepType(Enum):
    inference = "inference"
    tool_execution = "tool_execution"
    shield_call = "shield_call"
    memory_retrieval = "memory_retrieval"


@json_schema_type
class InferenceStep(StepCommon):
    model_config = ConfigDict(protected_namespaces=())

    step_type: Literal[StepType.inference.value] = StepType.inference.value
    model_response: CompletionMessage


@json_schema_type
class ToolExecutionStep(StepCommon):
    step_type: Literal[StepType.tool_execution.value] = StepType.tool_execution.value
    tool_calls: List[ToolCall]
    tool_responses: List[ToolResponse]


@json_schema_type
class ShieldCallStep(StepCommon):
    step_type: Literal[StepType.shield_call.value] = StepType.shield_call.value
    response: ShieldResponse


@json_schema_type
class MemoryRetrievalStep(StepCommon):
    step_type: Literal[StepType.memory_retrieval.value] = (
        StepType.memory_retrieval.value
    )
    memory_bank_ids: List[str]
    inserted_context: InterleavedTextMedia


Step = Annotated[
    Union[
        InferenceStep,
        ToolExecutionStep,
        ShieldCallStep,
        MemoryRetrievalStep,
    ],
    Field(discriminator="step_type"),
]


@json_schema_type
class Turn(BaseModel):
    """A single turn in an interaction with an Agentic System."""

    turn_id: str
    session_id: str
    input_messages: List[
        Union[
            UserMessage,
            ToolResponseMessage,
        ]
    ]
    steps: List[Step]
    output_message: CompletionMessage
    output_attachments: List[Attachment] = Field(default_factory=list)

    started_at: datetime
    completed_at: Optional[datetime] = None


@json_schema_type
class Session(BaseModel):
    """A single session of an interaction with an Agentic System."""

    session_id: str
    session_name: str
    turns: List[Turn]
    started_at: datetime

    memory_bank: Optional[MemoryBank] = None


class AgentConfigCommon(BaseModel):
    sampling_params: Optional[SamplingParams] = SamplingParams()

    input_shields: Optional[List[ShieldDefinition]] = Field(default_factory=list)
    output_shields: Optional[List[ShieldDefinition]] = Field(default_factory=list)

    tools: Optional[List[AgenticSystemToolDefinition]] = Field(default_factory=list)
    tool_choice: Optional[ToolChoice] = Field(default=ToolChoice.auto)
    tool_prompt_format: Optional[ToolPromptFormat] = Field(
        default=ToolPromptFormat.json
    )


@json_schema_type
class AgentConfig(AgentConfigCommon):
    model: str
    instructions: str


class AgentConfigOverridablePerTurn(AgentConfigCommon):
    instructions: Optional[str] = None


class AgenticSystemTurnResponseEventType(Enum):
    step_start = "step_start"
    step_complete = "step_complete"
    step_progress = "step_progress"

    turn_start = "turn_start"
    turn_complete = "turn_complete"


@json_schema_type
class AgenticSystemTurnResponseStepStartPayload(BaseModel):
    event_type: Literal[AgenticSystemTurnResponseEventType.step_start.value] = (
        AgenticSystemTurnResponseEventType.step_start.value
    )
    step_type: StepType
    step_id: str
    metadata: Optional[Dict[str, Any]] = Field(default_factory=dict)


@json_schema_type
class AgenticSystemTurnResponseStepCompletePayload(BaseModel):
    event_type: Literal[AgenticSystemTurnResponseEventType.step_complete.value] = (
        AgenticSystemTurnResponseEventType.step_complete.value
    )
    step_type: StepType
    step_details: Step


@json_schema_type
class AgenticSystemTurnResponseStepProgressPayload(BaseModel):
    model_config = ConfigDict(protected_namespaces=())

    event_type: Literal[AgenticSystemTurnResponseEventType.step_progress.value] = (
        AgenticSystemTurnResponseEventType.step_progress.value
    )
    step_type: StepType
    step_id: str

    model_response_text_delta: Optional[str] = None
    tool_call_delta: Optional[ToolCallDelta] = None
    tool_response_text_delta: Optional[str] = None


@json_schema_type
class AgenticSystemTurnResponseTurnStartPayload(BaseModel):
    event_type: Literal[AgenticSystemTurnResponseEventType.turn_start.value] = (
        AgenticSystemTurnResponseEventType.turn_start.value
    )
    turn_id: str


@json_schema_type
class AgenticSystemTurnResponseTurnCompletePayload(BaseModel):
    event_type: Literal[AgenticSystemTurnResponseEventType.turn_complete.value] = (
        AgenticSystemTurnResponseEventType.turn_complete.value
    )
    turn: Turn


@json_schema_type
class AgenticSystemTurnResponseEvent(BaseModel):
    """Streamed agent execution response."""

    payload: Annotated[
        Union[
            AgenticSystemTurnResponseStepStartPayload,
            AgenticSystemTurnResponseStepProgressPayload,
            AgenticSystemTurnResponseStepCompletePayload,
            AgenticSystemTurnResponseTurnStartPayload,
            AgenticSystemTurnResponseTurnCompletePayload,
        ],
        Field(discriminator="event_type"),
    ]


@json_schema_type
class AgenticSystemCreateResponse(BaseModel):
    agent_id: str


@json_schema_type
class AgenticSystemSessionCreateResponse(BaseModel):
    session_id: str


@json_schema_type
class AgenticSystemTurnCreateRequest(AgentConfigOverridablePerTurn):
    agent_id: str
    session_id: str

    # TODO: figure out how we can simplify this and make why
    # ToolResponseMessage needs to be here (it is function call
    # execution from outside the system)
    messages: List[
        Union[
            UserMessage,
            ToolResponseMessage,
        ]
    ]
    attachments: Optional[List[Attachment]] = None

    stream: Optional[bool] = False


@json_schema_type
class AgenticSystemTurnResponseStreamChunk(BaseModel):
    event: AgenticSystemTurnResponseEvent


@json_schema_type
class AgenticSystemStepResponse(BaseModel):
    step: Step


class AgenticSystem(Protocol):
    @webmethod(route="/agentic_system/create")
    async def create_agentic_system(
        self,
        agent_config: AgentConfig,
    ) -> AgenticSystemCreateResponse: ...

    @webmethod(route="/agentic_system/turn/create")
    async def create_agentic_system_turn(
        self,
        request: AgenticSystemTurnCreateRequest,
    ) -> AgenticSystemTurnResponseStreamChunk: ...

    @webmethod(route="/agentic_system/turn/get")
    async def get_agentic_system_turn(
        self,
        agent_id: str,
        turn_id: str,
    ) -> Turn: ...

    @webmethod(route="/agentic_system/step/get")
    async def get_agentic_system_step(
        self, agent_id: str, turn_id: str, step_id: str
    ) -> AgenticSystemStepResponse: ...

    @webmethod(route="/agentic_system/session/create")
    async def create_agentic_system_session(
        self,
        agent_id: str,
        session_name: str,
    ) -> AgenticSystemSessionCreateResponse: ...

    @webmethod(route="/agentic_system/session/get")
    async def get_agentic_system_session(
        self,
        agent_id: str,
        session_id: str,
        turn_ids: Optional[List[str]] = None,
    ) -> Session: ...

    @webmethod(route="/agentic_system/session/delete")
    async def delete_agentic_system_session(
        self, agent_id: str, session_id: str
    ) -> None: ...

    @webmethod(route="/agentic_system/delete")
    async def delete_agentic_system(
        self,
        agent_id: str,
    ) -> None: ...
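Taken together, the new datatypes above compose as follows. A minimal sketch of building an `AgentConfig` that enables the new memory (RAG) tool alongside a builtin tool; the model name and `bank_id` are illustrative placeholders, not values mandated by this commit:

```python
from llama_toolchain.agentic_system.api import (
    AgentConfig,
    AgenticSystemVectorMemoryBankConfig,
    BraveSearchToolDefinition,
    MemoryToolDefinition,
)

# Hypothetical values for illustration only.
agent_config = AgentConfig(
    model="Meta-Llama3.1-8B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        BraveSearchToolDefinition(),
        MemoryToolDefinition(
            # Attach an existing vector memory bank; the id is a placeholder.
            memory_bank_configs=[
                AgenticSystemVectorMemoryBankConfig(bank_id="my-bank-id")
            ],
            max_tokens_in_context=2048,
            max_chunks=5,
        ),
    ],
)
```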
llama_toolchain/agentic_system/api/datatypes.py (deleted)

@@ -1,234 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Literal, Optional, Union

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel, ConfigDict, Field
from typing_extensions import Annotated

from llama_toolchain.common.deployment_types import *  # noqa: F403
from llama_toolchain.inference.api import *  # noqa: F403
from llama_toolchain.safety.api.datatypes import *  # noqa: F403
from llama_toolchain.memory.api.datatypes import *  # noqa: F403


@json_schema_type
class AgenticSystemToolDefinition(ToolDefinition):
    execution_config: Optional[RestAPIExecutionConfig] = None
    input_shields: Optional[List[ShieldDefinition]] = Field(default_factory=list)
    output_shields: Optional[List[ShieldDefinition]] = Field(default_factory=list)


class StepCommon(BaseModel):
    turn_id: str
    step_id: str
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None


class StepType(Enum):
    inference = "inference"
    tool_execution = "tool_execution"
    shield_call = "shield_call"
    memory_retrieval = "memory_retrieval"


@json_schema_type
class InferenceStep(StepCommon):
    model_config = ConfigDict(protected_namespaces=())

    step_type: Literal[StepType.inference.value] = StepType.inference.value
    model_response: CompletionMessage


@json_schema_type
class ToolExecutionStep(StepCommon):
    step_type: Literal[StepType.tool_execution.value] = StepType.tool_execution.value
    tool_calls: List[ToolCall]
    tool_responses: List[ToolResponse]


@json_schema_type
class ShieldCallStep(StepCommon):
    step_type: Literal[StepType.shield_call.value] = StepType.shield_call.value
    response: ShieldResponse


@json_schema_type
class MemoryRetrievalStep(StepCommon):
    step_type: Literal[StepType.memory_retrieval.value] = (
        StepType.memory_retrieval.value
    )
    memory_bank_ids: List[str]
    documents: List[MemoryBankDocument]
    scores: List[float]


Step = Annotated[
    Union[
        InferenceStep,
        ToolExecutionStep,
        ShieldCallStep,
        MemoryRetrievalStep,
    ],
    Field(discriminator="step_type"),
]


@json_schema_type
class Turn(BaseModel):
    """A single turn in an interaction with an Agentic System."""

    turn_id: str
    session_id: str
    input_messages: List[
        Union[
            UserMessage,
            ToolResponseMessage,
        ]
    ]
    steps: List[Step]
    output_message: CompletionMessage
    started_at: datetime
    completed_at: Optional[datetime] = None


@json_schema_type
class Session(BaseModel):
    """A single session of an interaction with an Agentic System."""

    session_id: str
    session_name: str
    turns: List[Turn]
    started_at: datetime


@json_schema_type
class ToolPromptFormat(Enum):
    """This Enum refers to the prompt format for calling zero shot tools

    `json` --
        Refers to the json format for calling tools.
        The json format takes the form like
        {
            "type": "function",
            "function" : {
                "name": "function_name",
                "description": "function_description",
                "parameters": {...}
            }
        }

    `function_tag` --
        This is an example of how you could define
        your own user defined format for making tool calls.
        The function_tag format looks like this,
        <function=function_name>(parameters)</function>

    The detailed prompts for each of these formats are defined in `system_prompt.py`
    """

    json = "json"
    function_tag = "function_tag"


@json_schema_type
class AgenticSystemInstanceConfig(BaseModel):
    instructions: str
    sampling_params: Optional[SamplingParams] = SamplingParams()
    # zero-shot or built-in tool configurations as input to the model
    available_tools: Optional[List[AgenticSystemToolDefinition]] = Field(
        default_factory=list
    )

    input_shields: Optional[List[ShieldDefinition]] = Field(default_factory=list)
    output_shields: Optional[List[ShieldDefinition]] = Field(default_factory=list)

    # if you completely want to replace the messages prefixed by the system,
    # this is debug only
    debug_prefix_messages: Optional[List[Message]] = Field(default_factory=list)
    tool_prompt_format: Optional[ToolPromptFormat] = Field(
        default=ToolPromptFormat.json
    )


class AgenticSystemTurnResponseEventType(Enum):
    step_start = "step_start"
    step_complete = "step_complete"
    step_progress = "step_progress"

    turn_start = "turn_start"
    turn_complete = "turn_complete"


@json_schema_type
class AgenticSystemTurnResponseStepStartPayload(BaseModel):
    event_type: Literal[AgenticSystemTurnResponseEventType.step_start.value] = (
        AgenticSystemTurnResponseEventType.step_start.value
    )
    step_type: StepType
    step_id: str
    metadata: Optional[Dict[str, Any]] = Field(default_factory=dict)


@json_schema_type
class AgenticSystemTurnResponseStepCompletePayload(BaseModel):
    event_type: Literal[AgenticSystemTurnResponseEventType.step_complete.value] = (
        AgenticSystemTurnResponseEventType.step_complete.value
    )
    step_type: StepType
    step_details: Step


@json_schema_type
class AgenticSystemTurnResponseStepProgressPayload(BaseModel):
    model_config = ConfigDict(protected_namespaces=())

    event_type: Literal[AgenticSystemTurnResponseEventType.step_progress.value] = (
        AgenticSystemTurnResponseEventType.step_progress.value
    )
    step_type: StepType
    step_id: str

    model_response_text_delta: Optional[str] = None
    tool_call_delta: Optional[ToolCallDelta] = None
    tool_response_text_delta: Optional[str] = None


@json_schema_type
class AgenticSystemTurnResponseTurnStartPayload(BaseModel):
    event_type: Literal[AgenticSystemTurnResponseEventType.turn_start.value] = (
        AgenticSystemTurnResponseEventType.turn_start.value
    )
    turn_id: str


@json_schema_type
class AgenticSystemTurnResponseTurnCompletePayload(BaseModel):
    event_type: Literal[AgenticSystemTurnResponseEventType.turn_complete.value] = (
        AgenticSystemTurnResponseEventType.turn_complete.value
    )
    turn: Turn


@json_schema_type
class AgenticSystemTurnResponseEvent(BaseModel):
    """Streamed agent execution response."""

    payload: Annotated[
        Union[
            AgenticSystemTurnResponseStepStartPayload,
            AgenticSystemTurnResponseStepProgressPayload,
            AgenticSystemTurnResponseStepCompletePayload,
            AgenticSystemTurnResponseTurnStartPayload,
            AgenticSystemTurnResponseTurnCompletePayload,
        ],
        Field(discriminator="event_type"),
    ]
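Note that `ToolPromptFormat`, deleted here, is not gone: per the commit message it moved to `llama_models.llama3.api`, and the reworked client below pulls it in via a star import from that package. A one-line sketch of the updated import, assuming the relocation described in the commit message:

```python
from llama_models.llama3.api.datatypes import ToolPromptFormat
```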
llama_toolchain/agentic_system/api/endpoints.py (deleted)

@@ -1,127 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .datatypes import *  # noqa: F403
from typing import Protocol

# this dependency is annoying and we need a forked up version anyway
from llama_models.schema_utils import json_schema_type, webmethod


@json_schema_type
class AgenticSystemCreateRequest(BaseModel):
    model: str
    instance_config: AgenticSystemInstanceConfig


@json_schema_type
class AgenticSystemCreateResponse(BaseModel):
    system_id: str


@json_schema_type
class AgenticSystemSessionCreateRequest(BaseModel):
    system_id: str
    session_name: str


@json_schema_type
class AgenticSystemSessionCreateResponse(BaseModel):
    session_id: str


@json_schema_type
# what's the URI?
class AgenticSystemTurnCreateRequest(BaseModel):
    system_id: str
    session_id: str

    messages: List[
        Union[
            UserMessage,
            ToolResponseMessage,
        ]
    ]

    stream: Optional[bool] = False
    override_config: Optional[AgenticSystemInstanceConfig] = None


@json_schema_type
class AgenticSystemTurnResponseStreamChunk(BaseModel):
    event: AgenticSystemTurnResponseEvent


@json_schema_type
class AgenticSystemStepResponse(BaseModel):
    step: Step


class AgenticSystem(Protocol):
    @webmethod(route="/agentic_system/create")
    async def create_agentic_system(
        self,
        request: AgenticSystemCreateRequest,
    ) -> AgenticSystemCreateResponse: ...

    @webmethod(route="/agentic_system/turn/create")
    async def create_agentic_system_turn(
        self,
        request: AgenticSystemTurnCreateRequest,
    ) -> AgenticSystemTurnResponseStreamChunk: ...

    @webmethod(route="/agentic_system/turn/get")
    async def get_agentic_system_turn(
        self,
        agent_id: str,
        turn_id: str,
    ) -> Turn: ...

    @webmethod(route="/agentic_system/step/get")
    async def get_agentic_system_step(
        self, agent_id: str, turn_id: str, step_id: str
    ) -> AgenticSystemStepResponse: ...

    @webmethod(route="/agentic_system/session/create")
    async def create_agentic_system_session(
        self,
        request: AgenticSystemSessionCreateRequest,
    ) -> AgenticSystemSessionCreateResponse: ...

    @webmethod(route="/agentic_system/memory_bank/attach")
    async def attach_memory_bank_to_agentic_system(
        self,
        agent_id: str,
        session_id: str,
        memory_bank_ids: List[str],
    ) -> None: ...

    @webmethod(route="/agentic_system/memory_bank/detach")
    async def detach_memory_bank_from_agentic_system(
        self,
        agent_id: str,
        session_id: str,
        memory_bank_ids: List[str],
    ) -> None: ...

    @webmethod(route="/agentic_system/session/get")
    async def get_agentic_system_session(
        self,
        agent_id: str,
        session_id: str,
        turn_ids: Optional[List[str]] = None,
    ) -> Session: ...

    @webmethod(route="/agentic_system/session/delete")
    async def delete_agentic_system_session(
        self, agent_id: str, session_id: str
    ) -> None: ...

    @webmethod(route="/agentic_system/delete")
    async def delete_agentic_system(
        self,
        agent_id: str,
    ) -> None: ...
llama_toolchain/agentic_system/client.py

@@ -6,38 +6,28 @@
 
 import asyncio
 import json
 
 from typing import AsyncGenerator
 
 import fire
 
 import httpx
 
-from llama_models.llama3.api.datatypes import (
-    BuiltinTool,
-    SamplingParams,
-    ToolParamDefinition,
-    UserMessage,
-)
 from pydantic import BaseModel
 from termcolor import cprint
 
-from llama_toolchain.agentic_system.event_logger import EventLogger
-from .api import (
-    AgenticSystem,
-    AgenticSystemCreateRequest,
-    AgenticSystemCreateResponse,
-    AgenticSystemInstanceConfig,
-    AgenticSystemSessionCreateRequest,
-    AgenticSystemSessionCreateResponse,
-    AgenticSystemToolDefinition,
-    AgenticSystemTurnCreateRequest,
-    AgenticSystemTurnResponseStreamChunk,
-    ToolPromptFormat,
-)
+from llama_models.llama3.api.datatypes import *  # noqa: F403
+from llama_toolchain.core.datatypes import RemoteProviderConfig
+
+from .api import *  # noqa: F403
+from .event_logger import EventLogger
 
 
-async def get_client_impl(base_url: str):
-    return AgenticSystemClient(base_url)
+async def get_client_impl(config: RemoteProviderConfig, _deps):
+    return AgenticSystemClient(config.url)
 
 
 def encodable_dict(d: BaseModel):
     return json.loads(d.json())
 
 
 class AgenticSystemClient(AgenticSystem):

@@ -45,12 +35,14 @@ class AgenticSystemClient(AgenticSystem):
         self.base_url = base_url
 
     async def create_agentic_system(
-        self, request: AgenticSystemCreateRequest
+        self, agent_config: AgentConfig
     ) -> AgenticSystemCreateResponse:
         async with httpx.AsyncClient() as client:
             response = await client.post(
                 f"{self.base_url}/agentic_system/create",
-                data=request.json(),
+                json={
+                    "agent_config": encodable_dict(agent_config),
+                },
                 headers={"Content-Type": "application/json"},
             )
             response.raise_for_status()

@@ -58,12 +50,16 @@ class AgenticSystemClient(AgenticSystem):
 
     async def create_agentic_system_session(
         self,
-        request: AgenticSystemSessionCreateRequest,
+        agent_id: str,
+        session_name: str,
     ) -> AgenticSystemSessionCreateResponse:
         async with httpx.AsyncClient() as client:
             response = await client.post(
                 f"{self.base_url}/agentic_system/session/create",
-                data=request.json(),
+                json={
+                    "agent_id": agent_id,
+                    "session_name": session_name,
+                },
                 headers={"Content-Type": "application/json"},
             )
             response.raise_for_status()

@@ -77,7 +73,9 @@ class AgenticSystemClient(AgenticSystem):
             async with client.stream(
                 "POST",
                 f"{self.base_url}/agentic_system/turn/create",
-                data=request.json(),
+                json={
+                    "request": encodable_dict(request),
+                },
                 headers={"Content-Type": "application/json"},
                 timeout=20,
             ) as response:

@@ -85,6 +83,10 @@ class AgenticSystemClient(AgenticSystem):
                     if line.startswith("data:"):
                         data = line[len("data: ") :]
                         try:
+                            if "error" in data:
+                                cprint(data, "red")
+                                continue
+
                             yield AgenticSystemTurnResponseStreamChunk(
                                 **json.loads(data)
                             )

@@ -93,24 +95,52 @@ class AgenticSystemClient(AgenticSystem):
                             print(f"Error with parsing or validation: {e}")
 
 
+async def _run_agent(api, tool_definitions, user_prompts, attachments=None):
+    agent_config = AgentConfig(
+        model="Meta-Llama3.1-8B-Instruct",
+        instructions="You are a helpful assistant",
+        sampling_params=SamplingParams(temperature=1.0, top_p=0.9),
+        tools=tool_definitions,
+        tool_choice=ToolChoice.auto,
+        tool_prompt_format=ToolPromptFormat.function_tag,
+    )
+
+    create_response = await api.create_agentic_system(agent_config)
+    session_response = await api.create_agentic_system_session(
+        agent_id=create_response.agent_id,
+        session_name="test_session",
+    )
+
+    for content in user_prompts:
+        cprint(f"User> {content}", color="white", attrs=["bold"])
+        iterator = api.create_agentic_system_turn(
+            AgenticSystemTurnCreateRequest(
+                agent_id=create_response.agent_id,
+                session_id=session_response.session_id,
+                messages=[
+                    UserMessage(content=content),
+                ],
+                attachments=attachments,
+                stream=True,
+            )
+        )
+
+        async for event, log in EventLogger().log(iterator):
+            if log is not None:
+                log.print()
+
+
 async def run_main(host: str, port: int):
     # client to test remote impl of agentic system
     api = AgenticSystemClient(f"http://{host}:{port}")
 
     tool_definitions = [
-        AgenticSystemToolDefinition(
-            tool_name=BuiltinTool.brave_search,
-        ),
-        AgenticSystemToolDefinition(
-            tool_name=BuiltinTool.wolfram_alpha,
-        ),
-        AgenticSystemToolDefinition(
-            tool_name=BuiltinTool.code_interpreter,
-        ),
+        BraveSearchToolDefinition(),
+        WolframAlphaToolDefinition(),
+        CodeInterpreterToolDefinition(),
     ]
     tool_definitions += [
-        AgenticSystemToolDefinition(
-            tool_name="get_boiling_point",
+        FunctionCallToolDefinition(
+            function_name="get_boiling_point",
             description="Get the boiling point of a imaginary liquids (eg. polyjuice)",
             parameters={
                 "liquid_name": ToolParamDefinition(

@@ -127,30 +157,6 @@ async def run_main(host: str, port: int):
         ),
     ]
 
-    create_request = AgenticSystemCreateRequest(
-        model="Meta-Llama3.1-8B-Instruct",
-        instance_config=AgenticSystemInstanceConfig(
-            instructions="You are a helpful assistant",
-            sampling_params=SamplingParams(),
-            available_tools=tool_definitions,
-            input_shields=[],
-            output_shields=[],
-            debug_prefix_messages=[],
-            tool_prompt_format=ToolPromptFormat.json,
-        ),
-    )
-
-    create_response = await api.create_agentic_system(create_request)
-    print(create_response)
-
-    session_response = await api.create_agentic_system_session(
-        AgenticSystemSessionCreateRequest(
-            system_id=create_response.system_id,
-            session_name="test_session",
-        )
-    )
-    print(session_response)
-
     user_prompts = [
         "Who are you?",
         "what is the 100th prime number?",

@@ -158,26 +164,51 @@ async def run_main(host: str, port: int):
         "Write code to check if a number is prime. Use that to check if 7 is prime",
         "What is the boiling point of polyjuicepotion ?",
     ]
-    for content in user_prompts:
-        cprint(f"User> {content}", color="blue")
-        iterator = api.create_agentic_system_turn(
-            AgenticSystemTurnCreateRequest(
-                system_id=create_response.system_id,
-                session_id=session_response.session_id,
-                messages=[
-                    UserMessage(content=content),
-                ],
-                stream=True,
-            )
-        )
+    await _run_agent(api, tool_definitions, user_prompts)
+
+
+async def run_rag(host: str, port: int):
+    api = AgenticSystemClient(f"http://{host}:{port}")
+
+    urls = [
+        "memory_optimizations.rst",
+        "chat.rst",
+        "llama3.rst",
+        "datasets.rst",
+        "qat_finetune.rst",
+        "lora_finetune.rst",
+    ]
+    attachments = [
+        Attachment(
+            content=URL(
+                uri=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}"
+            ),
+            mime_type="text/plain",
+        )
+        for i, url in enumerate(urls)
+    ]
 
-        async for event, log in EventLogger().log(iterator):
-            if log is not None:
-                log.print()
+    # Alternatively, you can pre-populate the memory bank with documents for example,
+    # using `llama_toolchain.memory.client`. Then you can grab the bank_id
+    # from the output of that run.
+    tool_definitions = [
+        MemoryToolDefinition(
+            max_tokens_in_context=2048,
+            memory_bank_configs=[],
+        ),
+    ]
+
+    user_prompts = [
+        "How do I use Lora?",
+        "Tell me briefly about llama3 and torchtune",
+    ]
+
+    await _run_agent(api, tool_definitions, user_prompts, attachments)
 
 
-def main(host: str, port: int):
-    asyncio.run(run_main(host, port))
+def main(host: str, port: int, rag: bool = False):
+    fn = run_rag if rag else run_main
+    asyncio.run(fn(host, port))
 
 
 if __name__ == "__main__":
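The reworked client doubles as a smoke test for the new RAG path: `run_rag` attaches the torchtune docs as URL attachments and enables the memory tool. A usage sketch that drives it programmatically (assuming a Llama Stack server with inference, memory, safety and agentic_system providers is already running on port 5000):

```python
import asyncio

from llama_toolchain.agentic_system.client import run_rag

# Exercises the memory tool end-to-end against a local Stack server.
asyncio.run(run_rag("localhost", 5000))
```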
llama_toolchain/agentic_system/event_logger.py

@@ -6,7 +6,7 @@
 
 from typing import Optional
 
-from llama_models.llama3.api.datatypes import ToolResponseMessage
+from llama_models.llama3.api.datatypes import *  # noqa: F403
 from llama_models.llama3.api.tool_utils import ToolUtils
 
 from termcolor import cprint

@@ -44,7 +44,12 @@ EventType = AgenticSystemTurnResponseEventType
 
 
 class EventLogger:
-    async def log(self, event_generator, stream=True):
+    async def log(
+        self,
+        event_generator,
+        stream=True,
+        tool_prompt_format: ToolPromptFormat = ToolPromptFormat.json,
+    ):
         previous_event_type = None
         previous_step_type = None
 

@@ -132,7 +137,9 @@ class EventLogger:
             if event_type == EventType.step_complete.value:
                 response = event.payload.step_details.model_response
                 if response.tool_calls:
-                    content = ToolUtils.encode_tool_call(response.tool_calls[0])
+                    content = ToolUtils.encode_tool_call(
+                        response.tool_calls[0], tool_prompt_format
+                    )
                 else:
                     content = response.content
                 yield event, LogEvent(

@@ -162,5 +169,19 @@ class EventLogger:
                     color="green",
                 )
 
+            if (
+                step_type == StepType.memory_retrieval
+                and event_type == EventType.step_complete.value
+            ):
+                details = event.payload.step_details
+                content = interleaved_text_media_as_str(details.inserted_context)
+                content = content[:200] + "..." if len(content) > 200 else content
+
+                yield event, LogEvent(
+                    role=step_type,
+                    content=f"Retrieved context from banks: {details.memory_bank_ids}.\n====\n{content}\n>",
+                    color="cyan",
+                )
+
             preivous_event_type = event_type
             previous_step_type = step_type
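Since `ToolUtils.encode_tool_call` now needs the prompt format, callers that configured their agent with `function_tag` should pass the same format through to the logger. A minimal sketch using only names from the hunks above; `iterator` stands in for the turn-response stream returned by the client:

```python
from llama_models.llama3.api.datatypes import ToolPromptFormat
from llama_toolchain.agentic_system.event_logger import EventLogger


async def print_turn(iterator) -> None:
    # Keep the logger's tool-call rendering in sync with the agent's
    # configured prompt format (the agent in client.py uses function_tag).
    async for event, log in EventLogger().log(
        iterator, tool_prompt_format=ToolPromptFormat.function_tag
    ):
        if log is not None:
            log.print()
```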
96  llama_toolchain/agentic_system/execute_with_custom_tools.py  Normal file

@@ -0,0 +1,96 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
|
||||
# All rights reserved.
|
||||
#
|
||||
# This source code is licensed under the terms described in the LICENSE file in
|
||||
# the root directory of this source tree.
|
||||
|
||||
from typing import AsyncGenerator, List
|
||||
|
||||
from llama_models.llama3.api.datatypes import * # noqa: F403
|
||||
from llama_toolchain.agentic_system.api import * # noqa: F403
|
||||
from llama_toolchain.memory.api import * # noqa: F403
|
||||
from llama_toolchain.safety.api import * # noqa: F403
|
||||
|
||||
from llama_toolchain.agentic_system.api import (
|
||||
AgenticSystemTurnResponseEventType as EventType,
|
||||
)
|
||||
from llama_toolchain.tools.custom.datatypes import CustomTool
|
||||
|
||||
|
||||
class AgentWithCustomToolExecutor:
|
||||
def __init__(
|
||||
self,
|
||||
api: AgenticSystem,
|
||||
agent_id: str,
|
||||
session_id: str,
|
||||
agent_config: AgentConfig,
|
||||
custom_tools: List[CustomTool],
|
||||
):
|
||||
self.api = api
|
||||
self.agent_id = agent_id
|
||||
self.session_id = session_id
|
||||
self.agent_config = agent_config
|
||||
self.custom_tools = custom_tools
|
||||
|
||||
async def execute_turn(
|
||||
self,
|
||||
messages: List[Message],
|
||||
attachments: Optional[List[Attachment]] = None,
|
||||
max_iters: int = 5,
|
||||
stream: bool = True,
|
||||
) -> AsyncGenerator:
|
||||
tools_dict = {t.get_name(): t for t in self.custom_tools}
|
||||
|
||||
current_messages = messages.copy()
|
||||
n_iter = 0
|
||||
while n_iter < max_iters:
|
||||
n_iter += 1
|
||||
|
||||
request = AgenticSystemTurnCreateRequest(
|
||||
agent_id=self.agent_id,
|
||||
session_id=self.session_id,
|
||||
messages=current_messages,
|
||||
attachments=attachments,
|
||||
stream=stream,
|
||||
)
|
||||
|
||||
turn = None
|
||||
async for chunk in self.api.create_agentic_system_turn(request):
|
||||
if chunk.event.payload.event_type != EventType.turn_complete.value:
|
||||
yield chunk
|
||||
else:
|
||||
turn = chunk.event.payload.turn
|
||||
|
||||
message = turn.output_message
|
||||
if len(message.tool_calls) == 0:
|
||||
yield chunk
|
||||
return
|
||||
|
||||
if message.stop_reason == StopReason.out_of_tokens:
|
||||
yield chunk
|
||||
return
|
||||
|
||||
tool_call = message.tool_calls[0]
|
||||
if tool_call.tool_name not in tools_dict:
|
||||
m = ToolResponseMessage(
|
||||
call_id=tool_call.call_id,
|
||||
tool_name=tool_call.tool_name,
|
||||
content=f"Unknown tool `{tool_call.tool_name}` was called. Try again with something else",
|
||||
)
|
||||
next_message = m
|
||||
else:
|
||||
tool = tools_dict[tool_call.tool_name]
|
||||
result_messages = await execute_custom_tool(tool, message)
|
||||
next_message = result_messages[0]
|
||||
|
||||
yield next_message
|
||||
current_messages = [next_message]
|
||||
|
||||
|
||||
async def execute_custom_tool(tool: CustomTool, message: Message) -> List[Message]:
|
||||
result_messages = await tool.run([message])
|
||||
assert (
|
||||
len(result_messages) == 1
|
||||
), f"Expected single message, got {len(result_messages)}"
|
||||
|
||||
return result_messages
|
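A sketch of driving this executor end to end, assuming an AgenticSystem client plus an agent and session created through this commit's create-agent and create-session calls (the message literal and the `my_tools` list are illustrative assumptions):

    # Illustrative wiring; `api`, `agent_id`, `session_id`, `agent_config`,
    # and `my_tools` are assumed to come from earlier API calls.
    async def chat_once(api, agent_id, session_id, agent_config, my_tools):
        executor = AgentWithCustomToolExecutor(
            api, agent_id, session_id, agent_config, my_tools
        )
        messages = [UserMessage(content="Find the cheapest flight to Denver")]
        async for chunk in executor.execute_turn(messages):
            print(chunk)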
@@ -4,5 +4,27 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .agentic_system import get_provider_impl  # noqa
from .config import AgenticSystemConfig  # noqa
from typing import Dict

from llama_toolchain.core.datatypes import Api, ProviderSpec

from .config import MetaReferenceImplConfig


async def get_provider_impl(
    config: MetaReferenceImplConfig, deps: Dict[Api, ProviderSpec]
):
    from .agentic_system import MetaReferenceAgenticSystemImpl

    assert isinstance(
        config, MetaReferenceImplConfig
    ), f"Unexpected config type: {type(config)}"

    impl = MetaReferenceAgenticSystemImpl(
        config,
        deps[Api.inference],
        deps[Api.memory],
        deps[Api.safety],
    )
    await impl.initialize()
    return impl
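The provider entry point above receives its cross-API dependencies as a dict keyed by Api. A minimal sketch of how a stack runtime might wire it up (the three `*_impl` objects are hypothetical stand-ins; only get_provider_impl, Api, and MetaReferenceImplConfig come from this diff):

    # Hypothetical wiring: inference_impl / memory_impl / safety_impl stand in
    # for whatever providers the distribution resolved for those APIs.
    deps = {
        Api.inference: inference_impl,
        Api.memory: memory_impl,
        Api.safety: safety_impl,
    }
    agentic_impl = await get_provider_impl(MetaReferenceImplConfig(), deps)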
@@ -4,111 +4,111 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.


import asyncio
import copy
import os
import secrets
import shutil
import string
import tempfile
import uuid
from datetime import datetime
from typing import AsyncGenerator, List, Optional
from typing import AsyncGenerator, List, Tuple
from urllib.parse import urlparse

import httpx

from termcolor import cprint

from llama_toolchain.agentic_system.api.datatypes import (
    AgenticSystemInstanceConfig,
    AgenticSystemTurnResponseEvent,
    AgenticSystemTurnResponseEventType,
    AgenticSystemTurnResponseStepCompletePayload,
    AgenticSystemTurnResponseStepProgressPayload,
    AgenticSystemTurnResponseStepStartPayload,
    AgenticSystemTurnResponseTurnCompletePayload,
    AgenticSystemTurnResponseTurnStartPayload,
    InferenceStep,
    Session,
    ShieldCallStep,
    StepType,
    ToolExecutionStep,
    ToolPromptFormat,
    Turn,
)
from llama_toolchain.agentic_system.api import *  # noqa: F403
from llama_toolchain.inference.api import *  # noqa: F403
from llama_toolchain.memory.api import *  # noqa: F403
from llama_toolchain.safety.api import *  # noqa: F403

from llama_toolchain.inference.api import ChatCompletionRequest, Inference

from llama_toolchain.inference.api.datatypes import (
    Attachment,
    BuiltinTool,
    ChatCompletionResponseEventType,
    CompletionMessage,
    Message,
    Role,
    SamplingParams,
    StopReason,
    ToolCallDelta,
    ToolCallParseStatus,
    ToolDefinition,
    ToolResponse,
    ToolResponseMessage,
    URL,
from llama_toolchain.tools.base import BaseTool
from llama_toolchain.tools.builtin import (
    interpret_content_as_attachment,
    SingleMessageBuiltinTool,
)
from llama_toolchain.safety.api import Safety
from llama_toolchain.safety.api.datatypes import (
    BuiltinShield,
    ShieldDefinition,
    ShieldResponse,
)
from llama_toolchain.agentic_system.api.endpoints import *  # noqa

from .safety import SafetyException, ShieldRunnerMixin
from .system_prompt import get_agentic_prefix_messages
from .tools.base import BaseTool
from .tools.builtin import SingleMessageBuiltinTool


class AgentInstance(ShieldRunnerMixin):
    def __init__(
        self,
        system_id: int,
        instance_config: AgenticSystemInstanceConfig,
        model: str,
        inference_api: Inference,
        safety_api: Safety,
        builtin_tools: List[SingleMessageBuiltinTool],
        custom_tool_definitions: List[ToolDefinition],
        input_shields: List[ShieldDefinition],
        output_shields: List[ShieldDefinition],
        max_infer_iters: int = 10,
        prefix_messages: Optional[List[Message]] = None,
        tool_prompt_format: Optional[ToolPromptFormat] = ToolPromptFormat.json,
    ):
        self.system_id = system_id
        self.instance_config = instance_config

        self.model = model
        self.inference_api = inference_api
        self.safety_api = safety_api

        if prefix_messages is not None and len(prefix_messages) > 0:
            self.prefix_messages = prefix_messages
        else:
            self.prefix_messages = get_agentic_prefix_messages(
                builtin_tools,
                custom_tool_definitions,
                tool_prompt_format,
def make_random_string(length: int = 8):
    return "".join(
        secrets.choice(string.ascii_letters + string.digits) for _ in range(length)
    )

        for m in self.prefix_messages:
            print(m.content)

class ChatAgent(ShieldRunnerMixin):
    def __init__(
        self,
        agent_config: AgentConfig,
        inference_api: Inference,
        memory_api: Memory,
        safety_api: Safety,
        builtin_tools: List[SingleMessageBuiltinTool],
        max_infer_iters: int = 10,
    ):
        self.agent_config = agent_config
        self.inference_api = inference_api
        self.memory_api = memory_api
        self.safety_api = safety_api

        self.max_infer_iters = max_infer_iters
        self.tools_dict = {t.get_name(): t for t in builtin_tools}

        self.tempdir = tempfile.mkdtemp()
        self.sessions = {}

        ShieldRunnerMixin.__init__(
            self,
            safety_api,
            input_shields=input_shields,
            output_shields=output_shields,
            input_shields=agent_config.input_shields,
            output_shields=agent_config.output_shields,
        )

    def __del__(self):
        shutil.rmtree(self.tempdir)

    def turn_to_messages(self, turn: Turn) -> List[Message]:
        messages = []

        # We do not want to keep adding RAG context to the input messages
        # Maybe this should be a parameter of the agentic instance
        # that can define its behavior in a custom way
        for m in turn.input_messages:
            msg = m.copy()
            if isinstance(msg, UserMessage):
                msg.context = None
            messages.append(msg)

        # messages.extend(turn.input_messages)
        for step in turn.steps:
            if step.step_type == StepType.inference.value:
                messages.append(step.model_response)
            elif step.step_type == StepType.tool_execution.value:
                for response in step.tool_responses:
                    messages.append(
                        ToolResponseMessage(
                            call_id=response.call_id,
                            tool_name=response.tool_name,
                            content=response.content,
                        )
                    )
            elif step.step_type == StepType.shield_call.value:
                response = step.response
                if response.is_violation:
                    # CompletionMessage itself in the ShieldResponse
                    messages.append(
                        CompletionMessage(
                            content=response.violation_return_message,
                            stop_reason=StopReason.end_of_turn,
                        )
                    )
        # print_dialog(messages)
        return messages

    def create_session(self, name: str) -> Session:
        session_id = str(uuid.uuid4())
        session = Session(
@@ -131,32 +131,7 @@ class AgentInstance(ShieldRunnerMixin):

        messages = []
        for i, turn in enumerate(session.turns):
            # print(f"turn {i}")
            # print_dialog(turn.input_messages)
            messages.extend(turn.input_messages)
            for step in turn.steps:
                if step.step_type == StepType.inference.value:
                    messages.append(step.model_response)
                elif step.step_type == StepType.tool_execution.value:
                    for response in step.tool_responses:
                        messages.append(
                            ToolResponseMessage(
                                call_id=response.call_id,
                                tool_name=response.tool_name,
                                content=response.content,
                            )
                        )
                elif step.step_type == StepType.shield_call.value:
                    response = step.response
                    if response.is_violation:
                        # TODO: Properly persist the
                        # CompletionMessage itself in the ShieldResponse
                        messages.append(
                            CompletionMessage(
                                content=response.violation_return_message,
                                stop_reason=StopReason.end_of_turn,
                            )
                        )
            messages.extend(self.turn_to_messages(turn))

        messages.extend(request.messages)
@@ -164,7 +139,6 @@ class AgentInstance(ShieldRunnerMixin):
        # print_dialog(messages)

        turn_id = str(uuid.uuid4())
        params = self.instance_config.sampling_params
        start_time = datetime.now()
        yield AgenticSystemTurnResponseStreamChunk(
            event=AgenticSystemTurnResponseEvent(
@@ -177,12 +151,12 @@ class AgentInstance(ShieldRunnerMixin):
        steps = []
        output_message = None
        async for chunk in self.run(
            session=session,
            turn_id=turn_id,
            input_messages=messages,
            temperature=params.temperature,
            top_p=params.top_p,
            attachments=request.attachments or [],
            sampling_params=self.agent_config.sampling_params,
            stream=request.stream,
            max_gen_len=params.max_tokens,
        ):
            if isinstance(chunk, CompletionMessage):
                cprint(
@@ -227,6 +201,53 @@ class AgentInstance(ShieldRunnerMixin):
            )
        yield chunk

    async def run(
        self,
        session: Session,
        turn_id: str,
        input_messages: List[Message],
        attachments: List[Attachment],
        sampling_params: SamplingParams,
        stream: bool = False,
    ) -> AsyncGenerator:
        # Doing async generators makes downstream code much simpler and everything amenable to
        # streaming. However, it also makes things complicated here because AsyncGenerators cannot
        # return a "final value" for the `yield from` statement. we simulate that by yielding a
        # final boolean (to see whether an exception happened) and then explicitly testing for it.

        async for res in self.run_shields_wrapper(
            turn_id, input_messages, self.input_shields, "user-input"
        ):
            if isinstance(res, bool):
                return
            else:
                yield res

        async for res in self._run(
            session, turn_id, input_messages, attachments, sampling_params, stream
        ):
            if isinstance(res, bool):
                return
            elif isinstance(res, CompletionMessage):
                final_response = res
                break
            else:
                yield res

        assert final_response is not None
        # for output shields run on the full input and output combination
        messages = input_messages + [final_response]

        async for res in self.run_shields_wrapper(
            turn_id, messages, self.output_shields, "assistant-output"
        ):
            if isinstance(res, bool):
                return
            else:
                yield res

        yield final_response

    async def run_shields_wrapper(
        self,
        turn_id: str,
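The comment inside run() describes the pattern this code leans on: an async generator cannot return a value to its caller, so a sentinel (here a boolean) is yielded last and the consumer type-checks for it. A self-contained sketch of the idea (the names are illustrative, not from this diff):

    import asyncio

    async def producer(xs):
        # Yield real items; end with a bool sentinel instead of a return value.
        for x in xs:
            if x < 0:
                yield False  # sentinel: an exception/violation happened
                return
            yield x
        yield True  # sentinel: completed cleanly

    async def consumer():
        async for res in producer([1, 2, -1]):
            if isinstance(res, bool):
                print("finished, ok =", res)
                return
            print("item:", res)

    asyncio.run(consumer())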
@@ -288,65 +309,62 @@ class AgentInstance(ShieldRunnerMixin):
            )
        )

    async def run(
        self,
        turn_id: str,
        input_messages: List[Message],
        temperature: float,
        top_p: float,
        stream: bool = False,
        max_gen_len: Optional[int] = None,
    ) -> AsyncGenerator:
        # Doing async generators makes downstream code much simpler and everything amenable to
        # streaming. However, it also makes things complicated here because AsyncGenerators cannot
        # return a "final value" for the `yield from` statement. we simulate that by yielding a
        # final boolean (to see whether an exception happened) and then explicitly testing for it.

        async for res in self.run_shields_wrapper(
            turn_id, input_messages, self.input_shields, "user-input"
        ):
            if isinstance(res, bool):
                return
            else:
                yield res

        async for res in self._run(
            turn_id, input_messages, temperature, top_p, stream, max_gen_len
        ):
            if isinstance(res, bool):
                return
            elif isinstance(res, CompletionMessage):
                final_response = res
                break
            else:
                yield res

        assert final_response is not None
        # for output shields run on the full input and output combination
        messages = input_messages + [final_response]

        async for res in self.run_shields_wrapper(
            turn_id, messages, self.output_shields, "assistant-output"
        ):
            if isinstance(res, bool):
                return
            else:
                yield res

        yield final_response

    async def _run(
        self,
        session: Session,
        turn_id: str,
        input_messages: List[Message],
        temperature: float,
        top_p: float,
        attachments: List[Attachment],
        sampling_params: SamplingParams,
        stream: bool = False,
        max_gen_len: Optional[int] = None,
    ) -> AsyncGenerator:
        input_messages = preprocess_dialog(input_messages, self.prefix_messages)
        enabled_tools = set(t.type for t in self.agent_config.tools)
        need_rag_context = await self._should_retrieve_context(
            input_messages, attachments
        )
        if need_rag_context:
            step_id = str(uuid.uuid4())
            yield AgenticSystemTurnResponseStreamChunk(
                event=AgenticSystemTurnResponseEvent(
                    payload=AgenticSystemTurnResponseStepStartPayload(
                        step_type=StepType.memory_retrieval.value,
                        step_id=step_id,
                    )
                )
            )

            attachments = []
            # TODO: find older context from the session and either replace it
            # or append with a sliding window. this is really a very simplistic implementation
            rag_context, bank_ids = await self._retrieve_context(
                session, input_messages, attachments
            )

            step_id = str(uuid.uuid4())
            yield AgenticSystemTurnResponseStreamChunk(
                event=AgenticSystemTurnResponseEvent(
                    payload=AgenticSystemTurnResponseStepCompletePayload(
                        step_type=StepType.memory_retrieval.value,
                        step_id=step_id,
                        step_details=MemoryRetrievalStep(
                            turn_id=turn_id,
                            step_id=step_id,
                            memory_bank_ids=bank_ids,
                            inserted_context=rag_context or "",
                        ),
                    )
                )
            )

            if rag_context:
                last_message = input_messages[-1]
                last_message.context = "\n".join(rag_context)

        elif attachments and AgenticSystemTool.code_interpreter.value in enabled_tools:
            urls = [a.content for a in attachments if isinstance(a.content, URL)]
            msg = await attachment_message(self.tempdir, urls)
            input_messages.append(msg)

        output_attachments = []

        n_iter = 0
        while True:
@@ -369,17 +387,13 @@ class AgentInstance(ShieldRunnerMixin):
            )
        )

        # where are the available tools?
        req = ChatCompletionRequest(
            model=self.model,
            model=self.agent_config.model,
            messages=input_messages,
            available_tools=self.instance_config.available_tools,
            tools=self._get_tools(),
            tool_prompt_format=self.agent_config.tool_prompt_format,
            stream=True,
            sampling_params=SamplingParams(
                temperature=temperature,
                top_p=top_p,
                max_tokens=max_gen_len,
            ),
            sampling_params=sampling_params,
        )

        tool_calls = []
@@ -464,7 +478,8 @@ class AgentInstance(ShieldRunnerMixin):

            if len(message.tool_calls) == 0:
                if stop_reason == StopReason.end_of_turn:
                    if len(attachments) > 0:
                    # TODO: UPDATE RETURN TYPE TO SEND A TUPLE OF (MESSAGE, ATTACHMENTS)
                    if len(output_attachments) > 0:
                        if isinstance(message.content, list):
                            message.content += attachments
                        else:
@@ -572,63 +587,175 @@ class AgentInstance(ShieldRunnerMixin):
                    yield False
                    return

                if isinstance(result_message.content, Attachment):
                if out_attachment := interpret_content_as_attachment(
                    result_message.content
                ):
                    # NOTE: when we push this message back to the model, the model may ignore the
                    # attached file path etc. since the model is trained to only provide a user message
                    # with the summary. We keep all generated attachments and then attach them to final message
                    attachments.append(result_message.content)
                elif isinstance(result_message.content, list) or isinstance(
                    result_message.content, tuple
                ):
                    for c in result_message.content:
                        if isinstance(c, Attachment):
                            attachments.append(c)
                    output_attachments.append(out_attachment)

            input_messages = input_messages + [message, result_message]

            n_iter += 1

    async def _ensure_memory_bank(self, session: Session) -> MemoryBank:
        if session.memory_bank is None:
            session.memory_bank = await self.memory_api.create_memory_bank(
                name=f"memory_bank_{session.session_id}",
                config=VectorMemoryBankConfig(
                    embedding_model="sentence-transformer/all-MiniLM-L6-v2",
                    chunk_size_in_tokens=512,
                ),
            )

def attachment_message(url: URL) -> ToolResponseMessage:
        return session.memory_bank

    async def _should_retrieve_context(
        self, messages: List[Message], attachments: List[Attachment]
    ) -> bool:
        enabled_tools = set(t.type for t in self.agent_config.tools)
        if attachments:
            if (
                AgenticSystemTool.code_interpreter.value in enabled_tools
                and self.agent_config.tool_choice == ToolChoice.required
            ):
                return False
            else:
                return True

        return AgenticSystemTool.memory.value in enabled_tools

    def _memory_tool_definition(self) -> Optional[MemoryToolDefinition]:
        for t in self.agent_config.tools:
            if t.type == AgenticSystemTool.memory.value:
                return t

        return None

    async def _retrieve_context(
        self, session: Session, messages: List[Message], attachments: List[Attachment]
    ) -> Tuple[List[str], List[int]]:  # (rag_context, bank_ids)
        bank_ids = []

        memory = self._memory_tool_definition()
        assert memory is not None, "Memory tool not configured"
        bank_ids.extend(c.bank_id for c in memory.memory_bank_configs)

        if attachments:
            bank = await self._ensure_memory_bank(session)
            bank_ids.append(bank.bank_id)

            documents = [
                MemoryBankDocument(
                    document_id=str(uuid.uuid4()),
                    content=a.content,
                    mime_type=a.mime_type,
                    metadata={},
                )
                for a in attachments
            ]
            await self.memory_api.insert_documents(bank.bank_id, documents)
        elif session.memory_bank:
            bank_ids.append(session.memory_bank.bank_id)

        if not bank_ids:
            # this can happen if the per-session memory bank is not yet populated
            # (i.e., no prior turns uploaded an Attachment)
            return None, []

        query = " ".join(m.content for m in messages)
        tasks = [
            self.memory_api.query_documents(
                bank_id=bank_id,
                query=query,
                params={
                    "max_chunks": 5,
                },
            )
            for bank_id in bank_ids
        ]
        results: List[QueryDocumentsResponse] = await asyncio.gather(*tasks)
        chunks = [c for r in results for c in r.chunks]
        scores = [s for r in results for s in r.scores]

        # sort by score
        chunks, scores = zip(
            *sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
        )
        if not chunks:
            return None, bank_ids

        tokens = 0
        picked = []
        for c in chunks[: memory.max_chunks]:
            tokens += c.token_count
            if tokens > memory.max_tokens_in_context:
                cprint(
                    f"Using {len(picked)} chunks; reached max tokens in context: {tokens}",
                    "red",
                )
                break
            picked.append(f"id:{c.document_id}; content:{c.content}")

        return [
            "Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n",
            *picked,
            "\n=== END-RETRIEVED-CONTEXT ===\n",
        ], bank_ids

    def _get_tools(self) -> List[ToolDefinition]:
        ret = []
        for t in self.agent_config.tools:
            if isinstance(t, BraveSearchToolDefinition):
                ret.append(ToolDefinition(tool_name=BuiltinTool.brave_search))
            elif isinstance(t, WolframAlphaToolDefinition):
                ret.append(ToolDefinition(tool_name=BuiltinTool.wolfram_alpha))
            elif isinstance(t, PhotogenToolDefinition):
                ret.append(ToolDefinition(tool_name=BuiltinTool.photogen))
            elif isinstance(t, CodeInterpreterToolDefinition):
                ret.append(ToolDefinition(tool_name=BuiltinTool.code_interpreter))
            elif isinstance(t, FunctionCallToolDefinition):
                ret.append(
                    ToolDefinition(
                        tool_name=t.function_name,
                        description=t.description,
                        parameters=t.parameters,
                    )
                )
        return ret


async def attachment_message(tempdir: str, urls: List[URL]) -> ToolResponseMessage:
    content = []

    for url in urls:
        uri = url.uri
        assert uri.startswith("file://")
        if uri.startswith("file://"):
            filepath = uri[len("file://") :]
        elif uri.startswith("http"):
            path = urlparse(uri).path
            basename = os.path.basename(path)
            filepath = f"{tempdir}/{make_random_string() + basename}"
            print(f"Downloading {url} -> {filepath}")

            async with httpx.AsyncClient() as client:
                r = await client.get(uri)
                resp = r.text
                with open(filepath, "w") as fp:
                    fp.write(resp)
        else:
            raise ValueError(f"Unsupported URL {url}")

        content.append(f'# There is a file accessible to you at "{filepath}"\n')

    return ToolResponseMessage(
        call_id="",
        tool_name=BuiltinTool.code_interpreter,
        content=f'# There is a file accessible to you at "{filepath}"',
        content=content,
    )


def preprocess_dialog(
    messages: List[Message], prefix_messages: List[Message]
) -> List[Message]:
    """
    Preprocesses the dialog by removing the system message and
    adding the system message to the beginning of the dialog.
    """
    ret = prefix_messages.copy()

    for m in messages:
        if m.role == Role.system.value:
            continue

        # NOTE: the ideal behavior is to use `file_path = ...` but that
        # means we need to have stateful execution of code which we currently
        # do not have.
        if isinstance(m.content, Attachment):
            ret.append(attachment_message(m.content.url))
        elif isinstance(m.content, list):
            for c in m.content:
                if isinstance(c, Attachment):
                    ret.append(attachment_message(c.url))

        ret.append(m)

    return ret


async def execute_tool_call_maybe(
    tools_dict: Dict[str, BaseTool], messages: List[CompletionMessage]
) -> List[ToolResponseMessage]:
@@ -8,62 +8,42 @@
import logging
import os
import uuid
from typing import AsyncGenerator, Dict
from typing import AsyncGenerator

from llama_toolchain.distribution.datatypes import Api, ProviderSpec
from llama_toolchain.inference.api import Inference
from llama_toolchain.inference.api.datatypes import BuiltinTool
from llama_toolchain.memory.api import Memory
from llama_toolchain.safety.api import Safety
from llama_toolchain.agentic_system.api.endpoints import *  # noqa
from llama_toolchain.agentic_system.api import (
    AgenticSystem,
    AgenticSystemCreateRequest,
    AgenticSystemCreateResponse,
    AgenticSystemSessionCreateRequest,
    AgenticSystemSessionCreateResponse,
    AgenticSystemTurnCreateRequest,
)

from .agent_instance import AgentInstance

from .config import AgenticSystemConfig

from .tools.builtin import (
from llama_toolchain.agentic_system.api import *  # noqa: F403
from llama_toolchain.tools.builtin import (
    BraveSearchTool,
    CodeInterpreterTool,
    PhotogenTool,
    WolframAlphaTool,
)
from .tools.safety import with_safety
from llama_toolchain.tools.safety import with_safety

from .agent_instance import ChatAgent
from .config import MetaReferenceImplConfig


logger = logging.getLogger()
logger.setLevel(logging.INFO)


async def get_provider_impl(config: AgenticSystemConfig, deps: Dict[Api, ProviderSpec]):
    assert isinstance(
        config, AgenticSystemConfig
    ), f"Unexpected config type: {type(config)}"

    impl = MetaReferenceAgenticSystemImpl(
        config,
        deps[Api.inference],
        deps[Api.safety],
    )
    await impl.initialize()
    return impl


AGENT_INSTANCES_BY_ID = {}


class MetaReferenceAgenticSystemImpl(AgenticSystem):
    def __init__(
        self, config: AgenticSystemConfig, inference_api: Inference, safety_api: Safety
        self,
        config: MetaReferenceImplConfig,
        inference_api: Inference,
        memory_api: Memory,
        safety_api: Safety,
    ):
        self.config = config
        self.inference_api = inference_api
        self.memory_api = memory_api
        self.safety_api = safety_api

    async def initialize(self) -> None:
@@ -71,69 +51,61 @@ class MetaReferenceAgenticSystemImpl(AgenticSystem):

    async def create_agentic_system(
        self,
        request: AgenticSystemCreateRequest,
        agent_config: AgentConfig,
    ) -> AgenticSystemCreateResponse:
        system_id = str(uuid.uuid4())
        agent_id = str(uuid.uuid4())

        builtin_tools = []
        custom_tool_definitions = []
        cfg = request.instance_config
        for dfn in cfg.available_tools:
            if isinstance(dfn.tool_name, BuiltinTool):
                if dfn.tool_name == BuiltinTool.wolfram_alpha:
        for tool_defn in agent_config.tools:
            if isinstance(tool_defn, WolframAlphaToolDefinition):
                key = self.config.wolfram_api_key
                if not key:
                    raise ValueError("Wolfram API key not defined in config")
                tool = WolframAlphaTool(key)
            elif dfn.tool_name == BuiltinTool.brave_search:
            elif isinstance(tool_defn, BraveSearchToolDefinition):
                key = self.config.brave_search_api_key
                if not key:
                    raise ValueError("Brave API key not defined in config")
                tool = BraveSearchTool(key)
            elif dfn.tool_name == BuiltinTool.code_interpreter:
            elif isinstance(tool_defn, CodeInterpreterToolDefinition):
                tool = CodeInterpreterTool()
            elif dfn.tool_name == BuiltinTool.photogen:
            elif isinstance(tool_defn, PhotogenToolDefinition):
                tool = PhotogenTool(
                    dump_dir="/tmp/photogen_dump_" + os.environ["USER"],
                )
            else:
                raise ValueError(f"Unknown builtin tool: {dfn.tool_name}")
                continue

            builtin_tools.append(
                with_safety(
                    tool, self.safety_api, dfn.input_shields, dfn.output_shields
                    tool,
                    self.safety_api,
                    tool_defn.input_shields,
                    tool_defn.output_shields,
                )
            )
            else:
                custom_tool_definitions.append(dfn)

        AGENT_INSTANCES_BY_ID[system_id] = AgentInstance(
            system_id=system_id,
            instance_config=request.instance_config,
            model=request.model,
        AGENT_INSTANCES_BY_ID[agent_id] = ChatAgent(
            agent_config=agent_config,
            inference_api=self.inference_api,
            builtin_tools=builtin_tools,
            custom_tool_definitions=custom_tool_definitions,
            safety_api=self.safety_api,
            input_shields=cfg.input_shields,
            output_shields=cfg.output_shields,
            prefix_messages=cfg.debug_prefix_messages,
            tool_prompt_format=cfg.tool_prompt_format,
            memory_api=self.memory_api,
            builtin_tools=builtin_tools,
        )

        return AgenticSystemCreateResponse(
            system_id=system_id,
            agent_id=agent_id,
        )

    async def create_agentic_system_session(
        self,
        request: AgenticSystemSessionCreateRequest,
        agent_id: str,
        session_name: str,
    ) -> AgenticSystemSessionCreateResponse:
        system_id = request.system_id
        assert system_id in AGENT_INSTANCES_BY_ID, f"System {system_id} not found"
        agent = AGENT_INSTANCES_BY_ID[system_id]
        assert agent_id in AGENT_INSTANCES_BY_ID, f"System {agent_id} not found"
        agent = AGENT_INSTANCES_BY_ID[agent_id]

        session = agent.create_session(request.session_name)
        session = agent.create_session(session_name)
        return AgenticSystemSessionCreateResponse(
            session_id=session.session_id,
        )
@@ -142,9 +114,9 @@ class MetaReferenceAgenticSystemImpl(AgenticSystem):
        self,
        request: AgenticSystemTurnCreateRequest,
    ) -> AsyncGenerator:
        system_id = request.system_id
        assert system_id in AGENT_INSTANCES_BY_ID, f"System {system_id} not found"
        agent = AGENT_INSTANCES_BY_ID[system_id]
        agent_id = request.agent_id
        assert agent_id in AGENT_INSTANCES_BY_ID, f"System {agent_id} not found"
        agent = AGENT_INSTANCES_BY_ID[agent_id]

        assert (
            request.session_id in agent.sessions
@@ -9,6 +9,6 @@ from typing import Optional
from pydantic import BaseModel


class AgenticSystemConfig(BaseModel):
class MetaReferenceImplConfig(BaseModel):
    brave_search_api_key: Optional[str] = None
    wolfram_api_key: Optional[str] = None
@@ -9,12 +9,13 @@ from typing import List
from llama_models.llama3.api.datatypes import Message, Role, UserMessage
from termcolor import cprint

from llama_toolchain.safety.api.datatypes import (
from llama_toolchain.safety.api import (
    OnViolationAction,
    RunShieldRequest,
    Safety,
    ShieldDefinition,
    ShieldResponse,
)
from llama_toolchain.safety.api.endpoints import RunShieldRequest, Safety


class SafetyException(Exception):  # noqa: N818
@@ -1,180 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import json
import textwrap
from datetime import datetime
from typing import List

from llama_toolchain.agentic_system.api.datatypes import ToolPromptFormat

from llama_toolchain.inference.api import (
    BuiltinTool,
    Message,
    SystemMessage,
    ToolDefinition,
    UserMessage,
)

from .tools.builtin import SingleMessageBuiltinTool


def get_agentic_prefix_messages(
    builtin_tools: List[SingleMessageBuiltinTool],
    custom_tools: List[ToolDefinition],
    tool_prompt_format: ToolPromptFormat,
) -> List[Message]:
    messages = []
    content = ""
    if builtin_tools:
        content += "Environment: ipython\n"

        tool_str = ", ".join(
            [
                t.get_name()
                for t in builtin_tools
                if t.get_name() != BuiltinTool.code_interpreter.value
            ]
        )
        if tool_str:
            content += f"Tools: {tool_str}"

    current_date = datetime.now()
    formatted_date = current_date.strftime("%d %B %Y")
    date_str = f"""
Cutting Knowledge Date: December 2023
Today Date: {formatted_date}\n"""
    content += date_str
    messages.append(SystemMessage(content=content))

    if custom_tools:
        if tool_prompt_format == ToolPromptFormat.function_tag:
            text = prompt_for_function_tag(custom_tools)
            messages.append(UserMessage(content=text))
        elif tool_prompt_format == ToolPromptFormat.json:
            text = prompt_for_json(custom_tools)
            messages.append(UserMessage(content=text))
        else:
            raise NotImplementedError(
                f"Tool prompt format {tool_prompt_format} is not supported"
            )
    else:
        messages.append(SystemMessage(content=content))

    return messages


def prompt_for_json(custom_tools: List[ToolDefinition]) -> str:
    tool_defs = "\n".join(
        translate_custom_tool_definition_to_json(t) for t in custom_tools
    )
    content = textwrap.dedent(
        """
        Answer the user's question by making use of the following functions if needed.
        If none of the function can be used, please say so.
        Here is a list of functions in JSON format:
        {tool_defs}

        Return function calls in JSON format.
        """
    )
    content = content.lstrip("\n").format(tool_defs=tool_defs)
    return content


def prompt_for_function_tag(custom_tools: List[ToolDefinition]) -> str:
    custom_tool_params = ""
    for t in custom_tools:
        custom_tool_params += get_instruction_string(t) + "\n"
        custom_tool_params += get_parameters_string(t) + "\n\n"

    content = f"""
You have access to the following functions:

{custom_tool_params}
Think very carefully before calling functions.
If you choose to call a function ONLY reply in the following format with no prefix or suffix:

<function=example_function_name>{{"example_name": "example_value"}}</function>

Reminder:
- If looking for real time information use relevant functions before falling back to brave_search
- Function calls MUST follow the specified format, start with <function= and end with </function>
- Required parameters MUST be specified
- Only call one function at a time
- Put the entire function call reply on one line
"""
    return content


def get_instruction_string(custom_tool_definition) -> str:
    return f"Use the function '{custom_tool_definition.tool_name}' to '{custom_tool_definition.description}'"


def get_parameters_string(custom_tool_definition) -> str:
    return json.dumps(
        {
            "name": custom_tool_definition.tool_name,
            "description": custom_tool_definition.description,
            "parameters": {
                name: definition.__dict__
                for name, definition in custom_tool_definition.parameters.items()
            },
        }
    )


def translate_custom_tool_definition_to_json(tool_def):
    """Translates ToolDefinition to json as expected by model
    eg. output for a function
    {
        "type": "function",
        "function": {
            "name": "conv_int",
            "description": "Convert serialized fract24 integer into int value.",
            "parameters": {
                "type": "object",
                "properties": [
                    {
                        "data": {
                            "type": "object",
                            "description": ""
                        }
                    }
                ],
                "required": ["data"]
            }
        }
    }
    """
    assert isinstance(tool_def.tool_name, str)
    func_def = {"type": "function", "function": {}}
    func_def["function"]["name"] = tool_def.tool_name
    func_def["function"]["description"] = tool_def.description or ""
    if tool_def.parameters:
        required = []
        properties = []
        for p_name, p_def in tool_def.parameters.items():
            properties.append(
                {
                    p_name: {
                        # TODO: see if this should not always be object
                        "type": "object",
                        "description": p_def.description or "",
                    }
                }
            )
            if p_def.required:
                required.append(p_name)
        func_def["function"]["parameters"] = {
            "type": "object",
            "properties": properties,
            "required": required,
        }
    else:
        func_def["function"]["parameters"] = {}

    return json.dumps(func_def, indent=4)
@@ -6,7 +6,7 @@

from typing import List

from llama_toolchain.distribution.datatypes import Api, InlineProviderSpec, ProviderSpec
from llama_toolchain.core.datatypes import Api, InlineProviderSpec, ProviderSpec


def available_agentic_system_providers() -> List[ProviderSpec]:
@@ -16,15 +16,19 @@ def available_agentic_system_providers() -> List[ProviderSpec]:
            provider_id="meta-reference",
            pip_packages=[
                "codeshield",
                "matplotlib",
                "pillow",
                "pandas",
                "scikit-learn",
                "torch",
                "transformers",
            ],
            module="llama_toolchain.agentic_system.meta_reference",
            config_class="llama_toolchain.agentic_system.meta_reference.AgenticSystemConfig",
            config_class="llama_toolchain.agentic_system.meta_reference.MetaReferenceImplConfig",
            api_dependencies=[
                Api.inference,
                Api.safety,
                Api.memory,
            ],
        ),
    ]
@@ -1,83 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import Any, AsyncGenerator, List

from llama_models.llama3.api.datatypes import StopReason, ToolResponseMessage

from llama_toolchain.agentic_system.api import (
    AgenticSystem,
    AgenticSystemTurnCreateRequest,
    AgenticSystemTurnResponseEventType as EventType,
)

from llama_toolchain.inference.api import Message


async def execute_with_custom_tools(
    system: AgenticSystem,
    system_id: str,
    session_id: str,
    messages: List[Message],
    custom_tools: List[Any],
    max_iters: int = 5,
    stream: bool = True,
) -> AsyncGenerator:
    # first create a session, or do you keep a persistent session?
    tools_dict = {t.get_name(): t for t in custom_tools}

    current_messages = messages.copy()
    n_iter = 0
    while n_iter < max_iters:
        n_iter += 1

        request = AgenticSystemTurnCreateRequest(
            system_id=system_id,
            session_id=session_id,
            messages=current_messages,
            stream=stream,
        )

        turn = None
        async for chunk in system.create_agentic_system_turn(request):
            if chunk.event.payload.event_type != EventType.turn_complete.value:
                yield chunk
            else:
                turn = chunk.event.payload.turn

        message = turn.output_message
        if len(message.tool_calls) == 0:
            yield chunk
            return

        if message.stop_reason == StopReason.out_of_tokens:
            yield chunk
            return

        tool_call = message.tool_calls[0]
        if tool_call.tool_name not in tools_dict:
            m = ToolResponseMessage(
                call_id=tool_call.call_id,
                tool_name=tool_call.tool_name,
                content=f"Unknown tool `{tool_call.tool_name}` was called. Try again with something else",
            )
            next_message = m
        else:
            tool = tools_dict[tool_call.tool_name]
            result_messages = await execute_custom_tool(tool, message)
            next_message = result_messages[0]

        yield next_message
        current_messages = [next_message]


async def execute_custom_tool(tool: Any, message: Message) -> List[Message]:
    result_messages = await tool.run([message])
    assert (
        len(result_messages) == 1
    ), f"Expected single message, got {len(result_messages)}"

    return result_messages
@@ -1,122 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import uuid
from typing import Any, List, Optional

from llama_models.llama3.api.datatypes import BuiltinTool, Message, SamplingParams

from llama_toolchain.agentic_system.api import (
    AgenticSystemCreateRequest,
    AgenticSystemInstanceConfig,
    AgenticSystemSessionCreateRequest,
    AgenticSystemToolDefinition,
)
from llama_toolchain.agentic_system.api.datatypes import ToolPromptFormat
from llama_toolchain.agentic_system.client import AgenticSystemClient

from llama_toolchain.agentic_system.tools.custom.execute import (
    execute_with_custom_tools,
)
from llama_toolchain.safety.api.datatypes import BuiltinShield, ShieldDefinition


# TODO: this should move back to the llama-agentic-system repo


class AgenticSystemClientWrapper:
    def __init__(self, api, system_id, custom_tools):
        self.api = api
        self.system_id = system_id
        self.custom_tools = custom_tools
        self.session_id = None

    async def create_session(self, name: str = None):
        if name is None:
            name = f"Session-{uuid.uuid4()}"

        response = await self.api.create_agentic_system_session(
            AgenticSystemSessionCreateRequest(
                system_id=self.system_id,
                session_name=name,
            )
        )
        self.session_id = response.session_id
        return self.session_id

    async def run(self, messages: List[Message], stream: bool = True):
        async for chunk in execute_with_custom_tools(
            self.api,
            self.system_id,
            self.session_id,
            messages,
            self.custom_tools,
            stream=stream,
        ):
            yield chunk


async def get_agent_system_instance(
    host: str,
    port: int,
    custom_tools: Optional[List[Any]] = None,
    disable_safety: bool = False,
    model: str = "Meta-Llama3.1-8B-Instruct",
    tool_prompt_format: ToolPromptFormat = ToolPromptFormat.json,
) -> AgenticSystemClientWrapper:
    custom_tools = custom_tools or []

    api = AgenticSystemClient(base_url=f"http://{host}:{port}")

    tool_definitions = [
        AgenticSystemToolDefinition(
            tool_name=BuiltinTool.brave_search,
        ),
        AgenticSystemToolDefinition(
            tool_name=BuiltinTool.wolfram_alpha,
        ),
        AgenticSystemToolDefinition(
            tool_name=BuiltinTool.photogen,
        ),
        AgenticSystemToolDefinition(
            tool_name=BuiltinTool.code_interpreter,
        ),
    ] + [t.get_tool_definition() for t in custom_tools]

    if not disable_safety:
        for t in tool_definitions:
            t.input_shields = [ShieldDefinition(shield_type=BuiltinShield.llama_guard)]
            t.output_shields = [
                ShieldDefinition(shield_type=BuiltinShield.llama_guard),
                ShieldDefinition(shield_type=BuiltinShield.injection_shield),
            ]

    create_request = AgenticSystemCreateRequest(
        model=model,
        instance_config=AgenticSystemInstanceConfig(
            instructions="You are a helpful assistant",
            available_tools=tool_definitions,
            input_shields=(
                []
                if disable_safety
                else [
                    ShieldDefinition(shield_type=BuiltinShield.llama_guard),
                    ShieldDefinition(shield_type=BuiltinShield.jailbreak_shield),
                ]
            ),
            output_shields=(
                []
                if disable_safety
                else [
                    ShieldDefinition(shield_type=BuiltinShield.llama_guard),
                ]
            ),
            sampling_params=SamplingParams(),
            tool_prompt_format=tool_prompt_format,
        ),
    )
    create_response = await api.create_agentic_system(create_request)
    return AgenticSystemClientWrapper(api, create_response.system_id, custom_tools)
@@ -4,4 +4,4 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .distribution import DistributionParser  # noqa
from .api import *  # noqa: F401 F403
61  llama_toolchain/batch_inference/api/api.py  Normal file
@@ -0,0 +1,61 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import List, Optional, Protocol

from llama_models.schema_utils import json_schema_type, webmethod

from pydantic import BaseModel, Field

from llama_models.llama3.api.datatypes import *  # noqa: F403
from llama_toolchain.inference.api import *  # noqa: F403


@json_schema_type
class BatchCompletionRequest(BaseModel):
    model: str
    content_batch: List[InterleavedTextMedia]
    sampling_params: Optional[SamplingParams] = SamplingParams()
    logprobs: Optional[LogProbConfig] = None


@json_schema_type
class BatchCompletionResponse(BaseModel):
    completion_message_batch: List[CompletionMessage]


@json_schema_type
class BatchChatCompletionRequest(BaseModel):
    model: str
    messages_batch: List[List[Message]]
    sampling_params: Optional[SamplingParams] = SamplingParams()

    # zero-shot tool definitions as input to the model
    tools: Optional[List[ToolDefinition]] = Field(default_factory=list)
    tool_choice: Optional[ToolChoice] = Field(default=ToolChoice.auto)
    tool_prompt_format: Optional[ToolPromptFormat] = Field(
        default=ToolPromptFormat.json
    )
    logprobs: Optional[LogProbConfig] = None


@json_schema_type
class BatchChatCompletionResponse(BaseModel):
    completion_message_batch: List[CompletionMessage]


class BatchInference(Protocol):
    @webmethod(route="/batch_inference/completion")
    async def batch_completion(
        self,
        request: BatchCompletionRequest,
    ) -> BatchCompletionResponse: ...

    @webmethod(route="/batch_inference/chat_completion")
    async def batch_chat_completion(
        self,
        request: BatchChatCompletionRequest,
    ) -> BatchChatCompletionResponse: ...
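The protocol above only defines request and response shapes; a provider implements it and a client just builds the pydantic models. A hedged sketch of calling it, where `impl` is any object satisfying BatchInference (the model identifier matches the default used elsewhere in this commit, but is still an assumption here):

    # Illustrative call: `impl` is an assumed BatchInference implementation.
    request = BatchChatCompletionRequest(
        model="Meta-Llama3.1-8B-Instruct",
        messages_batch=[
            [UserMessage(content="Summarize RAG in one sentence.")],
            [UserMessage(content="What does a memory bank store?")],
        ],
    )
    response = await impl.batch_chat_completion(request)
    for msg in response.completion_message_batch:
        print(msg.content)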
@@ -1,106 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse
import json
import shlex

import yaml

from llama_toolchain.cli.subcommand import Subcommand
from llama_toolchain.common.config_dirs import DISTRIBS_BASE_DIR
from termcolor import cprint


class DistributionConfigure(Subcommand):
    """Llama cli for configuring llama toolchain configs"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "configure",
            prog="llama distribution configure",
            description="configure a llama stack distribution",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_distribution_configure_cmd)

    def _add_arguments(self):
        self.parser.add_argument(
            "--name",
            type=str,
            help="Name of the distribution to configure",
            required=True,
        )

    def _run_distribution_configure_cmd(self, args: argparse.Namespace) -> None:
        from llama_toolchain.distribution.datatypes import DistributionConfig
        from llama_toolchain.distribution.registry import resolve_distribution_spec

        config_file = DISTRIBS_BASE_DIR / args.name / "config.yaml"
        if not config_file.exists():
            self.parser.error(
                f"Could not find {config_file}. Please run `llama distribution install` first"
            )
            return

        # we need to find the spec from the name
        with open(config_file, "r") as f:
            config = DistributionConfig(**yaml.safe_load(f))

        dist = resolve_distribution_spec(config.spec)
        if dist is None:
            raise ValueError(f"Could not find any registered spec `{config.spec}`")

        configure_llama_distribution(dist, config)


def configure_llama_distribution(dist: "Distribution", config: "DistributionConfig"):
    from llama_toolchain.common.exec import run_command
    from llama_toolchain.common.prompt_for_config import prompt_for_config
    from llama_toolchain.common.serialize import EnumEncoder
    from llama_toolchain.distribution.dynamic import instantiate_class_type

    python_exe = run_command(shlex.split("which python"))
    # simple check
    conda_env = config.conda_env
    if conda_env not in python_exe:
        raise ValueError(
            f"Please re-run configure by activating the `{conda_env}` conda environment"
        )

    if config.providers:
        cprint(
            f"Configuration already exists for {config.name}. Will overwrite...",
            "yellow",
            attrs=["bold"],
        )

    for api, provider_spec in dist.provider_specs.items():
        cprint(f"Configuring API surface: {api.value}", "white", attrs=["bold"])
        config_type = instantiate_class_type(provider_spec.config_class)
        provider_config = prompt_for_config(
            config_type,
            (
                config_type(**config.providers[api.value])
                if api.value in config.providers
                else None
            ),
        )
        print("")

        config.providers[api.value] = {
            "provider_id": provider_spec.provider_id,
            **provider_config.dict(),
        }

    config_path = DISTRIBS_BASE_DIR / config.name / "config.yaml"
    with open(config_path, "w") as fp:
        dist_config = json.loads(json.dumps(config.dict(), cls=EnumEncoder))
        fp.write(yaml.dump(dist_config, sort_keys=False))

    print(f"YAML configuration has been written to {config_path}")
@@ -1,43 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse

from llama_toolchain.cli.subcommand import Subcommand


class DistributionCreate(Subcommand):
    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "create",
            prog="llama distribution create",
            description="create a Llama stack distribution",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_distribution_create_cmd)

    def _add_arguments(self):
        self.parser.add_argument(
            "--name",
            type=str,
            help="Name of the distribution to create",
            required=True,
        )
        # for each Api the user wants to support, we should
        # get the list of available providers, ask which one the user
        # wants to pick and then ask for their configuration.

    def _run_distribution_create_cmd(self, args: argparse.Namespace) -> None:
        from llama_toolchain.distribution.registry import resolve_distribution_spec

        dist = resolve_distribution_spec(args.name)
        if dist is not None:
            self.parser.error(f"Distribution with name {args.name} already exists")
            return

        raise NotImplementedError()
@@ -1,34 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse

from llama_toolchain.cli.subcommand import Subcommand

from .configure import DistributionConfigure
from .create import DistributionCreate
from .install import DistributionInstall
from .list import DistributionList
from .start import DistributionStart


class DistributionParser(Subcommand):
    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "distribution",
            prog="llama distribution",
            description="Operate on llama stack distributions",
        )

        subparsers = self.parser.add_subparsers(title="distribution_subcommands")

        # Add sub-commands
        DistributionList.create(subparsers)
        DistributionInstall.create(subparsers)
        DistributionCreate.create(subparsers)
        DistributionConfigure.create(subparsers)
        DistributionStart.create(subparsers)
@@ -1,111 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse
import os

import pkg_resources
import yaml

from llama_toolchain.cli.subcommand import Subcommand
from llama_toolchain.common.config_dirs import DISTRIBS_BASE_DIR

from termcolor import cprint


class DistributionInstall(Subcommand):
    """Llama cli for configuring llama toolchain configs"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "install",
            prog="llama distribution install",
            description="Install a llama stack distribution",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_distribution_install_cmd)

    def _add_arguments(self):
        from llama_toolchain.distribution.registry import available_distribution_specs

        self.parser.add_argument(
            "--spec",
            type=str,
            help="Distribution spec to install (try local-ollama)",
            required=True,
            choices=[d.spec_id for d in available_distribution_specs()],
        )
        self.parser.add_argument(
            "--name",
            type=str,
            help="What should the installation be called locally?",
            required=True,
        )
        self.parser.add_argument(
            "--conda-env",
            type=str,
            help="conda env in which this distribution will run (default = distribution name)",
        )

    def _run_distribution_install_cmd(self, args: argparse.Namespace) -> None:
        from llama_toolchain.common.exec import run_with_pty
        from llama_toolchain.distribution.datatypes import DistributionConfig
        from llama_toolchain.distribution.distribution import distribution_dependencies
        from llama_toolchain.distribution.registry import resolve_distribution_spec

        os.makedirs(DISTRIBS_BASE_DIR, exist_ok=True)
        script = pkg_resources.resource_filename(
            "llama_toolchain",
            "distribution/install_distribution.sh",
        )

        dist = resolve_distribution_spec(args.spec)
        if dist is None:
            self.parser.error(f"Could not find distribution {args.spec}")
            return

        distrib_dir = DISTRIBS_BASE_DIR / args.name
        os.makedirs(distrib_dir, exist_ok=True)

        deps = distribution_dependencies(dist)
        if not args.conda_env:
            print(f"Using {args.name} as the Conda environment for this distribution")

        conda_env = args.conda_env or args.name

        config_file = distrib_dir / "config.yaml"
        if config_file.exists():
            c = DistributionConfig(**yaml.safe_load(config_file.read_text()))
            if c.spec != dist.spec_id:
                self.parser.error(
                    f"already installed distribution with `spec={c.spec}` does not match provided spec `{args.spec}`"
                )
                return
            if c.conda_env != conda_env:
                self.parser.error(
                    f"already installed distribution has `conda_env={c.conda_env}` different from provided conda env `{conda_env}`"
                )
                return
        else:
            with open(config_file, "w") as f:
                c = DistributionConfig(
                    spec=dist.spec_id,
                    name=args.name,
                    conda_env=conda_env,
                )
                f.write(yaml.dump(c.dict(), sort_keys=False))

        return_code = run_with_pty([script, conda_env, args.name, " ".join(deps)])

        assert return_code == 0, cprint(
            f"Failed to install distribution {dist.spec_id}", color="red"
        )
        cprint(
            f"Distribution `{args.name}` (with spec {dist.spec_id}) has been installed successfully!",
            color="green",
        )
@@ -1,81 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse

import pkg_resources
import yaml

from llama_toolchain.cli.subcommand import Subcommand
from llama_toolchain.common.config_dirs import DISTRIBS_BASE_DIR


class DistributionStart(Subcommand):
    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "start",
            prog="llama distribution start",
            description="""start the server for a Llama stack distribution. you should have already installed and configured the distribution""",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_distribution_start_cmd)

    def _add_arguments(self):
        self.parser.add_argument(
            "--name",
            type=str,
            help="Name of the distribution to start",
            required=True,
        )
        self.parser.add_argument(
            "--port",
            type=int,
            help="Port to run the server on. Defaults to 5000",
            default=5000,
        )
        self.parser.add_argument(
            "--disable-ipv6",
            action="store_true",
            help="Disable IPv6 support",
            default=False,
        )

    def _run_distribution_start_cmd(self, args: argparse.Namespace) -> None:
        from llama_toolchain.common.exec import run_with_pty
        from llama_toolchain.distribution.registry import resolve_distribution_spec

        config_file = DISTRIBS_BASE_DIR / args.name / "config.yaml"
        if not config_file.exists():
            self.parser.error(
                f"Could not find {config_file}. Please run `llama distribution install` first"
            )
            return

        # we need to find the spec from the name
        with open(config_file, "r") as f:
            config = yaml.safe_load(f)

        dist = resolve_distribution_spec(config["spec"])
        if dist is None:
            raise ValueError(f"Could not find any registered spec `{config['spec']}`")

        conda_env = config["conda_env"]
        if not conda_env:
            raise ValueError(
                f"Could not find Conda environment for distribution `{args.name}`"
            )

        script = pkg_resources.resource_filename(
            "llama_toolchain",
            "distribution/start_distribution.sh",
        )
        args = [script, conda_env, config_file, "--port", str(args.port)] + (
            ["--disable-ipv6"] if args.disable_ipv6 else []
        )

        run_with_pty(args)
@@ -6,9 +6,9 @@
import argparse

from .distribution import DistributionParser
from .download import Download
from .model import ModelParser
from .stack import StackParser


class LlamaCLIParser:
@@ -29,7 +29,7 @@ class LlamaCLIParser:
        # Add sub-commands
        Download.create(subparsers)
        ModelParser.create(subparsers)
        DistributionParser.create(subparsers)
        StackParser.create(subparsers)

        # Import sub-commands from agentic_system if they exist
        try:
@@ -32,6 +32,16 @@ class ModelTemplate(Subcommand):
        self._add_arguments()
        self.parser.set_defaults(func=self._run_model_template_cmd)

    def _prompt_type(self, value):
        from llama_models.llama3.api.datatypes import ToolPromptFormat

        try:
            return ToolPromptFormat(value.lower())
        except ValueError:
            raise argparse.ArgumentTypeError(
                f"{value} is not a valid ToolPromptFormat. Choose from {', '.join(t.value for t in ToolPromptFormat)}"
            ) from None

    def _add_arguments(self):
        self.parser.add_argument(
            "-m",
@@ -46,6 +56,18 @@ class ModelTemplate(Subcommand):
            help="Usecase template name (system_message, user_message, assistant_message, tool_message)...",
            required=False,
        )
        self.parser.add_argument(
            "--format",
            type=str,
            help="ToolPromptFormat (json or function_tag). This flag is used to print the template in a specific format.",
            required=False,
            default="json",
        )
        self.parser.add_argument(
            "--raw",
            action="store_true",
            help="If set to true, don't pretty-print into a table. Useful to copy-paste.",
        )

    def _run_model_template_cmd(self, args: argparse.Namespace) -> None:
        from llama_models.llama3.api.interface import (
@@ -56,22 +78,32 @@ class ModelTemplate(Subcommand):
        from llama_toolchain.cli.table import print_table

        if args.name:
            template, tokens_info = render_jinja_template(args.name)
            tool_prompt_format = self._prompt_type(args.format)
            template, tokens_info = render_jinja_template(args.name, tool_prompt_format)
            rendered = ""
            for tok, is_special in tokens_info:
                if is_special:
                    rendered += colored(tok, "yellow", attrs=["bold"])
                else:
                    rendered += tok
            rendered += "\n"

            if not args.raw:
                rendered = rendered.replace("\n", "↵\n")
                print_table(
                    [
                        ("Name", colored(template.template_name, "white", attrs=["bold"])),
                        (
                            "Name",
                            colored(template.template_name, "white", attrs=["bold"]),
                        ),
                        ("Template", rendered),
                        ("Notes", template.notes),
                    ],
                    separate_rows=True,
                )
            else:
                print("Template: ", template.template_name)
                print("=" * 40)
                print(rendered)
        else:
            templates = list_jinja_templates()
            headers = ["Role", "Template Name"]
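A quick sketch of how the extended template command can be exercised. The long flag name is an assumption (the diff truncates the argument definition at "-m", but args.name suggests it pairs with --name), and the template name is one of the usecases listed in the help text:

# Print the system-message template with function_tag tool formatting;
# --raw skips the pretty-printed table so the output can be copy-pasted.
llama model template --name system_message --format function_tag --raw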
@@ -4,5 +4,4 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .config import OllamaImplConfig  # noqa
from .ollama import get_provider_impl  # noqa
from .stack import StackParser  # noqa
llama_toolchain/cli/stack/build.py (new file, 133 lines)
@@ -0,0 +1,133 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse

from llama_toolchain.cli.subcommand import Subcommand
from llama_toolchain.core.datatypes import *  # noqa: F403


def parse_api_provider_tuples(
    tuples: str, parser: argparse.ArgumentParser
) -> Dict[str, ProviderSpec]:
    from llama_toolchain.core.distribution import api_providers

    all_providers = api_providers()

    deps = {}
    for dep in tuples.split(","):
        dep = dep.strip()
        if not dep:
            continue
        api_str, provider = dep.split("=")
        api = Api(api_str)

        provider = provider.strip()
        if provider not in all_providers[api]:
            parser.error(f"Provider `{provider}` is not available for API `{api}`")
            return
        deps[api] = all_providers[api][provider]

    return deps


class StackBuild(Subcommand):
    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "build",
            prog="llama stack build",
            description="Build a Llama stack container",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_stack_build_command)

    def _add_arguments(self):
        from llama_toolchain.core.distribution_registry import available_distribution_specs
        from llama_toolchain.core.package import (
            BuildType,
        )

        allowed_ids = [d.distribution_id for d in available_distribution_specs()]
        self.parser.add_argument(
            "distribution",
            type=str,
            help="Distribution to build (either \"adhoc\" OR one of: {})".format(allowed_ids),
        )
        self.parser.add_argument(
            "api_providers",
            nargs='?',
            help="Comma separated list of (api=provider) tuples",
        )

        self.parser.add_argument(
            "--name",
            type=str,
            help="Name of the build target (image, conda env)",
            required=True,
        )
        self.parser.add_argument(
            "--type",
            type=str,
            default="conda_env",
            choices=[v.value for v in BuildType],
        )

    def _run_stack_build_command(self, args: argparse.Namespace) -> None:
        from llama_toolchain.core.distribution_registry import resolve_distribution_spec
        from llama_toolchain.core.package import (
            ApiInput,
            BuildType,
            build_package,
        )

        api_inputs = []
        if args.distribution == "adhoc":
            if not args.api_providers:
                self.parser.error("You must specify API providers with (api=provider,...) for building an adhoc distribution")
                return

            parsed = parse_api_provider_tuples(args.api_providers, self.parser)
            for api, provider_spec in parsed.items():
                for dep in provider_spec.api_dependencies:
                    if dep not in parsed:
                        self.parser.error(f"API {api} needs dependency {dep} provided also")
                        return

                api_inputs.append(
                    ApiInput(
                        api=api,
                        provider=provider_spec.provider_id,
                    )
                )
            docker_image = None
        else:
            if args.api_providers:
                self.parser.error("You cannot specify API providers for pre-registered distributions")
                return

            dist = resolve_distribution_spec(args.distribution)
            if dist is None:
                self.parser.error(f"Could not find distribution {args.distribution}")
                return

            for api, provider_id in dist.providers.items():
                api_inputs.append(
                    ApiInput(
                        api=api,
                        provider=provider_id,
                    )
                )
            docker_image = dist.docker_image

        build_package(
            api_inputs,
            build_type=BuildType(args.type),
            name=args.name,
            distribution_id=args.distribution,
            docker_image=docker_image,
        )
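A sketch of both build flavors this subcommand accepts, derived from the argparse definitions above; the build names are illustrative:

# Build a pre-registered distribution into a conda environment:
llama stack build local-ollama --name my-ollama-stack --type conda_env

# Or build an "adhoc" distribution from (api=provider) tuples:
llama stack build adhoc "inference=remote::ollama,safety=meta-reference" --name adhoc-test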
llama_toolchain/cli/stack/configure.py (new file, 106 lines)
@@ -0,0 +1,106 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse
import json
from pathlib import Path

import yaml
from termcolor import cprint

from llama_toolchain.cli.subcommand import Subcommand
from llama_toolchain.common.config_dirs import BUILDS_BASE_DIR
from llama_toolchain.core.datatypes import *  # noqa: F403


class StackConfigure(Subcommand):
    """Llama cli for configuring llama toolchain configs"""

    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "configure",
            prog="llama stack configure",
            description="configure a llama stack distribution",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_stack_configure_cmd)

    def _add_arguments(self):
        from llama_toolchain.core.distribution_registry import (
            available_distribution_specs,
        )
        from llama_toolchain.core.package import BuildType

        allowed_ids = [d.distribution_id for d in available_distribution_specs()]
        self.parser.add_argument(
            "distribution",
            type=str,
            choices=allowed_ids,
            help="Distribution (one of: {})".format(allowed_ids),
        )
        self.parser.add_argument(
            "--name",
            type=str,
            help="Name of the build",
            required=True,
        )
        self.parser.add_argument(
            "--type",
            type=str,
            default="conda_env",
            choices=[v.value for v in BuildType],
        )

    def _run_stack_configure_cmd(self, args: argparse.Namespace) -> None:
        from llama_toolchain.core.package import BuildType

        build_type = BuildType(args.type)
        name = args.name
        config_file = (
            BUILDS_BASE_DIR
            / args.distribution
            / build_type.descriptor()
            / f"{name}.yaml"
        )
        if not config_file.exists():
            self.parser.error(
                f"Could not find {config_file}. Please run `llama stack build` first"
            )
            return

        configure_llama_distribution(config_file)


def configure_llama_distribution(config_file: Path) -> None:
    from llama_toolchain.common.serialize import EnumEncoder
    from llama_toolchain.core.configure import configure_api_providers
    from llama_toolchain.core.distribution_registry import resolve_distribution_spec

    with open(config_file, "r") as f:
        config = PackageConfig(**yaml.safe_load(f))

    dist = resolve_distribution_spec(config.distribution_id)
    if dist is None:
        raise ValueError(
            f"Could not find any registered distribution `{config.distribution_id}`"
        )

    if config.providers:
        cprint(
            f"Configuration already exists for {config.distribution_id}. Will overwrite...",
            "yellow",
            attrs=["bold"],
        )

    config.providers = configure_api_providers(config.providers)

    with open(config_file, "w") as fp:
        to_write = json.loads(json.dumps(config.dict(), cls=EnumEncoder))
        fp.write(yaml.dump(to_write, sort_keys=False))

    print(f"YAML configuration has been written to {config_file}")
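Matching invocation for the configure step; the flags mirror the build command, and the build name below is illustrative:

# Re-run interactive provider configuration for an existing build:
llama stack configure local-ollama --name my-ollama-stack --type conda_env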
@@ -10,13 +10,13 @@ import json
from llama_toolchain.cli.subcommand import Subcommand


class DistributionList(Subcommand):
class StackList(Subcommand):
    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "list",
            prog="llama distribution list",
            description="Show available llama stack distributions",
            "list-distributions",
            prog="llama stack list-distributions",
            description="Show available Llama Stack Distributions",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()

@@ -27,21 +27,23 @@ class DistributionList(Subcommand):

    def _run_distribution_list_cmd(self, args: argparse.Namespace) -> None:
        from llama_toolchain.cli.table import print_table
        from llama_toolchain.distribution.registry import available_distribution_specs
        from llama_toolchain.core.distribution_registry import (
            available_distribution_specs,
        )

        # eventually, this should query a registry at llama.meta.com/llamastack/distributions
        headers = [
            "Spec ID",
            "ProviderSpecs",
            "Distribution ID",
            "Providers",
            "Description",
        ]

        rows = []
        for spec in available_distribution_specs():
            providers = {k.value: v.provider_id for k, v in spec.provider_specs.items()}
            providers = {k.value: v for k, v in spec.providers.items()}
            rows.append(
                [
                    spec.spec_id,
                    spec.distribution_id,
                    json.dumps(providers, indent=2),
                    spec.description,
                ]
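The renamed subcommand is invoked directly, with no arguments:

# Inspect the distributions the CLI knows about (IDs, providers, description):
llama stack list-distributions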
llama_toolchain/cli/stack/run.py (new file, 106 lines)
@@ -0,0 +1,106 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse

from pathlib import Path

import pkg_resources
import yaml

from llama_toolchain.cli.subcommand import Subcommand
from llama_toolchain.core.datatypes import *  # noqa: F403
from llama_toolchain.common.config_dirs import BUILDS_BASE_DIR


class StackRun(Subcommand):
    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "run",
            prog="llama stack run",
            description="""start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.""",
            formatter_class=argparse.RawTextHelpFormatter,
        )
        self._add_arguments()
        self.parser.set_defaults(func=self._run_stack_run_cmd)

    def _add_arguments(self):
        from llama_toolchain.core.package import BuildType

        self.parser.add_argument(
            "distribution",
            type=str,
            help="Distribution whose build you want to start",
        )
        self.parser.add_argument(
            "--name",
            type=str,
            help="Name of the build you want to start",
            required=True,
        )
        self.parser.add_argument(
            "--type",
            type=str,
            default="conda_env",
            choices=[v.value for v in BuildType],
        )
        self.parser.add_argument(
            "--port",
            type=int,
            help="Port to run the server on. Defaults to 5000",
            default=5000,
        )
        self.parser.add_argument(
            "--disable-ipv6",
            action="store_true",
            help="Disable IPv6 support",
            default=False,
        )

    def _run_stack_run_cmd(self, args: argparse.Namespace) -> None:
        from llama_toolchain.common.exec import run_with_pty
        from llama_toolchain.core.package import BuildType

        build_type = BuildType(args.type)
        build_dir = BUILDS_BASE_DIR / args.distribution / build_type.descriptor()
        path = build_dir / f"{args.name}.yaml"

        config_file = Path(path)

        if not config_file.exists():
            self.parser.error(
                f"File {str(config_file)} does not exist. Did you run `llama stack build`?"
            )
            return

        with open(config_file, "r") as f:
            config = PackageConfig(**yaml.safe_load(f))

        if not config.distribution_id:
            raise ValueError("Build config appears to be corrupt.")

        if config.docker_image:
            script = pkg_resources.resource_filename(
                "llama_toolchain",
                "core/start_container.sh",
            )
            run_args = [script, config.docker_image]
        else:
            script = pkg_resources.resource_filename(
                "llama_toolchain",
                "core/start_conda_env.sh",
            )
            run_args = [
                script,
                config.conda_env,
            ]

        run_args.extend([str(config_file), str(args.port)])
        if args.disable_ipv6:
            run_args.append("--disable-ipv6")

        run_with_pty(run_args)
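The run step completes the build/configure/run cycle; again the build name is illustrative:

# Start the server for a previously built and configured stack:
llama stack run local-ollama --name my-ollama-stack --port 5000
# Add --disable-ipv6 on hosts without IPv6 support.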
llama_toolchain/cli/stack/stack.py (new file, 32 lines)
@@ -0,0 +1,32 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import argparse

from llama_toolchain.cli.subcommand import Subcommand

from .build import StackBuild
from .configure import StackConfigure
from .list import StackList
from .run import StackRun


class StackParser(Subcommand):
    def __init__(self, subparsers: argparse._SubParsersAction):
        super().__init__()
        self.parser = subparsers.add_parser(
            "stack",
            prog="llama stack",
            description="Operations for the Llama Stack / Distributions",
        )

        subparsers = self.parser.add_subparsers(title="stack_subcommands")

        # Add sub-commands
        StackBuild.create(subparsers)
        StackConfigure.create(subparsers)
        StackList.create(subparsers)
        StackRun.create(subparsers)
@@ -13,3 +13,5 @@ LLAMA_STACK_CONFIG_DIR = Path(os.path.expanduser("~/.llama/"))
DISTRIBS_BASE_DIR = LLAMA_STACK_CONFIG_DIR / "distributions"

DEFAULT_CHECKPOINT_DIR = LLAMA_STACK_CONFIG_DIR / "checkpoints"

BUILDS_BASE_DIR = LLAMA_STACK_CONFIG_DIR / "builds"
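For reference, the on-disk layout these constants imply (all paths are the code's defaults under ~/.llama):

# ~/.llama/distributions   <- DISTRIBS_BASE_DIR
# ~/.llama/checkpoints     <- DEFAULT_CHECKPOINT_DIR
# ~/.llama/builds          <- BUILDS_BASE_DIR (new in this change)
ls ~/.llama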
@@ -5,6 +5,7 @@
# the root directory of this source tree.

import json
from datetime import datetime
from enum import Enum


@@ -12,4 +13,6 @@ class EnumEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Enum):
            return obj.value
        elif isinstance(obj, datetime):
            return obj.isoformat()
        return super().default(obj)
@@ -10,20 +10,36 @@ LLAMA_MODELS_DIR=${LLAMA_MODELS_DIR:-}
LLAMA_TOOLCHAIN_DIR=${LLAMA_TOOLCHAIN_DIR:-}
TEST_PYPI_VERSION=${TEST_PYPI_VERSION:-}

if [ -n "$LLAMA_TOOLCHAIN_DIR" ]; then
  echo "Using llama-toolchain-dir=$LLAMA_TOOLCHAIN_DIR"
fi
if [ -n "$LLAMA_MODELS_DIR" ]; then
  echo "Using llama-models-dir=$LLAMA_MODELS_DIR"
fi

set -euo pipefail

if [ "$#" -ne 3 ]; then
  echo "Usage: $0 <distribution_id> <build_name> <pip_dependencies>" >&2
  echo "Example: $0 <distribution_id> mybuild 'numpy pandas scipy'" >&2
  exit 1
fi

distribution_id="$1"
build_name="$2"
env_name="llamastack-$build_name"
pip_dependencies="$3"

# Define color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m' # No Color

error_handler() {
  echo "Error occurred in script at line: ${1}" >&2
  exit 1
}
# this is set if we actually create a new conda in which case we need to clean up
ENVNAME=""

# Set up the error trap
trap 'error_handler ${LINENO}' ERR
SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
source "$SCRIPT_DIR/common.sh"

ensure_conda_env_python310() {
  local env_name="$1"

@@ -32,26 +48,29 @@ ensure_conda_env_python310() {

  # Check if conda command is available
  if ! command -v conda &>/dev/null; then
    echo -e "${RED}Error: conda command not found. Is Conda installed and in your PATH?${NC}" >&2
    printf "${RED}Error: conda command not found. Is Conda installed and in your PATH?${NC}" >&2
    exit 1
  fi

  # Check if the environment exists
  if conda env list | grep -q "^${env_name} "; then
    echo "Conda environment '${env_name}' exists. Checking Python version..."
    printf "Conda environment '${env_name}' exists. Checking Python version...\n"

    # Check Python version in the environment
    current_version=$(conda run -n "${env_name}" python --version 2>&1 | cut -d' ' -f2 | cut -d'.' -f1,2)

    if [ "$current_version" = "$python_version" ]; then
      echo "Environment '${env_name}' already has Python ${python_version}. No action needed."
      printf "Environment '${env_name}' already has Python ${python_version}. No action needed.\n"
    else
      echo "Updating environment '${env_name}' to Python ${python_version}..."
      printf "Updating environment '${env_name}' to Python ${python_version}...\n"
      conda install -n "${env_name}" python="${python_version}" -y
    fi
  else
    echo "Conda environment '${env_name}' does not exist. Creating with Python ${python_version}..."
    printf "Conda environment '${env_name}' does not exist. Creating with Python ${python_version}...\n"
    conda create -n "${env_name}" python="${python_version}" -y

    ENVNAME="${env_name}"
    # setup_cleanup_handlers
  fi

  eval "$(conda shell.bash hook)"

@@ -65,48 +84,45 @@ ensure_conda_env_python310() {
  # Re-installing llama-toolchain in the new conda environment
  if [ -n "$LLAMA_TOOLCHAIN_DIR" ]; then
    if [ ! -d "$LLAMA_TOOLCHAIN_DIR" ]; then
      echo -e "${RED}Warning: LLAMA_TOOLCHAIN_DIR is set but directory does not exist: $LLAMA_TOOLCHAIN_DIR${NC}" >&2
      printf "${RED}Warning: LLAMA_TOOLCHAIN_DIR is set but directory does not exist: $LLAMA_TOOLCHAIN_DIR${NC}\n" >&2
      exit 1
    fi

    echo "Installing from LLAMA_TOOLCHAIN_DIR: $LLAMA_TOOLCHAIN_DIR"
    pip install -e "$LLAMA_TOOLCHAIN_DIR"
    printf "Installing from LLAMA_TOOLCHAIN_DIR: $LLAMA_TOOLCHAIN_DIR\n"
    pip install --no-cache-dir -e "$LLAMA_TOOLCHAIN_DIR"
  else
    pip install llama-toolchain
    pip install --no-cache-dir llama-toolchain
  fi

  if [ -n "$LLAMA_MODELS_DIR" ]; then
    if [ ! -d "$LLAMA_MODELS_DIR" ]; then
      echo -e "${RED}Warning: LLAMA_MODELS_DIR is set but directory does not exist: $LLAMA_MODELS_DIR${NC}" >&2
      printf "${RED}Warning: LLAMA_MODELS_DIR is set but directory does not exist: $LLAMA_MODELS_DIR${NC}\n" >&2
      exit 1
    fi

    echo "Installing from LLAMA_MODELS_DIR: $LLAMA_MODELS_DIR"
    printf "Installing from LLAMA_MODELS_DIR: $LLAMA_MODELS_DIR\n"
    pip uninstall -y llama-models
    pip install -e "$LLAMA_MODELS_DIR"
    pip install --no-cache-dir -e "$LLAMA_MODELS_DIR"
  fi

  # Install pip dependencies
  if [ -n "$pip_dependencies" ]; then
    echo "Installing pip dependencies: $pip_dependencies"
    printf "Installing pip dependencies: $pip_dependencies\n"
    pip install $pip_dependencies
    fi
  fi
}

if [ "$#" -ne 3 ]; then
  echo "Usage: $0 <environment_name> <distribution_name> <pip_dependencies>" >&2
  echo "Example: $0 my_env local-llama-8b 'numpy pandas scipy'" >&2
  exit 1
fi

env_name="$1"
distribution_name="$2"
pip_dependencies="$3"

ensure_conda_env_python310 "$env_name" "$pip_dependencies"

echo -e "${GREEN}Successfully set up distribution environment. Configuring...${NC}"
printf "${GREEN}Successfully set up conda environment. Configuring build...${NC}\n"

which python3
python3 -m llama_toolchain.cli.llama distribution configure --name "$distribution_name"
if [ "$distribution_id" = "adhoc" ]; then
  subcommand="api"
  target=""
else
  subcommand="stack"
  target="$distribution_id"
fi

$CONDA_PREFIX/bin/python3 -m llama_toolchain.cli.llama $subcommand configure $target --name "$build_name" --type conda_env
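This script is normally driven by `llama stack build`, but per its own usage line it can be invoked directly; the build name and dependency list here are illustrative:

# Build against a local toolchain checkout instead of PyPI:
LLAMA_TOOLCHAIN_DIR=~/src/llama-toolchain \
  ./build_conda_env.sh local-ollama my-build 'ollama blobfile'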
llama_toolchain/core/build_container.sh (new executable file, 120 lines)
@@ -0,0 +1,120 @@
#!/bin/bash

LLAMA_MODELS_DIR=${LLAMA_MODELS_DIR:-}
LLAMA_TOOLCHAIN_DIR=${LLAMA_TOOLCHAIN_DIR:-}
TEST_PYPI_VERSION=${TEST_PYPI_VERSION:-}

if [ "$#" -ne 4 ]; then
  echo "Usage: $0 <distribution_id> <build_name> <docker_base> <pip_dependencies>" >&2
  echo "Example: $0 distribution_id my-fastapi-app python:3.9-slim 'fastapi uvicorn'" >&2
  exit 1
fi

distribution_id=$1
build_name="$2"
image_name="llamastack-$build_name"
docker_base=$3
pip_dependencies=$4

# Define color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m' # No Color

set -euo pipefail

SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
REPO_DIR=$(dirname $(dirname "$SCRIPT_DIR"))

TEMP_DIR=$(mktemp -d)

add_to_docker() {
  local input
  output_file="$TEMP_DIR/Dockerfile"
  if [ -t 0 ]; then
    printf '%s\n' "$1" >>"$output_file"
  else
    # If stdin is not a terminal, read from it (heredoc)
    cat >>"$output_file"
  fi
}

add_to_docker <<EOF
FROM $docker_base
WORKDIR /app

RUN apt-get update && apt-get install -y \
       iputils-ping net-tools iproute2 dnsutils telnet \
       curl wget telnet \
       procps psmisc lsof \
       traceroute \
       && rm -rf /var/lib/apt/lists/*

EOF

toolchain_mount="/app/llama-toolchain-source"
models_mount="/app/llama-models-source"

if [ -n "$LLAMA_TOOLCHAIN_DIR" ]; then
  if [ ! -d "$LLAMA_TOOLCHAIN_DIR" ]; then
    echo "${RED}Warning: LLAMA_TOOLCHAIN_DIR is set but directory does not exist: $LLAMA_TOOLCHAIN_DIR${NC}" >&2
    exit 1
  fi
  add_to_docker "RUN pip install $toolchain_mount"
else
  add_to_docker "RUN pip install llama-toolchain"
fi

if [ -n "$LLAMA_MODELS_DIR" ]; then
  if [ ! -d "$LLAMA_MODELS_DIR" ]; then
    echo "${RED}Warning: LLAMA_MODELS_DIR is set but directory does not exist: $LLAMA_MODELS_DIR${NC}" >&2
    exit 1
  fi

  add_to_docker <<EOF
RUN pip uninstall -y llama-models
RUN pip install $models_mount

EOF
fi

if [ -n "$pip_dependencies" ]; then
  add_to_docker "RUN pip install $pip_dependencies"
fi

add_to_docker <<EOF

# This would be good in production but for debugging flexibility lets not add it right now
# We need a more solid production ready entrypoint.sh anyway
#
# ENTRYPOINT ["python", "-m", "llama_toolchain.core.server"]

EOF

printf "Dockerfile created successfully in $TEMP_DIR/Dockerfile"
cat $TEMP_DIR/Dockerfile
printf "\n"

mounts=""
if [ -n "$LLAMA_TOOLCHAIN_DIR" ]; then
  mounts="$mounts -v $(readlink -f $LLAMA_TOOLCHAIN_DIR):$toolchain_mount"
fi
if [ -n "$LLAMA_MODELS_DIR" ]; then
  mounts="$mounts -v $(readlink -f $LLAMA_MODELS_DIR):$models_mount"
fi
set -x
podman build -t $image_name -f "$TEMP_DIR/Dockerfile" "$REPO_DIR" $mounts
set +x

printf "${GREEN}Successfully set up Podman image. Configuring build...${NC}"
echo "You can run it with: podman run -p 8000:8000 $image_name"

if [ "$distribution_id" = "adhoc" ]; then
  subcommand="api"
  target=""
else
  subcommand="stack"
  target="$distribution_id"
fi

$CONDA_PREFIX/bin/python3 -m llama_toolchain.cli.llama $subcommand configure $target --name "$build_name" --type container
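Running the resulting image follows the hint the script prints, using the "llamastack-<build_name>" naming convention established above:

# Run the freshly built container image:
podman run -p 8000:8000 llamastack-my-build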
llama_toolchain/core/common.sh (new executable file, 40 lines)
@@ -0,0 +1,40 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

cleanup() {
  envname="$1"

  set +x
  echo "Cleaning up..."
  conda deactivate
  conda env remove --name $envname -y
}

handle_int() {
  if [ -n "$ENVNAME" ]; then
    cleanup "$ENVNAME"
  fi
  exit 1
}

handle_exit() {
  if [ $? -ne 0 ]; then
    echo -e "\033[1;31mABORTING.\033[0m"
    if [ -n "$ENVNAME" ]; then
      cleanup "$ENVNAME"
    fi
  fi
}

setup_cleanup_handlers() {
  trap handle_int INT
  trap handle_exit EXIT

  __conda_setup="$('conda' 'shell.bash' 'hook' 2>/dev/null)"
  eval "$__conda_setup"

  conda deactivate
}
llama_toolchain/core/configure.py (new file, 50 lines)
@@ -0,0 +1,50 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import Any, Dict

from llama_toolchain.core.datatypes import *  # noqa: F403
from termcolor import cprint

from llama_toolchain.common.prompt_for_config import prompt_for_config
from llama_toolchain.core.distribution import api_providers
from llama_toolchain.core.dynamic import instantiate_class_type


def configure_api_providers(existing_configs: Dict[str, Any]) -> Dict[str, Any]:
    all_providers = api_providers()

    provider_configs = {}
    for api_str, stub_config in existing_configs.items():
        api = Api(api_str)
        providers = all_providers[api]
        provider_id = stub_config["provider_id"]
        if provider_id not in providers:
            raise ValueError(
                f"Unknown provider `{provider_id}` is not available for API `{api_str}`"
            )

        provider_spec = providers[provider_id]
        cprint(f"Configuring API: {api_str} ({provider_id})", "white", attrs=["bold"])
        config_type = instantiate_class_type(provider_spec.config_class)

        try:
            existing_provider_config = config_type(**stub_config)
        except Exception:
            existing_provider_config = None

        provider_config = prompt_for_config(
            config_type,
            existing_provider_config,
        )
        print("")

        provider_configs[api_str] = {
            "provider_id": provider_id,
            **provider_config.dict(),
        }

    return provider_configs
llama_toolchain/core/datatypes.py (new file, 190 lines)
@@ -0,0 +1,190 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Optional

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel, Field, validator


@json_schema_type
class Api(Enum):
    inference = "inference"
    safety = "safety"
    agentic_system = "agentic_system"
    memory = "memory"


@json_schema_type
class ApiEndpoint(BaseModel):
    route: str
    method: str
    name: str


@json_schema_type
class ProviderSpec(BaseModel):
    api: Api
    provider_id: str
    config_class: str = Field(
        ...,
        description="Fully-qualified classname of the config for this provider",
    )
    api_dependencies: List[Api] = Field(
        default_factory=list,
        description="Higher-level API surfaces may depend on other providers to provide their functionality",
    )


@json_schema_type
class AdapterSpec(BaseModel):
    adapter_id: str = Field(
        ...,
        description="Unique identifier for this adapter",
    )
    module: str = Field(
        ...,
        description="""
Fully-qualified name of the module to import. The module is expected to have:

 - `get_adapter_impl(config, deps)`: returns the adapter implementation
""",
    )
    pip_packages: List[str] = Field(
        default_factory=list,
        description="The pip dependencies needed for this implementation",
    )
    config_class: Optional[str] = Field(
        default=None,
        description="Fully-qualified classname of the config for this provider",
    )


@json_schema_type
class InlineProviderSpec(ProviderSpec):
    pip_packages: List[str] = Field(
        default_factory=list,
        description="The pip dependencies needed for this implementation",
    )
    docker_image: Optional[str] = Field(
        default=None,
        description="""
The docker image to use for this implementation. If one is provided, pip_packages will be ignored.
If a provider depends on other providers, the dependencies MUST NOT specify a docker image.
""",
    )
    module: str = Field(
        ...,
        description="""
Fully-qualified name of the module to import. The module is expected to have:

 - `get_provider_impl(config, deps)`: returns the local implementation
""",
    )


class RemoteProviderConfig(BaseModel):
    url: str = Field(..., description="The URL for the provider")

    @validator("url")
    @classmethod
    def validate_url(cls, url: str) -> str:
        if not url.startswith("http"):
            raise ValueError(f"URL must start with http: {url}")
        return url.rstrip("/")


def remote_provider_id(adapter_id: str) -> str:
    return f"remote::{adapter_id}"


@json_schema_type
class RemoteProviderSpec(ProviderSpec):
    adapter: Optional[AdapterSpec] = Field(
        default=None,
        description="""
If some code is needed to convert the remote responses into Llama Stack compatible
API responses, specify the adapter here. If not specified, it indicates the remote
as being "Llama Stack compatible"
""",
    )

    @property
    def docker_image(self) -> Optional[str]:
        return None

    @property
    def module(self) -> str:
        if self.adapter:
            return self.adapter.module
        return f"llama_toolchain.{self.api.value}.client"

    @property
    def pip_packages(self) -> List[str]:
        if self.adapter:
            return self.adapter.pip_packages
        return []


# Can avoid this by using Pydantic computed_field
def remote_provider_spec(
    api: Api, adapter: Optional[AdapterSpec] = None
) -> RemoteProviderSpec:
    config_class = (
        adapter.config_class
        if adapter and adapter.config_class
        else "llama_toolchain.core.datatypes.RemoteProviderConfig"
    )
    provider_id = remote_provider_id(adapter.adapter_id) if adapter else "remote"

    return RemoteProviderSpec(
        api=api, provider_id=provider_id, config_class=config_class, adapter=adapter
    )


@json_schema_type
class DistributionSpec(BaseModel):
    distribution_id: str
    description: str

    docker_image: Optional[str] = None
    providers: Dict[Api, str] = Field(
        default_factory=dict,
        description="Provider IDs for each of the APIs provided by this distribution",
    )


@json_schema_type
class PackageConfig(BaseModel):
    built_at: datetime

    package_name: str = Field(
        ...,
        description="""
Reference to the distribution this package refers to. For unregistered (adhoc) packages,
this could be just a hash
""",
    )
    distribution_id: Optional[str] = None

    docker_image: Optional[str] = Field(
        default=None,
        description="Reference to the docker image if this package refers to a container",
    )
    conda_env: Optional[str] = Field(
        default=None,
        description="Reference to the conda environment if this package refers to a conda environment",
    )
    providers: Dict[str, Any] = Field(
        default_factory=dict,
        description="""
Provider configurations for each of the APIs provided by this package. This includes configurations for
the dependencies of these providers as well.
""",
    )
@@ -7,11 +7,13 @@
import inspect
from typing import Dict, List

from llama_toolchain.agentic_system.api.endpoints import AgenticSystem
from llama_toolchain.agentic_system.api import AgenticSystem
from llama_toolchain.agentic_system.providers import available_agentic_system_providers
from llama_toolchain.inference.api.endpoints import Inference
from llama_toolchain.inference.api import Inference
from llama_toolchain.inference.providers import available_inference_providers
from llama_toolchain.safety.api.endpoints import Safety
from llama_toolchain.memory.api import Memory
from llama_toolchain.memory.providers import available_memory_providers
from llama_toolchain.safety.api import Safety
from llama_toolchain.safety.providers import available_safety_providers

from .datatypes import (
@@ -20,6 +22,7 @@ from .datatypes import (
    DistributionSpec,
    InlineProviderSpec,
    ProviderSpec,
    remote_provider_spec,
)

# These are the dependencies needed by the distribution server.
@@ -40,6 +43,10 @@ def distribution_dependencies(distribution: DistributionSpec) -> List[str]:
    ] + SERVER_DEPENDENCIES


def stack_apis() -> List[Api]:
    return [Api.inference, Api.safety, Api.agentic_system, Api.memory]


def api_endpoints() -> Dict[Api, List[ApiEndpoint]]:
    apis = {}

@@ -47,6 +54,7 @@ def api_endpoints() -> Dict[Api, List[ApiEndpoint]]:
        Api.inference: Inference,
        Api.safety: Safety,
        Api.agentic_system: AgenticSystem,
        Api.memory: Memory,
    }

    for api, protocol in protocols.items():
@@ -60,9 +68,13 @@ def api_endpoints() -> Dict[Api, List[ApiEndpoint]]:
            webmethod = method.__webmethod__
            route = webmethod.route

            # use `post` for all methods right now until we fix up the `webmethod` openapi
            # annotation and write our own openapi generator
            endpoints.append(ApiEndpoint(route=route, method="post", name=name))
            if webmethod.method == "GET":
                method = "get"
            elif webmethod.method == "DELETE":
                method = "delete"
            else:
                method = "post"
            endpoints.append(ApiEndpoint(route=route, method=method, name=name))

        apis[api] = endpoints

@@ -78,8 +90,12 @@ def api_providers() -> Dict[Api, Dict[str, ProviderSpec]]:
        a.provider_id: a for a in available_agentic_system_providers()
    }

    return {
    ret = {
        Api.inference: inference_providers_by_id,
        Api.safety: safety_providers_by_id,
        Api.agentic_system: agentic_system_providers_by_id,
        Api.memory: {a.provider_id: a for a in available_memory_providers()},
    }
    for k, v in ret.items():
        v["remote"] = remote_provider_spec(k)
    return ret
llama_toolchain/core/distribution_registry.py (new file, 69 lines)
@@ -0,0 +1,69 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from functools import lru_cache
from typing import List, Optional

from .datatypes import *  # noqa: F403


@lru_cache()
def available_distribution_specs() -> List[DistributionSpec]:
    return [
        DistributionSpec(
            distribution_id="local",
            description="Use code from `llama_toolchain` itself to serve all llama stack APIs",
            providers={
                Api.inference: "meta-reference",
                Api.memory: "meta-reference-faiss",
                Api.safety: "meta-reference",
                Api.agentic_system: "meta-reference",
            },
        ),
        DistributionSpec(
            distribution_id="remote",
            description="Point to remote services for all llama stack APIs",
            providers={x: "remote" for x in Api},
        ),
        DistributionSpec(
            distribution_id="local-ollama",
            description="Like local, but use ollama for running LLM inference",
            providers={
                Api.inference: remote_provider_id("ollama"),
                Api.safety: "meta-reference",
                Api.agentic_system: "meta-reference",
                Api.memory: "meta-reference-faiss",
            },
        ),
        DistributionSpec(
            distribution_id="local-plus-fireworks-inference",
            description="Use Fireworks.ai for running LLM inference",
            providers={
                Api.inference: remote_provider_id("fireworks"),
                Api.safety: "meta-reference",
                Api.agentic_system: "meta-reference",
                Api.memory: "meta-reference-faiss",
            },
        ),
        DistributionSpec(
            distribution_id="local-plus-together-inference",
            description="Use Together.ai for running LLM inference",
            providers={
                Api.inference: remote_provider_id("together"),
                Api.safety: "meta-reference",
                Api.agentic_system: "meta-reference",
                Api.memory: "meta-reference-faiss",
            },
        ),
    ]


@lru_cache()
def resolve_distribution_spec(distribution_id: str) -> Optional[DistributionSpec]:
    for spec in available_distribution_specs():
        if spec.distribution_id == distribution_id:
            return spec
    return None
@@ -8,7 +8,7 @@ import asyncio
import importlib
from typing import Any, Dict

from .datatypes import InlineProviderSpec, ProviderSpec, RemoteProviderSpec
from .datatypes import ProviderSpec, RemoteProviderSpec


def instantiate_class_type(fully_qualified_name):
@@ -19,18 +19,24 @@ def instantiate_class_type(fully_qualified_name):

# returns a class implementing the protocol corresponding to the Api
def instantiate_provider(
    provider_spec: InlineProviderSpec,
    provider_spec: ProviderSpec,
    provider_config: Dict[str, Any],
    deps: Dict[str, ProviderSpec],
):
    module = importlib.import_module(provider_spec.module)

    config_type = instantiate_class_type(provider_spec.config_class)
    if isinstance(provider_spec, RemoteProviderSpec):
        if provider_spec.adapter:
            method = "get_adapter_impl"
        else:
            method = "get_client_impl"
    else:
        method = "get_provider_impl"

    config = config_type(**provider_config)
    return asyncio.run(module.get_provider_impl(config, deps))


def instantiate_client(provider_spec: RemoteProviderSpec, base_url: str):
    module = importlib.import_module(provider_spec.module)

    return asyncio.run(module.get_client_impl(base_url))
    fn = getattr(module, method)
    impl = asyncio.run(fn(config, deps))
    impl.__provider_spec__ = provider_spec
    impl.__provider_config__ = config
    return impl
llama_toolchain/core/package.py (new file, 149 lines)
@@ -0,0 +1,149 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import json
import os
from datetime import datetime
from enum import Enum
from typing import List, Optional

import pkg_resources
import yaml
from pydantic import BaseModel

from termcolor import cprint

from llama_toolchain.common.config_dirs import BUILDS_BASE_DIR
from llama_toolchain.common.exec import run_with_pty
from llama_toolchain.common.serialize import EnumEncoder

from llama_toolchain.core.datatypes import *  # noqa: F403
from llama_toolchain.core.distribution import api_providers, SERVER_DEPENDENCIES


class BuildType(Enum):
    container = "container"
    conda_env = "conda_env"

    def descriptor(self) -> str:
        return "docker" if self == self.container else "conda"


class Dependencies(BaseModel):
    pip_packages: List[str]
    docker_image: Optional[str] = None


class ApiInput(BaseModel):
    api: Api
    provider: str


def build_package(
    api_inputs: List[ApiInput],
    build_type: BuildType,
    name: str,
    distribution_id: Optional[str] = None,
    docker_image: Optional[str] = None,
):
    if not distribution_id:
        distribution_id = "adhoc"

    build_dir = BUILDS_BASE_DIR / distribution_id / build_type.descriptor()
    os.makedirs(build_dir, exist_ok=True)

    package_name = name.replace("::", "-")
    package_file = build_dir / f"{package_name}.yaml"

    all_providers = api_providers()

    package_deps = Dependencies(
        docker_image=docker_image or "python:3.10-slim",
        pip_packages=SERVER_DEPENDENCIES,
    )

    stub_config = {}
    for api_input in api_inputs:
        api = api_input.api
        providers_for_api = all_providers[api]
        if api_input.provider not in providers_for_api:
            raise ValueError(
                f"Provider `{api_input.provider}` is not available for API `{api}`"
            )

        provider = providers_for_api[api_input.provider]
        package_deps.pip_packages.extend(provider.pip_packages)
        if provider.docker_image:
            raise ValueError("A stack's dependencies cannot have a docker image")

        stub_config[api.value] = {"provider_id": api_input.provider}

    if package_file.exists():
        cprint(
            f"Build `{package_name}` exists; will reconfigure",
            color="yellow",
        )
        c = PackageConfig(**yaml.safe_load(package_file.read_text()))
        for api_str, new_config in stub_config.items():
            if api_str not in c.providers:
                c.providers[api_str] = new_config
            else:
                existing_config = c.providers[api_str]
                if existing_config["provider_id"] != new_config["provider_id"]:
                    cprint(
                        f"Provider `{api_str}` has changed from `{existing_config}` to `{new_config}`",
                        color="yellow",
                    )
                    c.providers[api_str] = new_config
    else:
        c = PackageConfig(
            built_at=datetime.now(),
            package_name=package_name,
            providers=stub_config,
        )

    c.distribution_id = distribution_id
    c.docker_image = package_name if build_type == BuildType.container else None
    c.conda_env = package_name if build_type == BuildType.conda_env else None

    with open(package_file, "w") as f:
        to_write = json.loads(json.dumps(c.dict(), cls=EnumEncoder))
        f.write(yaml.dump(to_write, sort_keys=False))

    if build_type == BuildType.container:
        script = pkg_resources.resource_filename(
            "llama_toolchain", "core/build_container.sh"
        )
        args = [
            script,
            distribution_id,
            package_name,
            package_deps.docker_image,
            " ".join(package_deps.pip_packages),
        ]
    else:
        script = pkg_resources.resource_filename(
            "llama_toolchain", "core/build_conda_env.sh"
        )
        args = [
            script,
            distribution_id,
            package_name,
            " ".join(package_deps.pip_packages),
        ]

    return_code = run_with_pty(args)
    if return_code != 0:
        cprint(
            f"Failed to build target {package_name} with return code {return_code}",
            color="red",
        )
        return

    cprint(
        f"Target `{package_name}` built with configuration at {str(package_file)}",
        color="green",
    )
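Following the path construction in build_package() above, the written configuration can be inspected directly; the distribution and build names are illustrative, and the paths assume the default BUILDS_BASE_DIR of ~/.llama/builds:

# A conda_env build lands under the "conda" descriptor directory:
cat ~/.llama/builds/local-ollama/conda/my-build.yaml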
@ -5,8 +5,10 @@
|
|||
# the root directory of this source tree.
|
||||
|
||||
import asyncio
|
||||
import inspect
|
||||
import json
|
||||
import signal
|
||||
import traceback
|
||||
from collections.abc import (
|
||||
AsyncGenerator as AsyncGeneratorABC,
|
||||
AsyncIterator as AsyncIteratorABC,
|
||||
|
@ -28,18 +30,17 @@ import fire
|
|||
import httpx
|
||||
import yaml
|
||||
|
||||
from fastapi import FastAPI, HTTPException, Request, Response
|
||||
from fastapi import Body, FastAPI, HTTPException, Request, Response
|
||||
from fastapi.exceptions import RequestValidationError
|
||||
from fastapi.responses import JSONResponse, StreamingResponse
|
||||
from fastapi.routing import APIRoute
|
||||
from pydantic import BaseModel, ValidationError
|
||||
from termcolor import cprint
|
||||
from typing_extensions import Annotated
|
||||
|
||||
from .datatypes import Api, DistributionSpec, ProviderSpec, RemoteProviderSpec
|
||||
from .distribution import api_endpoints
|
||||
from .dynamic import instantiate_client, instantiate_provider
|
||||
|
||||
from .registry import resolve_distribution_spec
|
||||
from .datatypes import Api, InlineProviderSpec, ProviderSpec, RemoteProviderSpec
|
||||
from .distribution import api_endpoints, api_providers
|
||||
from .dynamic import instantiate_provider
|
||||
|
||||
|
||||
def is_async_iterator_type(typ):
|
||||
|
@ -66,6 +67,7 @@ def create_sse_event(data: Any) -> str:
|
|||
|
||||
|
||||
async def global_exception_handler(request: Request, exc: Exception):
|
||||
traceback.print_exception(exc)
|
||||
http_exc = translate_exception(exc)
|
||||
|
||||
return JSONResponse(
|
||||
|
@ -155,9 +157,8 @@ def create_dynamic_passthrough(
|
|||
return endpoint
|
||||
|
||||
|
||||
def create_dynamic_typed_route(func: Any):
|
||||
def create_dynamic_typed_route(func: Any, method: str):
|
||||
hints = get_type_hints(func)
|
||||
request_model = next(iter(hints.values()))
|
||||
response_model = hints["return"]
|
||||
|
||||
# NOTE: I think it is better to just add a method within each Api
|
||||
|
@ -168,7 +169,7 @@ def create_dynamic_typed_route(func: Any):
|
|||
|
||||
if is_streaming:
|
||||
|
||||
async def endpoint(request: request_model):
|
||||
async def endpoint(**kwargs):
|
||||
async def sse_generator(event_gen):
|
||||
try:
|
||||
async for item in event_gen:
|
||||
|
@ -178,10 +179,7 @@ def create_dynamic_typed_route(func: Any):
|
|||
print("Generator cancelled")
|
||||
await event_gen.aclose()
|
||||
except Exception as e:
|
||||
print(e)
|
||||
import traceback
|
||||
|
||||
traceback.print_exc()
|
||||
traceback.print_exception(e)
|
||||
yield create_sse_event(
|
||||
{
|
||||
"error": {
|
||||
|
@@ -191,25 +189,38 @@ def create_dynamic_typed_route(func: Any):
                )

            return StreamingResponse(
-                sse_generator(func(request)), media_type="text/event-stream"
+                sse_generator(func(**kwargs)), media_type="text/event-stream"
            )

    else:

-        async def endpoint(request: request_model):
+        async def endpoint(**kwargs):
            try:
                return (
-                    await func(request)
+                    await func(**kwargs)
                    if asyncio.iscoroutinefunction(func)
-                    else func(request)
+                    else func(**kwargs)
                )
            except Exception as e:
-                print(e)
-                import traceback
-
-                traceback.print_exc()
+                traceback.print_exception(e)
                raise translate_exception(e) from e

+    sig = inspect.signature(func)
+    if method == "post":
+        # make sure every parameter is annotated with Body() so FASTAPI doesn't
+        # do anything too intelligent and ask for some parameters in the query
+        # and some in the body
+        endpoint.__signature__ = sig.replace(
+            parameters=[
+                param.replace(
+                    annotation=Annotated[param.annotation, Body(..., embed=True)]
+                )
+                for param in sig.parameters.values()
+            ]
+        )
+    else:
+        endpoint.__signature__ = sig

    return endpoint
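The `Body(..., embed=True)` rewrite above is what lets each API method keep a plain Python signature while the HTTP layer expects a keyed JSON body. A minimal sketch of the same technique, with illustrative names that are not part of this commit:

```python
import inspect

from fastapi import Body, FastAPI
from typing_extensions import Annotated

app = FastAPI()


async def chat_completion(request: dict):
    # stand-in for an API method that takes a single request object
    return {"echo": request}


async def endpoint(**kwargs):
    return await chat_completion(**kwargs)


# Copy chat_completion's signature onto the generic endpoint, wrapping every
# parameter in Body(..., embed=True); FastAPI then expects a JSON body like
# {"request": {...}} instead of query parameters.
sig = inspect.signature(chat_completion)
endpoint.__signature__ = sig.replace(
    parameters=[
        param.replace(annotation=Annotated[param.annotation, Body(..., embed=True)])
        for param in sig.parameters.values()
    ]
)
app.post("/inference/chat_completion")(endpoint)
```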
@@ -219,7 +230,6 @@ def topological_sort(providers: List[ProviderSpec]) -> List[ProviderSpec]:
    def dfs(a: ProviderSpec, visited: Set[Api], stack: List[Api]):
        visited.add(a.api)

        if not isinstance(a, RemoteProviderSpec):
            for api in a.api_dependencies:
                if api not in visited:
                    dfs(by_id[api], visited, stack)
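The `dfs` above is a standard post-order depth-first traversal, so each provider is emitted only after the APIs it depends on. A self-contained sketch of the idea, using a simplified stand-in for `ProviderSpec`:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Spec:  # simplified stand-in, not the real ProviderSpec
    api: str
    api_dependencies: List[str] = field(default_factory=list)


def topo_sort(specs: List[Spec]) -> List[Spec]:
    by_id: Dict[str, Spec] = {s.api: s for s in specs}
    visited, stack = set(), []

    def dfs(s: Spec) -> None:
        visited.add(s.api)
        for dep in s.api_dependencies:
            if dep not in visited:
                dfs(by_id[dep])
        stack.append(s.api)  # post-order: dependencies land first

    for s in specs:
        if s.api not in visited:
            dfs(s)
    return [by_id[x] for x in stack]


specs = [Spec("agentic_system", ["inference", "safety"]), Spec("inference"), Spec("safety")]
assert [s.api for s in topo_sort(specs)][-1] == "agentic_system"
```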
@@ -236,9 +246,11 @@ def topological_sort(providers: List[ProviderSpec]) -> List[ProviderSpec]:
    return [by_id[x] for x in stack]


-def resolve_impls(dist: DistributionSpec, config: Dict[str, Any]) -> Dict[Api, Any]:
+def resolve_impls(
+    provider_specs: Dict[str, ProviderSpec], config: Dict[str, Any]
+) -> Dict[Api, Any]:
    provider_configs = config["providers"]
-    provider_specs = topological_sort(dist.provider_specs.values())
+    provider_specs = topological_sort(provider_specs.values())

    impls = {}
    for provider_spec in provider_specs:
@@ -248,13 +260,11 @@ def resolve_impls(
                f"Could not find provider_spec config for {api}. Please add it to the config"
            )

-        provider_config = provider_configs[api.value]
-        if isinstance(provider_spec, RemoteProviderSpec):
-            impls[api] = instantiate_client(
-                provider_spec, provider_config["base_url"].rstrip("/")
-            )
-        else:
+        if isinstance(provider_spec, InlineProviderSpec):
            deps = {api: impls[api] for api in provider_spec.api_dependencies}
+        else:
+            deps = {}
+        provider_config = provider_configs[api.value]
        impl = instantiate_provider(provider_spec, provider_config, deps)
        impls[api] = impl
@@ -265,24 +275,36 @@ def main(yaml_config: str, port: int = 5000, disable_ipv6: bool = False):
    with open(yaml_config, "r") as fp:
        config = yaml.safe_load(fp)

-    spec = config["spec"]
-    dist = resolve_distribution_spec(spec)
-    if dist is None:
-        raise ValueError(f"Could not find distribution specification `{spec}`")
-
    app = FastAPI()

    all_endpoints = api_endpoints()
-    impls = resolve_impls(dist, config)
+    all_providers = api_providers()

-    for provider_spec in dist.provider_specs.values():
+    provider_specs = {}
+    for api_str, provider_config in config["providers"].items():
+        api = Api(api_str)
+        providers = all_providers[api]
+        provider_id = provider_config["provider_id"]
+        if provider_id not in providers:
+            raise ValueError(
+                f"Unknown provider `{provider_id}` is not available for API `{api}`"
+            )
+
+        provider_specs[api] = providers[provider_id]
+
+    impls = resolve_impls(provider_specs, config)
+
+    for provider_spec in provider_specs.values():
        api = provider_spec.api
        endpoints = all_endpoints[api]
        impl = impls[api]

-        if isinstance(provider_spec, RemoteProviderSpec):
+        if (
+            isinstance(provider_spec, RemoteProviderSpec)
+            and provider_spec.adapter is None
+        ):
            for endpoint in endpoints:
-                url = impl.base_url + endpoint.route
+                url = impl.__provider_config__.url.rstrip("/") + endpoint.route
                getattr(app, endpoint.method)(endpoint.route)(
                    create_dynamic_passthrough(url)
                )
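A hedged sketch (not from the diff) of the `config["providers"]` shape the loop above consumes; the provider ids and URL are illustrative, and the exact schema may differ from what `llama stack configure` emits:

```python
config = {
    "providers": {
        "inference": {"provider_id": "meta-ollama", "url": "http://localhost:11434"},
        "safety": {"provider_id": "meta-reference"},
        "agentic_system": {"provider_id": "meta-reference"},
    }
}

# Each API string becomes an Api enum value; the registered ProviderSpec for
# that provider_id is selected, then resolve_impls() instantiates everything
# in dependency order.
for api_str, provider_config in config["providers"].items():
    print(api_str, "->", provider_config["provider_id"])
```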
@@ -296,7 +318,7 @@ def main(yaml_config: str, port: int = 5000, disable_ipv6: bool = False):

            impl_method = getattr(impl, endpoint.name)
            getattr(app, endpoint.method)(endpoint.route, response_model=None)(
-                create_dynamic_typed_route(impl_method)
+                create_dynamic_typed_route(impl_method, endpoint.method)
            )

    for route in app.routes:
@@ -307,6 +329,7 @@ def main(yaml_config: str, port: int = 5000, disable_ipv6: bool = False):
            attrs=["bold"],
        )

+    app.exception_handler(RequestValidationError)(global_exception_handler)
    app.exception_handler(Exception)(global_exception_handler)
    signal.signal(signal.SIGINT, handle_sigint)
@@ -8,7 +8,6 @@

set -euo pipefail

# Define color codes
RED='\033[0;31m'
NC='\033[0m' # No Color
@@ -17,20 +16,27 @@ error_handler() {
    exit 1
}

# Set up the error trap
trap 'error_handler ${LINENO}' ERR

-if [ $# -lt 2 ]; then
-    echo "Usage: $0 <environment_name> <script_args...>"
+if [ $# -lt 3 ]; then
+    echo "Usage: $0 <build_name> <yaml_config> <port> <script_args...>"
    exit 1
fi

+build_name="$1"
+env_name="llamastack-$build_name"
+shift
+
-env_name="$1"
+yaml_config="$1"
shift

+port="$1"
+shift
+
eval "$(conda shell.bash hook)"
conda deactivate && conda activate "$env_name"

-python_interp=$(conda run -n "$env_name" which python)
-$python_interp -m llama_toolchain.distribution.server "$@"
+$CONDA_PREFIX/bin/python \
+    -m llama_toolchain.core.server \
+    --yaml_config "$yaml_config" \
+    --port "$port" "$@"
llama_toolchain/core/start_container.sh (new executable file, 43 lines)
@@ -0,0 +1,43 @@
#!/bin/bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

set -euo pipefail

RED='\033[0;31m'
NC='\033[0m' # No Color

error_handler() {
    echo "Error occurred in script at line: ${1}" >&2
    exit 1
}

trap 'error_handler ${LINENO}' ERR

if [ $# -lt 3 ]; then
    echo "Usage: $0 <build_name> <yaml_config> <port> <other_args...>"
    exit 1
fi

build_name="$1"
docker_image="llamastack-$build_name"
shift

yaml_config="$1"
shift

port="$1"
shift

set -x
podman run -it \
    -p $port:$port \
    -v "$yaml_config:/app/config.yaml" \
    $docker_image \
    python -m llama_toolchain.core.server \
    --yaml_config /app/config.yaml \
    --port $port "$@"
@@ -4,5 +4,4 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

-from .datatypes import * # noqa: F401 F403
-from .endpoints import * # noqa: F401 F403
+from .api import * # noqa: F401 F403
@@ -4,13 +4,34 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

-from typing import Protocol
+from enum import Enum
+from typing import Any, Dict, Optional, Protocol
+
+from llama_models.llama3.api.datatypes import URL

from llama_models.schema_utils import json_schema_type, webmethod

from pydantic import BaseModel

-from .datatypes import * # noqa: F403
+
+@json_schema_type
+class TrainEvalDatasetColumnType(Enum):
+    dialog = "dialog"
+    text = "text"
+    media = "media"
+    number = "number"
+    json = "json"
+
+
+@json_schema_type
+class TrainEvalDataset(BaseModel):
+    """Dataset to be used for training or evaluating language models."""
+
+    # TODO(ashwin): figure out if we need to add an enum for a "dataset type"
+
+    columns: Dict[str, TrainEvalDatasetColumnType]
+    content_url: URL
+    metadata: Optional[Dict[str, Any]] = None


@json_schema_type
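A hedged usage sketch for the `TrainEvalDataset` model above; the URL field name (`uri`) and the column names are assumptions for illustration:

```python
from llama_models.llama3.api.datatypes import URL

dataset = TrainEvalDataset(
    columns={
        "prompt": TrainEvalDatasetColumnType.text,
        "conversation": TrainEvalDatasetColumnType.dialog,
    },
    content_url=URL(uri="https://example.com/train.jsonl"),  # illustrative
    metadata={"split": "train"},
)
```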
@@ -1,34 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from enum import Enum
from typing import Any, Dict, Optional

from llama_models.llama3.api.datatypes import URL

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel


@json_schema_type
class TrainEvalDatasetColumnType(Enum):
    dialog = "dialog"
    text = "text"
    media = "media"
    number = "number"
    json = "json"


@json_schema_type
class TrainEvalDataset(BaseModel):
    """Dataset to be used for training or evaluating language models."""

    # TODO(ashwin): figure out if we need to add an enum for a "dataset type"

    columns: Dict[str, TrainEvalDatasetColumnType]
    content_url: URL
    metadata: Optional[Dict[str, Any]] = None
@@ -1,106 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from enum import Enum
from typing import Any, Dict, List, Optional

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel, Field, validator


@json_schema_type
class Api(Enum):
    inference = "inference"
    safety = "safety"
    agentic_system = "agentic_system"


@json_schema_type
class ApiEndpoint(BaseModel):
    route: str
    method: str
    name: str


@json_schema_type
class ProviderSpec(BaseModel):
    api: Api
    provider_id: str
    config_class: str = Field(
        ...,
        description="Fully-qualified classname of the config for this provider",
    )


@json_schema_type
class InlineProviderSpec(ProviderSpec):
    pip_packages: List[str] = Field(
        default_factory=list,
        description="The pip dependencies needed for this implementation",
    )
    module: str = Field(
        ...,
        description="""
Fully-qualified name of the module to import. The module is expected to have:

 - `get_provider_impl(config, deps)`: returns the local implementation
""",
    )
    api_dependencies: List[Api] = Field(
        default_factory=list,
        description="Higher-level API surfaces may depend on other providers to provide their functionality",
    )


class RemoteProviderConfig(BaseModel):
    base_url: str = Field(..., description="The base URL for the llama stack provider")
    api_key: Optional[str] = Field(
        ..., description="API key, if needed, for the provider"
    )

    @validator("base_url")
    @classmethod
    def validate_base_url(cls, base_url: str) -> str:
        if not base_url.startswith("http"):
            raise ValueError(f"URL must start with http: {base_url}")
        return base_url


@json_schema_type
class RemoteProviderSpec(ProviderSpec):
    module: str = Field(
        ...,
        description="""
Fully-qualified name of the module to import. The module is expected to have:
 - `get_client_impl(base_url)`: returns a client which can be used to call the remote implementation
""",
    )
    config_class: str = "llama_toolchain.distribution.datatypes.RemoteProviderConfig"


@json_schema_type
class DistributionSpec(BaseModel):
    spec_id: str
    description: str

    provider_specs: Dict[Api, ProviderSpec] = Field(
        default_factory=dict,
        description="Provider specifications for each of the APIs provided by this distribution",
    )


@json_schema_type
class DistributionConfig(BaseModel):
    """References to a installed / configured DistributionSpec"""

    name: str
    spec: str
    conda_env: str
    providers: Dict[str, Any] = Field(
        default_factory=dict,
        description="Provider configurations for each of the APIs provided by this distribution",
    )
@@ -1,79 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from functools import lru_cache
from typing import List, Optional

from .datatypes import Api, DistributionSpec, RemoteProviderSpec
from .distribution import api_providers


def client_module(api: Api) -> str:
    return f"llama_toolchain.{api.value}.client"


def remote_spec(api: Api) -> RemoteProviderSpec:
    return RemoteProviderSpec(
        api=api,
        provider_id=f"{api.value}-remote",
        module=client_module(api),
    )


@lru_cache()
def available_distribution_specs() -> List[DistributionSpec]:
    providers = api_providers()
    return [
        DistributionSpec(
            spec_id="local",
            description="Use code from `llama_toolchain` itself to serve all llama stack APIs",
            provider_specs={
                Api.inference: providers[Api.inference]["meta-reference"],
                Api.safety: providers[Api.safety]["meta-reference"],
                Api.agentic_system: providers[Api.agentic_system]["meta-reference"],
            },
        ),
        DistributionSpec(
            spec_id="remote",
            description="Point to remote services for all llama stack APIs",
            provider_specs={x: remote_spec(x) for x in providers},
        ),
        DistributionSpec(
            spec_id="local-ollama",
            description="Like local, but use ollama for running LLM inference",
            provider_specs={
                Api.inference: providers[Api.inference]["meta-ollama"],
                Api.safety: providers[Api.safety]["meta-reference"],
                Api.agentic_system: providers[Api.agentic_system]["meta-reference"],
            },
        ),
        DistributionSpec(
            spec_id="remote-fireworks",
            description="Use Fireworks.ai for running LLM inference",
            provider_specs={
                Api.inference: providers[Api.inference]["fireworks"],
                Api.safety: providers[Api.safety]["meta-reference"],
                Api.agentic_system: providers[Api.agentic_system]["meta-reference"],
            },
        ),
        DistributionSpec(
            spec_id="remote-together",
            description="Use Together.ai for running LLM inference",
            provider_specs={
                Api.inference: providers[Api.inference]["together"],
                Api.safety: providers[Api.safety]["meta-reference"],
                Api.agentic_system: providers[Api.agentic_system]["meta-reference"],
            },
        ),
    ]


@lru_cache()
def resolve_distribution_spec(spec_id: str) -> Optional[DistributionSpec]:
    for spec in available_distribution_specs():
        if spec.spec_id == spec_id:
            return spec
    return None
@@ -4,5 +4,4 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

-from .datatypes import * # noqa: F401 F403
-from .endpoints import * # noqa: F401 F403
+from .api import * # noqa: F401 F403
@@ -4,6 +4,7 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

+from enum import Enum
from typing import List, Protocol

from llama_models.schema_utils import webmethod

@@ -11,11 +12,34 @@ from llama_models.schema_utils import webmethod
from pydantic import BaseModel

from llama_models.llama3.api.datatypes import * # noqa: F403
-from .datatypes import * # noqa: F403
-from llama_toolchain.dataset.api.datatypes import * # noqa: F403
+from llama_toolchain.dataset.api import * # noqa: F403
from llama_toolchain.common.training_types import * # noqa: F403


+class TextGenerationMetric(Enum):
+    perplexity = "perplexity"
+    rouge = "rouge"
+    bleu = "bleu"
+
+
+class QuestionAnsweringMetric(Enum):
+    em = "em"
+    f1 = "f1"
+
+
+class SummarizationMetric(Enum):
+    rouge = "rouge"
+    bleu = "bleu"
+
+
+class EvaluationJob(BaseModel):
+    job_uuid: str
+
+
+class EvaluationJobLogStream(BaseModel):
+    job_uuid: str
+
+
class EvaluateTaskRequestCommon(BaseModel):
    job_uuid: str
    dataset: TrainEvalDataset
@@ -1,33 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from enum import Enum

from pydantic import BaseModel


class TextGenerationMetric(Enum):
    perplexity = "perplexity"
    rouge = "rouge"
    bleu = "bleu"


class QuestionAnsweringMetric(Enum):
    em = "em"
    f1 = "f1"


class SummarizationMetric(Enum):
    rouge = "rouge"
    bleu = "bleu"


class EvaluationJob(BaseModel):
    job_uuid: str


class EvaluationJobLogStream(BaseModel):
    job_uuid: str
llama_toolchain/inference/adapters/fireworks/__init__.py (new file, 18 lines)
@@ -0,0 +1,18 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .config import FireworksImplConfig


async def get_adapter_impl(config: FireworksImplConfig, _deps) -> Inference:
    from .fireworks import FireworksInferenceAdapter

    assert isinstance(
        config, FireworksImplConfig
    ), f"Unexpected config type: {type(config)}"
    impl = FireworksInferenceAdapter(config)
    await impl.initialize()
    return impl
@@ -5,9 +5,9 @@
# the root directory of this source tree.

import uuid
-from typing import AsyncGenerator, Dict
+from typing import AsyncGenerator

import httpx
+from fireworks.client import Fireworks

from llama_models.llama3.api.datatypes import (
    BuiltinTool,

@@ -18,20 +18,8 @@ from llama_models.llama3.api.datatypes import (
)
from llama_models.llama3.api.tool_utils import ToolUtils
from llama_models.sku_list import resolve_model
-from fireworks.client import Fireworks

-from llama_toolchain.distribution.datatypes import Api, ProviderSpec
-from llama_toolchain.inference.api import (
-    ChatCompletionRequest,
-    ChatCompletionResponse,
-    ChatCompletionResponseEvent,
-    ChatCompletionResponseEventType,
-    ChatCompletionResponseStreamChunk,
-    CompletionRequest,
-    Inference,
-    ToolCallDelta,
-    ToolCallParseStatus,
-)
+from llama_toolchain.inference.api import * # noqa: F403

from .config import FireworksImplConfig

@@ -42,18 +30,7 @@ FIREWORKS_SUPPORTED_MODELS = {
}


-async def get_provider_impl(
-    config: FireworksImplConfig, _deps: Dict[Api, ProviderSpec]
-) -> Inference:
-    assert isinstance(
-        config, FireworksImplConfig
-    ), f"Unexpected config type: {type(config)}"
-    impl = FireworksInference(config)
-    await impl.initialize()
-    return impl
-
-
-class FireworksInference(Inference):
+class FireworksInferenceAdapter(Inference):
    def __init__(self, config: FireworksImplConfig) -> None:
        self.config = config
llama_toolchain/inference/adapters/ollama/__init__.py (new file, 15 lines)
@@ -0,0 +1,15 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from llama_toolchain.core.datatypes import RemoteProviderConfig


async def get_adapter_impl(config: RemoteProviderConfig, _deps):
    from .ollama import OllamaInferenceAdapter

    impl = OllamaInferenceAdapter(config.url)
    await impl.initialize()
    return impl
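A hedged sketch of how a remote adapter factory like `get_adapter_impl` above gets driven; the URL is illustrative, and constructing `RemoteProviderConfig` directly like this is an assumption (normally `llama stack run` wires it up from the run config):

```python
import asyncio

from llama_toolchain.core.datatypes import RemoteProviderConfig
from llama_toolchain.inference.adapters.ollama import get_adapter_impl


async def main() -> None:
    config = RemoteProviderConfig(url="http://localhost:11434")  # assumed field
    impl = await get_adapter_impl(config, None)
    print(type(impl).__name__)  # OllamaInferenceAdapter


asyncio.run(main())
```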
@@ -4,63 +4,37 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import uuid
-from typing import AsyncGenerator, Dict
+from typing import AsyncGenerator

import httpx

-from llama_models.llama3.api.datatypes import (
-    BuiltinTool,
-    CompletionMessage,
-    Message,
-    StopReason,
-    ToolCall,
-)
-from llama_models.llama3.api.tool_utils import ToolUtils
+from llama_models.llama3.api.chat_format import ChatFormat
+from llama_models.llama3.api.datatypes import Message, StopReason
+from llama_models.llama3.api.tokenizer import Tokenizer
from llama_models.sku_list import resolve_model
from ollama import AsyncClient

-from llama_toolchain.distribution.datatypes import Api, ProviderSpec
-from llama_toolchain.inference.api import (
-    ChatCompletionRequest,
-    ChatCompletionResponse,
-    ChatCompletionResponseEvent,
-    ChatCompletionResponseEventType,
-    ChatCompletionResponseStreamChunk,
-    CompletionRequest,
-    Inference,
-    ToolCallDelta,
-    ToolCallParseStatus,
-)
-
-from .config import OllamaImplConfig
+from llama_toolchain.inference.api import * # noqa: F403
+from llama_toolchain.inference.prepare_messages import prepare_messages

# TODO: Eventually this will move to the llama cli model list command
# mapping of Model SKUs to ollama models
OLLAMA_SUPPORTED_SKUS = {
    # "Meta-Llama3.1-8B-Instruct": "llama3.1",
    "Meta-Llama3.1-8B-Instruct": "llama3.1:8b-instruct-fp16",
    "Meta-Llama3.1-70B-Instruct": "llama3.1:70b-instruct-fp16",
}


-async def get_provider_impl(
-    config: OllamaImplConfig, _deps: Dict[Api, ProviderSpec]
-) -> Inference:
-    assert isinstance(
-        config, OllamaImplConfig
-    ), f"Unexpected config type: {type(config)}"
-    impl = OllamaInference(config)
-    await impl.initialize()
-    return impl
-
-
-class OllamaInference(Inference):
-    def __init__(self, config: OllamaImplConfig) -> None:
-        self.config = config
+class OllamaInferenceAdapter(Inference):
+    def __init__(self, url: str) -> None:
+        self.url = url
+        tokenizer = Tokenizer.get_instance()
+        self.formatter = ChatFormat(tokenizer)

    @property
    def client(self) -> AsyncClient:
-        return AsyncClient(host=self.config.url)
+        return AsyncClient(host=self.url)

    async def initialize(self) -> None:
        try:

@@ -111,6 +85,7 @@ class OllamaInference(Inference):
        return options

    async def chat_completion(self, request: ChatCompletionRequest) -> AsyncGenerator:
+        messages = prepare_messages(request)
        # accumulate sampling params and other options to pass to ollama
        options = self.get_ollama_chat_options(request)
        ollama_model = self.resolve_ollama_model(request.model)

@@ -132,7 +107,7 @@ class OllamaInference(Inference):
        if not request.stream:
            r = await self.client.chat(
                model=ollama_model,
-                messages=self._messages_to_ollama_messages(request.messages),
+                messages=self._messages_to_ollama_messages(messages),
                stream=False,
                options=options,
            )

@@ -143,9 +118,8 @@ class OllamaInference(Inference):
            elif r["done_reason"] == "length":
                stop_reason = StopReason.out_of_tokens

-            completion_message = decode_assistant_message_from_content(
-                r["message"]["content"],
-                stop_reason,
+            completion_message = self.formatter.decode_assistant_message_from_content(
+                r["message"]["content"], stop_reason
            )
            yield ChatCompletionResponse(
                completion_message=completion_message,

@@ -160,7 +134,7 @@ class OllamaInference(Inference):
            )
            stream = await self.client.chat(
                model=ollama_model,
-                messages=self._messages_to_ollama_messages(request.messages),
+                messages=self._messages_to_ollama_messages(messages),
                stream=True,
                options=options,
            )

@@ -228,7 +202,9 @@ class OllamaInference(Inference):
            )

            # parse tool calls and report errors
-            message = decode_assistant_message_from_content(buffer, stop_reason)
+            message = self.formatter.decode_assistant_message_from_content(
+                buffer, stop_reason
+            )
            parsed_tool_calls = len(message.tool_calls) > 0
            if ipython and not parsed_tool_calls:
                yield ChatCompletionResponseStreamChunk(

@@ -261,70 +237,3 @@ class OllamaInference(Inference):
                    stop_reason=stop_reason,
                )
            )
-
-
-# TODO: Consolidate this with impl in llama-models
-def decode_assistant_message_from_content(
-    content: str,
-    stop_reason: StopReason,
-) -> CompletionMessage:
-    ipython = content.startswith("<|python_tag|>")
-    if ipython:
-        content = content[len("<|python_tag|>") :]
-
-    if content.endswith("<|eot_id|>"):
-        content = content[: -len("<|eot_id|>")]
-        stop_reason = StopReason.end_of_turn
-    elif content.endswith("<|eom_id|>"):
-        content = content[: -len("<|eom_id|>")]
-        stop_reason = StopReason.end_of_message
-
-    tool_name = None
-    tool_arguments = {}
-
-    custom_tool_info = ToolUtils.maybe_extract_custom_tool_call(content)
-    if custom_tool_info is not None:
-        tool_name, tool_arguments = custom_tool_info
-        # Sometimes when agent has custom tools alongside builin tools
-        # Agent responds for builtin tool calls in the format of the custom tools
-        # This code tries to handle that case
-        if tool_name in BuiltinTool.__members__:
-            tool_name = BuiltinTool[tool_name]
-            tool_arguments = {
-                "query": list(tool_arguments.values())[0],
-            }
-    else:
-        builtin_tool_info = ToolUtils.maybe_extract_builtin_tool_call(content)
-        if builtin_tool_info is not None:
-            tool_name, query = builtin_tool_info
-            tool_arguments = {
-                "query": query,
-            }
-            if tool_name in BuiltinTool.__members__:
-                tool_name = BuiltinTool[tool_name]
-        elif ipython:
-            tool_name = BuiltinTool.code_interpreter
-            tool_arguments = {
-                "code": content,
-            }
-
-    tool_calls = []
-    if tool_name is not None and tool_arguments is not None:
-        call_id = str(uuid.uuid4())
-        tool_calls.append(
-            ToolCall(
-                call_id=call_id,
-                tool_name=tool_name,
-                arguments=tool_arguments,
-            )
-        )
-        content = ""
-
-    if stop_reason is None:
-        stop_reason = StopReason.out_of_tokens
-
-    return CompletionMessage(
-        content=content,
-        stop_reason=stop_reason,
-        tool_calls=tool_calls,
-    )
llama_toolchain/inference/adapters/together/__init__.py (new file, 18 lines)
@@ -0,0 +1,18 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .config import TogetherImplConfig


async def get_adapter_impl(config: TogetherImplConfig, _deps) -> Inference:
    from .together import TogetherInferenceAdapter

    assert isinstance(
        config, TogetherImplConfig
    ), f"Unexpected config type: {type(config)}"
    impl = TogetherInferenceAdapter(config)
    await impl.initialize()
    return impl
@@ -5,7 +5,7 @@
# the root directory of this source tree.

import uuid
-from typing import AsyncGenerator, Dict
+from typing import AsyncGenerator

from llama_models.llama3.api.datatypes import (
    BuiltinTool,

@@ -18,18 +18,7 @@ from llama_models.llama3.api.tool_utils import ToolUtils
from llama_models.sku_list import resolve_model
from together import Together

-from llama_toolchain.distribution.datatypes import Api, ProviderSpec
-from llama_toolchain.inference.api import (
-    ChatCompletionRequest,
-    ChatCompletionResponse,
-    ChatCompletionResponseEvent,
-    ChatCompletionResponseEventType,
-    ChatCompletionResponseStreamChunk,
-    CompletionRequest,
-    Inference,
-    ToolCallDelta,
-    ToolCallParseStatus,
-)
+from llama_toolchain.inference.api import * # noqa: F403

from .config import TogetherImplConfig

@@ -40,18 +29,7 @@ TOGETHER_SUPPORTED_MODELS = {
}


-async def get_provider_impl(
-    config: TogetherImplConfig, _deps: Dict[Api, ProviderSpec]
-) -> Inference:
-    assert isinstance(
-        config, TogetherImplConfig
-    ), f"Unexpected config type: {type(config)}"
-    impl = TogetherInference(config)
-    await impl.initialize()
-    return impl
-
-
-class TogetherInference(Inference):
+class TogetherInferenceAdapter(Inference):
    def __init__(self, config: TogetherImplConfig) -> None:
        self.config = config
@@ -4,5 +4,4 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

-from .datatypes import * # noqa: F401 F403
-from .endpoints import * # noqa: F401 F403
+from .api import * # noqa: F401 F403
@@ -4,17 +4,79 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

-from .datatypes import * # noqa: F403
-from typing import Optional, Protocol
+from enum import Enum
+from typing import List, Literal, Optional, Protocol, Union

-# this dependency is annoying and we need a forked up version anyway
-from llama_models.schema_utils import webmethod
+from llama_models.schema_utils import json_schema_type, webmethod
+
+from pydantic import BaseModel, Field
+from typing_extensions import Annotated
+
+from llama_models.llama3.api.datatypes import * # noqa: F403
+
+
+class LogProbConfig(BaseModel):
+    top_k: Optional[int] = 0
+
+
+@json_schema_type
+class QuantizationType(Enum):
+    bf16 = "bf16"
+    fp8 = "fp8"
+
+
+@json_schema_type
+class Fp8QuantizationConfig(BaseModel):
+    type: Literal[QuantizationType.fp8.value] = QuantizationType.fp8.value
+
+
+@json_schema_type
+class Bf16QuantizationConfig(BaseModel):
+    type: Literal[QuantizationType.bf16.value] = QuantizationType.bf16.value
+
+
+QuantizationConfig = Annotated[
+    Union[Bf16QuantizationConfig, Fp8QuantizationConfig],
+    Field(discriminator="type"),
+]
+
+
+@json_schema_type
+class ChatCompletionResponseEventType(Enum):
+    start = "start"
+    complete = "complete"
+    progress = "progress"
+
+
+@json_schema_type
+class ToolCallParseStatus(Enum):
+    started = "started"
+    in_progress = "in_progress"
+    failure = "failure"
+    success = "success"
+
+
+@json_schema_type
+class ToolCallDelta(BaseModel):
+    content: Union[str, ToolCall]
+    parse_status: ToolCallParseStatus
+
+
+@json_schema_type
+class ChatCompletionResponseEvent(BaseModel):
+    """Chat completion response event."""
+
+    event_type: ChatCompletionResponseEventType
+    delta: Union[str, ToolCallDelta]
+    logprobs: Optional[List[TokenLogProbs]] = None
+    stop_reason: Optional[StopReason] = None


@json_schema_type
class CompletionRequest(BaseModel):
    model: str
-    content: InterleavedTextAttachment
+    content: InterleavedTextMedia
    sampling_params: Optional[SamplingParams] = SamplingParams()

    stream: Optional[bool] = False
@@ -39,7 +101,7 @@ class CompletionResponseStreamChunk(BaseModel):
@json_schema_type
class BatchCompletionRequest(BaseModel):
    model: str
-    content_batch: List[InterleavedTextAttachment]
+    content_batch: List[InterleavedTextMedia]
    sampling_params: Optional[SamplingParams] = SamplingParams()
    logprobs: Optional[LogProbConfig] = None
@@ -56,7 +118,11 @@ class ChatCompletionRequest(BaseModel):
    sampling_params: Optional[SamplingParams] = SamplingParams()

    # zero-shot tool definitions as input to the model
-    available_tools: Optional[List[ToolDefinition]] = Field(default_factory=list)
+    tools: Optional[List[ToolDefinition]] = Field(default_factory=list)
+    tool_choice: Optional[ToolChoice] = Field(default=ToolChoice.auto)
+    tool_prompt_format: Optional[ToolPromptFormat] = Field(
+        default=ToolPromptFormat.json
+    )

    stream: Optional[bool] = False
    logprobs: Optional[LogProbConfig] = None
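A hedged sketch of constructing the reworked request with a custom tool; the `ToolDefinition`/`ToolParamDefinition` field names follow the llama_models datatypes as of this commit and should be treated as assumptions:

```python
request = ChatCompletionRequest(
    model="Meta-Llama3.1-8B-Instruct",  # illustrative model id
    messages=[UserMessage(content="What is the weather in Menlo Park?")],
    tools=[
        ToolDefinition(
            tool_name="get_weather",  # custom (string) tool name
            description="Get the current weather for a city",
            parameters={"city": ToolParamDefinition(param_type="str", required=True)},
        )
    ],
    tool_choice=ToolChoice.auto,
    tool_prompt_format=ToolPromptFormat.json,
)
```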
@@ -82,8 +148,11 @@ class BatchChatCompletionRequest(BaseModel):
    sampling_params: Optional[SamplingParams] = SamplingParams()

    # zero-shot tool definitions as input to the model
-    available_tools: Optional[List[ToolDefinition]] = Field(default_factory=list)
-
+    tools: Optional[List[ToolDefinition]] = Field(default_factory=list)
+    tool_choice: Optional[ToolChoice] = Field(default=ToolChoice.auto)
+    tool_prompt_format: Optional[ToolPromptFormat] = Field(
+        default=ToolPromptFormat.json
+    )
    logprobs: Optional[LogProbConfig] = None
@@ -92,6 +161,11 @@ class BatchChatCompletionResponse(BaseModel):
    completion_message_batch: List[CompletionMessage]


+@json_schema_type
+class EmbeddingsResponse(BaseModel):
+    embeddings: List[List[float]]
+
+
class Inference(Protocol):
    @webmethod(route="/inference/completion")
    async def completion(
@@ -105,14 +179,9 @@ class Inference(Protocol):
        request: ChatCompletionRequest,
    ) -> Union[ChatCompletionResponse, ChatCompletionResponseStreamChunk]: ...

-    @webmethod(route="/inference/batch_completion")
-    async def batch_completion(
+    @webmethod(route="/inference/embeddings")
+    async def embeddings(
        self,
-        request: BatchCompletionRequest,
-    ) -> BatchCompletionResponse: ...
-
-    @webmethod(route="/inference/batch_chat_completion")
-    async def batch_chat_completion(
-        self,
-        request: BatchChatCompletionRequest,
-    ) -> BatchChatCompletionResponse: ...
+        model: str,
+        contents: List[InterleavedTextMedia],
+    ) -> EmbeddingsResponse: ...
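The batch endpoints move out to the separate batch_inference API, and the protocol gains embeddings in their place. A hedged sketch of calling the new endpoint (model id illustrative):

```python
async def embed_docs(inference: Inference) -> None:
    response = await inference.embeddings(
        model="Meta-Llama3.1-8B-Instruct",
        contents=["llama stack", "retrieval-augmented generation"],
    )
    assert len(response.embeddings) == 2  # one vector per input
```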
@@ -1,72 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from enum import Enum
from typing import List, Literal, Optional, Union

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel, Field
from typing_extensions import Annotated

from llama_models.llama3.api.datatypes import * # noqa: F403


class LogProbConfig(BaseModel):
    top_k: Optional[int] = 0


@json_schema_type
class QuantizationType(Enum):
    bf16 = "bf16"
    fp8 = "fp8"


@json_schema_type
class Fp8QuantizationConfig(BaseModel):
    type: Literal[QuantizationType.fp8.value] = QuantizationType.fp8.value


@json_schema_type
class Bf16QuantizationConfig(BaseModel):
    type: Literal[QuantizationType.bf16.value] = QuantizationType.bf16.value


QuantizationConfig = Annotated[
    Union[Bf16QuantizationConfig, Fp8QuantizationConfig],
    Field(discriminator="type"),
]


@json_schema_type
class ChatCompletionResponseEventType(Enum):
    start = "start"
    complete = "complete"
    progress = "progress"


@json_schema_type
class ToolCallParseStatus(Enum):
    started = "started"
    in_progress = "in_progress"
    failure = "failure"
    success = "success"


@json_schema_type
class ToolCallDelta(BaseModel):
    content: Union[str, ToolCall]
    parse_status: ToolCallParseStatus


@json_schema_type
class ChatCompletionResponseEvent(BaseModel):
    """Chat completion response event."""

    event_type: ChatCompletionResponseEventType
    delta: Union[str, ToolCallDelta]
    logprobs: Optional[List[TokenLogProbs]] = None
    stop_reason: Optional[StopReason] = None
@@ -6,12 +6,15 @@

import asyncio
import json
-from typing import AsyncGenerator
+from typing import Any, AsyncGenerator

import fire
import httpx
+from pydantic import BaseModel
from termcolor import cprint

+from llama_toolchain.core.datatypes import RemoteProviderConfig
+
from .api import (
    ChatCompletionRequest,
    ChatCompletionResponse,
@@ -23,13 +26,16 @@ from .api import (
from .event_logger import EventLogger


-async def get_client_impl(base_url: str):
-    return InferenceClient(base_url)
+async def get_client_impl(config: RemoteProviderConfig, _deps: Any) -> Inference:
+    return InferenceClient(config.url)
+
+
+def encodable_dict(d: BaseModel):
+    return json.loads(d.json())


class InferenceClient(Inference):
    def __init__(self, base_url: str):
        print(f"Initializing client for {base_url}")
        self.base_url = base_url

    async def initialize(self) -> None:
@@ -46,7 +52,9 @@ class InferenceClient(Inference):
            async with client.stream(
                "POST",
                f"{self.base_url}/inference/chat_completion",
-                data=request.json(),
+                json={
+                    "request": encodable_dict(request),
+                },
                headers={"Content-Type": "application/json"},
                timeout=20,
            ) as response:
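The `{"request": ...}` wrapper mirrors the server's `Body(..., embed=True)` signatures: every parameter must appear keyed by name in the JSON body. A hedged sketch of the equivalent raw call (URL and payload illustrative):

```python
import httpx

payload = {"request": {"model": "Meta-Llama3.1-8B-Instruct", "messages": []}}
resp = httpx.post(
    "http://localhost:5000/inference/chat_completion",
    json=payload,
    headers={"Content-Type": "application/json"},
)
```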
@@ -5,4 +5,15 @@
# the root directory of this source tree.

from .config import MetaReferenceImplConfig # noqa
-from .inference import get_provider_impl # noqa
+
+
+async def get_provider_impl(config: MetaReferenceImplConfig, _deps):
+    from .inference import MetaReferenceInferenceImpl
+
+    assert isinstance(
+        config, MetaReferenceImplConfig
+    ), f"Unexpected config type: {type(config)}"
+
+    impl = MetaReferenceInferenceImpl(config)
+    await impl.initialize()
+    return impl
@@ -11,10 +11,10 @@ from llama_models.datatypes import ModelFamily
from llama_models.schema_utils import json_schema_type
from llama_models.sku_list import all_registered_models

-from llama_toolchain.inference.api import QuantizationConfig
-
from pydantic import BaseModel, Field, field_validator

+from llama_toolchain.inference.api import QuantizationConfig
+

@json_schema_type
class MetaReferenceImplConfig(BaseModel):
@@ -24,7 +24,7 @@ from fairscale.nn.model_parallel.initialize import (
)
from llama_models.llama3.api.args import ModelArgs
from llama_models.llama3.api.chat_format import ChatFormat, ModelInput
-from llama_models.llama3.api.datatypes import Message
+from llama_models.llama3.api.datatypes import Message, ToolPromptFormat
from llama_models.llama3.api.tokenizer import Tokenizer
from llama_models.llama3.reference_impl.model import Transformer
from llama_models.sku_list import resolve_model

@@ -279,6 +279,7 @@ class Llama:
        top_p: float = 0.9,
        max_gen_len: Optional[int] = None,
        logprobs: bool = False,
+        tool_prompt_format: ToolPromptFormat = ToolPromptFormat.json,
    ) -> Generator:
        if (
            max_gen_len is None

@@ -288,7 +289,10 @@ class Llama:
            max_gen_len = self.model.params.max_seq_len - 1

        yield from self.generate(
-            model_input=self.formatter.encode_dialog_prompt(messages),
+            model_input=self.formatter.encode_dialog_prompt(
+                messages,
+                tool_prompt_format,
+            ),
            max_gen_len=max_gen_len,
            temperature=temperature,
            top_p=top_p,
@@ -6,12 +6,11 @@

import asyncio

-from typing import AsyncIterator, Dict, Union
+from typing import AsyncIterator, Union

from llama_models.llama3.api.datatypes import StopReason
from llama_models.sku_list import resolve_model

-from llama_toolchain.distribution.datatypes import Api, ProviderSpec
from llama_toolchain.inference.api import (
    ChatCompletionRequest,
    ChatCompletionResponse,

@@ -22,23 +21,11 @@ from llama_toolchain.inference.api import (
    ToolCallDelta,
    ToolCallParseStatus,
)
-
+from llama_toolchain.inference.prepare_messages import prepare_messages
from .config import MetaReferenceImplConfig
from .model_parallel import LlamaModelParallelGenerator


-async def get_provider_impl(
-    config: MetaReferenceImplConfig, _deps: Dict[Api, ProviderSpec]
-):
-    assert isinstance(
-        config, MetaReferenceImplConfig
-    ), f"Unexpected config type: {type(config)}"
-
-    impl = MetaReferenceInferenceImpl(config)
-    await impl.initialize()
-    return impl
-
-
# there's a single model parallel process running serving the model. for now,
# we don't support multiple concurrent requests to this process.
SEMAPHORE = asyncio.Semaphore(1)

@@ -67,6 +54,7 @@ class MetaReferenceInferenceImpl(Inference):
    ) -> AsyncIterator[
        Union[ChatCompletionResponseStreamChunk, ChatCompletionResponse]
    ]:
+        messages = prepare_messages(request)
        model = resolve_model(request.model)
        if model is None:
            raise RuntimeError(

@@ -98,11 +86,12 @@ class MetaReferenceInferenceImpl(Inference):
        ipython = False

        for token_result in self.generator.chat_completion(
-            messages=request.messages,
+            messages=messages,
            temperature=request.sampling_params.temperature,
            top_p=request.sampling_params.top_p,
            max_gen_len=request.sampling_params.max_tokens,
            logprobs=request.logprobs,
+            tool_prompt_format=request.tool_prompt_format,
        ):
            buffer += token_result.text
            tokens.append(token_result.token)
@@ -11,7 +11,7 @@ from functools import partial
from typing import Generator, List, Optional

from llama_models.llama3.api.chat_format import ChatFormat
-from llama_models.llama3.api.datatypes import Message
+from llama_models.llama3.api.datatypes import Message, ToolPromptFormat
from llama_models.llama3.api.tokenizer import Tokenizer
from llama_models.sku_list import resolve_model

@@ -27,6 +27,7 @@ class InferenceArgs:
    top_p: float
    max_gen_len: int
    logprobs: bool
+    tool_prompt_format: ToolPromptFormat


class ModelRunner:

@@ -41,6 +42,7 @@ class ModelRunner:
            task.top_p,
            task.max_gen_len,
            task.logprobs,
+            task.tool_prompt_format,
        )

@@ -93,6 +95,7 @@ class LlamaModelParallelGenerator:
        top_p: float = 0.9,
        max_gen_len: Optional[int] = None,
        logprobs: bool = False,
+        tool_prompt_format: ToolPromptFormat = ToolPromptFormat.json,
    ) -> Generator:
        req_obj = InferenceArgs(
            messages=deepcopy(messages),

@@ -100,6 +103,7 @@ class LlamaModelParallelGenerator:
            top_p=top_p,
            max_gen_len=max_gen_len,
            logprobs=logprobs,
+            tool_prompt_format=tool_prompt_format,
        )

        gen = self.group.run_inference(req_obj)
llama_toolchain/inference/prepare_messages.py (new file, 84 lines)
@@ -0,0 +1,84 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from llama_models.llama3.api.datatypes import * # noqa: F403
from llama_toolchain.inference.api import * # noqa: F403
from llama_models.llama3.prompt_templates import (
    BuiltinToolGenerator,
    FunctionTagCustomToolGenerator,
    JsonCustomToolGenerator,
    SystemDefaultGenerator,
)


def prepare_messages(request: ChatCompletionRequest) -> List[Message]:

    assert request.tool_choice == ToolChoice.auto, "Only `ToolChoice.auto` supported"

    existing_messages = request.messages
    existing_system_message = None
    if existing_messages[0].role == Role.system.value:
        existing_system_message = existing_messages.pop(0)

    assert (
        existing_messages[0].role != Role.system.value
    ), "Should only have 1 system message"

    messages = []

    default_gen = SystemDefaultGenerator()
    default_template = default_gen.gen()

    sys_content = ""

    tool_template = None
    if request.tools:
        tool_gen = BuiltinToolGenerator()
        tool_template = tool_gen.gen(request.tools)

        sys_content += tool_template.render()
        sys_content += "\n"

    sys_content += default_template.render()

    if existing_system_message:
        # TODO: this fn is needed in many places
        def _process(c):
            if isinstance(c, str):
                return c
            else:
                return "<media>"

        sys_content += "\n"

        if isinstance(existing_system_message.content, str):
            sys_content += _process(existing_system_message.content)
        elif isinstance(existing_system_message.content, list):
            sys_content += "\n".join(
                [_process(c) for c in existing_system_message.content]
            )

    messages.append(SystemMessage(content=sys_content))

    has_custom_tools = any(isinstance(dfn.tool_name, str) for dfn in request.tools)
    if has_custom_tools:
        if request.tool_prompt_format == ToolPromptFormat.json:
            tool_gen = JsonCustomToolGenerator()
        elif request.tool_prompt_format == ToolPromptFormat.function_tag:
            tool_gen = FunctionTagCustomToolGenerator()
        else:
            raise ValueError(
                f"Non supported ToolPromptFormat {request.tool_prompt_format}"
            )

        custom_tools = [t for t in request.tools if isinstance(t.tool_name, str)]
        custom_template = tool_gen.gen(custom_tools)
        messages.append(UserMessage(content=custom_template.render()))

    # Add back existing messages from the request
    messages += existing_messages

    return messages
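A hedged sketch of what `prepare_messages` produces for a request that mixes a caller-supplied system prompt with a custom tool; the values (and the minimal `ToolDefinition`) are illustrative:

```python
request = ChatCompletionRequest(
    model="Meta-Llama3.1-8B-Instruct",
    messages=[
        SystemMessage(content="You are a helpful assistant."),
        UserMessage(content="Search the docs for 'memory banks'."),
    ],
    tools=[ToolDefinition(tool_name="doc_search")],  # custom tool, string name
    tool_prompt_format=ToolPromptFormat.json,
)

messages = prepare_messages(request)
# messages[0] is one merged SystemMessage (builtin-tool template + defaults +
# the original system prompt), followed by a UserMessage describing the JSON
# custom-tool format, then the original user turn.
```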
@@ -6,7 +6,7 @@

from typing import List

-from llama_toolchain.distribution.datatypes import Api, InlineProviderSpec, ProviderSpec
+from llama_toolchain.core.datatypes import * # noqa: F403


def available_inference_providers() -> List[ProviderSpec]:

@@ -27,14 +27,13 @@ def available_inference_providers() -> List[ProviderSpec]:
            module="llama_toolchain.inference.meta_reference",
            config_class="llama_toolchain.inference.meta_reference.MetaReferenceImplConfig",
        ),
-        InlineProviderSpec(
+        remote_provider_spec(
            api=Api.inference,
-            provider_id="meta-ollama",
-            pip_packages=[
-                "ollama",
-            ],
-            module="llama_toolchain.inference.ollama",
-            config_class="llama_toolchain.inference.ollama.OllamaImplConfig",
+            adapter=AdapterSpec(
+                adapter_id="ollama",
+                pip_packages=["ollama"],
+                module="llama_toolchain.inference.adapters.ollama",
+            ),
        ),
        InlineProviderSpec(
            api=Api.inference,
@@ -14,12 +14,12 @@ import torch

from fairscale.nn.model_parallel.mappings import reduce_from_model_parallel_region
from llama_models.llama3.api.model import Transformer, TransformerBlock
+from llama_toolchain.inference.api import QuantizationType

from llama_toolchain.inference.api.config import (
    CheckpointQuantizationFormat,
    MetaReferenceImplConfig,
)
-from llama_toolchain.inference.api.datatypes import QuantizationType

from termcolor import cprint
from torch import Tensor
@@ -4,5 +4,4 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

-from .datatypes import * # noqa: F401 F403
-from .endpoints import * # noqa: F401 F403
+from .api import * # noqa: F401 F403
llama_toolchain/memory/api/api.py (new file, 157 lines)
@@ -0,0 +1,157 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import List, Optional, Protocol

from llama_models.schema_utils import json_schema_type, webmethod

from pydantic import BaseModel, Field
from typing_extensions import Annotated

from llama_models.llama3.api.datatypes import * # noqa: F403


@json_schema_type
class MemoryBankDocument(BaseModel):
    document_id: str
    content: InterleavedTextMedia | URL
    mime_type: str
    metadata: Dict[str, Any] = Field(default_factory=dict)


@json_schema_type
class MemoryBankType(Enum):
    vector = "vector"
    keyvalue = "keyvalue"
    keyword = "keyword"
    graph = "graph"


class VectorMemoryBankConfig(BaseModel):
    type: Literal[MemoryBankType.vector.value] = MemoryBankType.vector.value
    embedding_model: str
    chunk_size_in_tokens: int
    overlap_size_in_tokens: Optional[int] = None


class KeyValueMemoryBankConfig(BaseModel):
    type: Literal[MemoryBankType.keyvalue.value] = MemoryBankType.keyvalue.value


class KeywordMemoryBankConfig(BaseModel):
    type: Literal[MemoryBankType.keyword.value] = MemoryBankType.keyword.value


class GraphMemoryBankConfig(BaseModel):
    type: Literal[MemoryBankType.graph.value] = MemoryBankType.graph.value


MemoryBankConfig = Annotated[
    Union[
        VectorMemoryBankConfig,
        KeyValueMemoryBankConfig,
        KeywordMemoryBankConfig,
        GraphMemoryBankConfig,
    ],
    Field(discriminator="type"),
]


class Chunk(BaseModel):
    content: InterleavedTextMedia
    token_count: int
    document_id: str


@json_schema_type
class QueryDocumentsResponse(BaseModel):
    chunks: List[Chunk]
    scores: List[float]


@json_schema_type
class QueryAPI(Protocol):
    @webmethod(route="/query_documents")
    def query_documents(
        self,
        query: InterleavedTextMedia,
        params: Optional[Dict[str, Any]] = None,
    ) -> QueryDocumentsResponse: ...


@json_schema_type
class MemoryBank(BaseModel):
    bank_id: str
    name: str
    config: MemoryBankConfig
    # if there's a pre-existing (reachable-from-distribution) store which supports QueryAPI
    url: Optional[URL] = None


class Memory(Protocol):
    @webmethod(route="/memory_banks/create")
    async def create_memory_bank(
        self,
        name: str,
        config: MemoryBankConfig,
        url: Optional[URL] = None,
    ) -> MemoryBank: ...

    @webmethod(route="/memory_banks/list", method="GET")
    async def list_memory_banks(self) -> List[MemoryBank]: ...

    @webmethod(route="/memory_banks/get", method="GET")
    async def get_memory_bank(self, bank_id: str) -> Optional[MemoryBank]: ...

    @webmethod(route="/memory_banks/drop", method="DELETE")
    async def drop_memory_bank(
        self,
        bank_id: str,
    ) -> str: ...

    # this will just block now until documents are inserted, but it should
    # probably return a Job instance which can be polled for completion
    @webmethod(route="/memory_bank/insert")
    async def insert_documents(
        self,
        bank_id: str,
        documents: List[MemoryBankDocument],
        ttl_seconds: Optional[int] = None,
    ) -> None: ...

    @webmethod(route="/memory_bank/update")
    async def update_documents(
        self,
        bank_id: str,
        documents: List[MemoryBankDocument],
    ) -> None: ...

    @webmethod(route="/memory_bank/query")
    async def query_documents(
        self,
        bank_id: str,
        query: InterleavedTextMedia,
        params: Optional[Dict[str, Any]] = None,
    ) -> QueryDocumentsResponse: ...

    @webmethod(route="/memory_bank/documents/get", method="GET")
    async def get_documents(
        self,
        bank_id: str,
        document_ids: List[str],
    ) -> List[MemoryBankDocument]: ...

    @webmethod(route="/memory_bank/documents/delete", method="DELETE")
    async def delete_documents(
        self,
        bank_id: str,
        document_ids: List[str],
    ) -> None: ...
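A hedged sketch of exercising the new Memory protocol end to end; the embedding model name and the `max_chunks` query param are assumptions for illustration:

```python
async def build_bank(memory: Memory) -> None:
    bank = await memory.create_memory_bank(
        name="project-docs",
        config=VectorMemoryBankConfig(
            embedding_model="all-MiniLM-L6-v2",  # illustrative
            chunk_size_in_tokens=512,
            overlap_size_in_tokens=64,
        ),
    )
    await memory.insert_documents(
        bank_id=bank.bank_id,
        documents=[
            MemoryBankDocument(
                document_id="doc-1",
                content="Memory banks store chunked, embedded documents.",
                mime_type="text/plain",
            )
        ],
    )
    response = await memory.query_documents(
        bank_id=bank.bank_id,
        query="What do memory banks store?",
        params={"max_chunks": 5},  # assumed param
    )
    for chunk, score in zip(response.chunks, response.scores):
        print(score, chunk.content)
```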
@@ -1,25 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import Any, Dict

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel


@json_schema_type
class MemoryBank(BaseModel):
    memory_bank_id: str
    memory_bank_name: str


@json_schema_type
class MemoryBankDocument(BaseModel):
    document_id: str
    content: bytes
    metadata: Dict[str, Any]
    mime_type: str
@@ -1,61 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import List, Protocol

from llama_models.schema_utils import webmethod

from .datatypes import *  # noqa: F403


class MemoryBanks(Protocol):
    @webmethod(route="/memory_banks/create")
    def create_memory_bank(
        self,
        bank_id: str,
        bank_name: str,
        documents: List[MemoryBankDocument],
    ) -> None: ...

    @webmethod(route="/memory_banks/list")
    def get_memory_banks(self) -> List[MemoryBank]: ...

    @webmethod(route="/memory_banks/get")
    def get_memory_bank(self, bank_id: str) -> List[MemoryBank]: ...

    @webmethod(route="/memory_banks/drop")
    def delete_memory_bank(
        self,
        bank_id: str,
    ) -> str: ...

    @webmethod(route="/memory_bank/insert")
    def insert_memory_documents(
        self,
        bank_id: str,
        documents: List[MemoryBankDocument],
    ) -> None: ...

    @webmethod(route="/memory_bank/update")
    def update_memory_documents(
        self,
        bank_id: str,
        documents: List[MemoryBankDocument],
    ) -> None: ...

    @webmethod(route="/memory_bank/get")
    def get_memory_documents(
        self,
        bank_id: str,
        document_uuids: List[str],
    ) -> List[MemoryBankDocument]: ...

    @webmethod(route="/memory_bank/delete")
    def delete_memory_documents(
        self,
        bank_id: str,
        document_uuids: List[str],
    ) -> List[str]: ...
llama_toolchain/memory/client.py (new file, 181 lines)
@@ -0,0 +1,181 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import asyncio

from typing import Any, Dict, List, Optional

import fire
import httpx

from llama_toolchain.core.datatypes import RemoteProviderConfig

from .api import *  # noqa: F403


async def get_client_impl(config: RemoteProviderConfig, _deps: Any) -> Memory:
    return MemoryClient(config.url)


class MemoryClient(Memory):
    def __init__(self, base_url: str):
        self.base_url = base_url

    async def initialize(self) -> None:
        pass

    async def shutdown(self) -> None:
        pass

    async def get_memory_bank(self, bank_id: str) -> Optional[MemoryBank]:
        async with httpx.AsyncClient() as client:
            r = await client.get(
                f"{self.base_url}/memory_banks/get",
                params={
                    "bank_id": bank_id,
                },
                headers={"Content-Type": "application/json"},
                timeout=20,
            )
            r.raise_for_status()
            d = r.json()
            if not d:
                return None
            return MemoryBank(**d)

    async def create_memory_bank(
        self,
        name: str,
        config: MemoryBankConfig,
        url: Optional[URL] = None,
    ) -> MemoryBank:
        async with httpx.AsyncClient() as client:
            r = await client.post(
                f"{self.base_url}/memory_banks/create",
                json={
                    "name": name,
                    "config": config.dict(),
                    "url": url,
                },
                headers={"Content-Type": "application/json"},
                timeout=20,
            )
            r.raise_for_status()
            d = r.json()
            if not d:
                return None
            return MemoryBank(**d)

    async def insert_documents(
        self,
        bank_id: str,
        documents: List[MemoryBankDocument],
    ) -> None:
        async with httpx.AsyncClient() as client:
            r = await client.post(
                f"{self.base_url}/memory_bank/insert",
                json={
                    "bank_id": bank_id,
                    "documents": [d.dict() for d in documents],
                },
                headers={"Content-Type": "application/json"},
                timeout=20,
            )
            r.raise_for_status()

    async def query_documents(
        self,
        bank_id: str,
        query: InterleavedTextMedia,
        params: Optional[Dict[str, Any]] = None,
    ) -> QueryDocumentsResponse:
        async with httpx.AsyncClient() as client:
            r = await client.post(
                f"{self.base_url}/memory_bank/query",
                json={
                    "bank_id": bank_id,
                    "query": query,
                    "params": params,
                },
                headers={"Content-Type": "application/json"},
                timeout=20,
            )
            r.raise_for_status()
            return QueryDocumentsResponse(**r.json())


async def run_main(host: str, port: int, stream: bool):
    client = MemoryClient(f"http://{host}:{port}")

    # create a memory bank
    bank = await client.create_memory_bank(
        name="test_bank",
        config=VectorMemoryBankConfig(
            bank_id="test_bank",
            embedding_model="dragon-roberta-query-2",
            chunk_size_in_tokens=512,
            overlap_size_in_tokens=64,
        ),
    )
    print(bank)

    retrieved_bank = await client.get_memory_bank(bank.bank_id)
    assert retrieved_bank is not None
    assert retrieved_bank.config.embedding_model == "dragon-roberta-query-2"

    urls = [
        "memory_optimizations.rst",
        "chat.rst",
        "llama3.rst",
        "datasets.rst",
        "qat_finetune.rst",
        "lora_finetune.rst",
    ]
    documents = [
        MemoryBankDocument(
            document_id=f"num-{i}",
            content=URL(
                uri=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}"
            ),
            mime_type="text/plain",
        )
        for i, url in enumerate(urls)
    ]

    # insert some documents
    await client.insert_documents(
        bank_id=bank.bank_id,
        documents=documents,
    )

    # query the documents
    response = await client.query_documents(
        bank_id=bank.bank_id,
        query=[
            "How do I use Lora?",
        ],
    )
    for chunk, score in zip(response.chunks, response.scores):
        print(f"Score: {score}")
        print(f"Chunk:\n========\n{chunk}\n========\n")

    response = await client.query_documents(
        bank_id=bank.bank_id,
        query=[
            "Tell me more about llama3 and torchtune",
        ],
    )
    for chunk, score in zip(response.chunks, response.scores):
        print(f"Score: {score}")
        print(f"Chunk:\n========\n{chunk}\n========\n")


def main(host: str, port: int, stream: bool = True):
    asyncio.run(run_main(host, port, stream))


if __name__ == "__main__":
    fire.Fire(main)
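Because `main` is wrapped in `fire.Fire`, this smoke test can be pointed at a running memory provider from the shell, e.g. `python llama_toolchain/memory/client.py localhost 5000` (host and port here are illustrative).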
llama_toolchain/memory/meta_reference/faiss/__init__.py (new file, 19 lines)
@@ -0,0 +1,19 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .config import FaissImplConfig


async def get_provider_impl(config: FaissImplConfig, _deps):
    from .faiss import FaissMemoryImpl

    assert isinstance(
        config, FaissImplConfig
    ), f"Unexpected config type: {type(config)}"

    impl = FaissMemoryImpl(config)
    await impl.initialize()
    return impl
@@ -5,12 +5,9 @@
 # the root directory of this source tree.
 
 from llama_models.schema_utils import json_schema_type
-from pydantic import BaseModel, Field
+
+from pydantic import BaseModel
 
 
 @json_schema_type
-class OllamaImplConfig(BaseModel):
-    url: str = Field(
-        default="http://localhost:11434",
-        description="The URL for the ollama server",
-    )
+class FaissImplConfig(BaseModel): ...
llama_toolchain/memory/meta_reference/faiss/faiss.py (new file, 194 lines)
@@ -0,0 +1,194 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

import faiss
import httpx
import numpy as np

from llama_models.llama3.api.datatypes import *  # noqa: F403
from llama_models.llama3.api.tokenizer import Tokenizer

from llama_toolchain.memory.api import *  # noqa: F403
from .config import FaissImplConfig


async def content_from_doc(doc: MemoryBankDocument) -> str:
    if isinstance(doc.content, URL):
        async with httpx.AsyncClient() as client:
            r = await client.get(doc.content.uri)
            return r.text

    return interleaved_text_media_as_str(doc.content)

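# Note: chunk starts advance by (window_len - overlap_len) tokens, so each
# chunk shares overlap_len tokens with its predecessor; e.g. window_len=512,
# overlap_len=64 yields chunks starting at tokens 0, 448, 896, ...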
def make_overlapped_chunks(
    text: str, window_len: int, overlap_len: int
) -> List[Tuple[str, int]]:
    tokenizer = Tokenizer.get_instance()
    tokens = tokenizer.encode(text, bos=False, eos=False)

    chunks = []
    for i in range(0, len(tokens), window_len - overlap_len):
        toks = tokens[i : i + window_len]
        chunk = tokenizer.decode(toks)
        chunks.append((chunk, len(toks)))

    return chunks

@dataclass
class BankState:
    bank: MemoryBank
    index: Optional[faiss.IndexFlatL2] = None
    doc_by_id: Dict[str, MemoryBankDocument] = field(default_factory=dict)
    id_by_index: Dict[int, str] = field(default_factory=dict)
    chunk_by_index: Dict[int, Chunk] = field(default_factory=dict)

    async def insert_documents(
        self,
        model: "SentenceTransformer",
        documents: List[MemoryBankDocument],
    ) -> None:
        tokenizer = Tokenizer.get_instance()
        chunk_size = self.bank.config.chunk_size_in_tokens

        for doc in documents:
            indexlen = len(self.id_by_index)
            self.doc_by_id[doc.document_id] = doc

            content = await content_from_doc(doc)
            chunks = make_overlapped_chunks(
                content,
                self.bank.config.chunk_size_in_tokens,
                self.bank.config.overlap_size_in_tokens
                or (self.bank.config.chunk_size_in_tokens // 4),
            )
            embeddings = model.encode([x[0] for x in chunks]).astype(np.float32)
            await self._ensure_index(embeddings.shape[1])

            self.index.add(embeddings)
            for i, chunk in enumerate(chunks):
                self.chunk_by_index[indexlen + i] = Chunk(
                    content=chunk[0],
                    token_count=chunk[1],
                    document_id=doc.document_id,
                )
                print(f"Adding chunk #{indexlen + i} tokens={chunk[1]}")
                self.id_by_index[indexlen + i] = doc.document_id

    async def query_documents(
        self,
        model: "SentenceTransformer",
        query: InterleavedTextMedia,
        params: Optional[Dict[str, Any]] = None,
    ) -> QueryDocumentsResponse:
        if params is None:
            params = {}
        k = params.get("max_chunks", 3)

        def _process(c) -> str:
            if isinstance(c, str):
                return c
            else:
                return "<media>"

        if isinstance(query, list):
            query_str = " ".join([_process(c) for c in query])
        else:
            query_str = _process(query)

        query_vector = model.encode([query_str])[0]
        distances, indices = self.index.search(
            query_vector.reshape(1, -1).astype(np.float32), k
        )

        chunks = []
        scores = []
        for d, i in zip(distances[0], indices[0]):
            if i < 0:
                continue
            chunks.append(self.chunk_by_index[int(i)])
            # score as inverse L2 distance: smaller distance -> higher score
            scores.append(1.0 / float(d))

        return QueryDocumentsResponse(chunks=chunks, scores=scores)

    async def _ensure_index(self, dimension: int) -> faiss.IndexFlatL2:
        if self.index is None:
            self.index = faiss.IndexFlatL2(dimension)
        return self.index


class FaissMemoryImpl(Memory):
    def __init__(self, config: FaissImplConfig) -> None:
        self.config = config
        self.model = None
        self.states = {}

    async def initialize(self) -> None: ...

    async def shutdown(self) -> None: ...

    async def create_memory_bank(
        self,
        name: str,
        config: MemoryBankConfig,
        url: Optional[URL] = None,
    ) -> MemoryBank:
        assert url is None, "URL is not supported for this implementation"
        assert (
            config.type == MemoryBankType.vector.value
        ), f"Only vector banks are supported {config.type}"

        bank_id = str(uuid.uuid4())
        bank = MemoryBank(
            bank_id=bank_id,
            name=name,
            config=config,
            url=url,
        )
        state = BankState(bank=bank)
        self.states[bank_id] = state
        return bank

    async def get_memory_bank(self, bank_id: str) -> Optional[MemoryBank]:
        if bank_id not in self.states:
            return None
        return self.states[bank_id].bank

    async def insert_documents(
        self,
        bank_id: str,
        documents: List[MemoryBankDocument],
        ttl_seconds: Optional[int] = None,
    ) -> None:
        assert bank_id in self.states, f"Bank {bank_id} not found"
        state = self.states[bank_id]

        await state.insert_documents(self.get_model(), documents)

    async def query_documents(
        self,
        bank_id: str,
        query: InterleavedTextMedia,
        params: Optional[Dict[str, Any]] = None,
    ) -> QueryDocumentsResponse:
        assert bank_id in self.states, f"Bank {bank_id} not found"
        state = self.states[bank_id]

        return await state.query_documents(self.get_model(), query, params)

    def get_model(self) -> "SentenceTransformer":
        from sentence_transformers import SentenceTransformer

        if self.model is None:
            print("Loading sentence transformer")
            self.model = SentenceTransformer("all-MiniLM-L6-v2")

        return self.model
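For quick experimentation, the implementation can also be driven in-process, bypassing the REST client. A rough sketch, assuming plain-string document content is accepted (as `content_from_doc` suggests) and using illustrative config values:

import asyncio

from llama_toolchain.memory.api import *  # noqa: F403
from llama_toolchain.memory.meta_reference.faiss import (
    FaissImplConfig,
    get_provider_impl,
)


async def demo() -> None:
    # the lazy import inside get_provider_impl keeps the package importable
    # without faiss installed; the empty deps dict is illustrative
    impl = await get_provider_impl(FaissImplConfig(), {})
    bank = await impl.create_memory_bank(
        name="scratch",
        config=VectorMemoryBankConfig(
            bank_id="scratch",
            embedding_model="all-MiniLM-L6-v2",
            chunk_size_in_tokens=512,
            overlap_size_in_tokens=64,
        ),
    )
    await impl.insert_documents(
        bank_id=bank.bank_id,
        documents=[
            MemoryBankDocument(
                document_id="doc-0",
                content="LoRA inserts small low-rank adapter matrices into attention layers.",
                mime_type="text/plain",
            )
        ],
    )
    response = await impl.query_documents(bank.bank_id, ["What is LoRA?"])
    for chunk, score in zip(response.chunks, response.scores):
        print(score, chunk.content)


asyncio.run(demo())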
llama_toolchain/memory/providers.py (new file, 25 lines)
@@ -0,0 +1,25 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import List

from llama_toolchain.core.datatypes import Api, InlineProviderSpec, ProviderSpec


def available_memory_providers() -> List[ProviderSpec]:
    return [
        InlineProviderSpec(
            api=Api.memory,
            provider_id="meta-reference-faiss",
            pip_packages=[
                "blobfile",
                "faiss-cpu",
                "sentence-transformers",
            ],
            module="llama_toolchain.memory.meta_reference.faiss",
            config_class="llama_toolchain.memory.meta_reference.faiss.FaissImplConfig",
        ),
    ]
@@ -4,5 +4,4 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from .datatypes import *  # noqa: F401 F403
-from .endpoints import *  # noqa: F401 F403
+from .api import *  # noqa: F401 F403
@@ -5,12 +5,79 @@
# the root directory of this source tree.

from datetime import datetime
from typing import Any, Dict, List, Optional, Protocol
from enum import Enum

from typing import Any, Dict, List, Optional, Protocol, Union

from llama_models.schema_utils import json_schema_type, webmethod

from pydantic import BaseModel
from llama_models.llama3.api.datatypes import *  # noqa: F403
from .datatypes import *  # noqa: F403


@json_schema_type
class ExperimentStatus(Enum):
    NOT_STARTED = "not_started"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@json_schema_type
class Experiment(BaseModel):
    id: str
    name: str
    status: ExperimentStatus
    created_at: datetime
    updated_at: datetime
    metadata: Dict[str, Any]


@json_schema_type
class Run(BaseModel):
    id: str
    experiment_id: str
    status: str
    started_at: datetime
    ended_at: Optional[datetime]
    metadata: Dict[str, Any]


@json_schema_type
class Metric(BaseModel):
    name: str
    value: Union[float, int, str, bool]
    timestamp: datetime
    run_id: str


@json_schema_type
class Log(BaseModel):
    message: str
    level: str
    timestamp: datetime
    additional_info: Dict[str, Any]


@json_schema_type
class ArtifactType(Enum):
    MODEL = "model"
    DATASET = "dataset"
    CHECKPOINT = "checkpoint"
    PLOT = "plot"
    METRIC = "metric"
    CONFIG = "config"
    CODE = "code"
    OTHER = "other"


@json_schema_type
class Artifact(BaseModel):
    id: str
    name: str
    type: ArtifactType
    size: int
    created_at: datetime
    metadata: Dict[str, Any]


@json_schema_type
@@ -1,80 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from datetime import datetime
from enum import Enum

from typing import Any, Dict, Optional, Union

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel


@json_schema_type
class ExperimentStatus(Enum):
    NOT_STARTED = "not_started"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@json_schema_type
class Experiment(BaseModel):
    id: str
    name: str
    status: ExperimentStatus
    created_at: datetime
    updated_at: datetime
    metadata: Dict[str, Any]


@json_schema_type
class Run(BaseModel):
    id: str
    experiment_id: str
    status: str
    started_at: datetime
    ended_at: Optional[datetime]
    metadata: Dict[str, Any]


@json_schema_type
class Metric(BaseModel):
    name: str
    value: Union[float, int, str, bool]
    timestamp: datetime
    run_id: str


@json_schema_type
class Log(BaseModel):
    message: str
    level: str
    timestamp: datetime
    additional_info: Dict[str, Any]


@json_schema_type
class ArtifactType(Enum):
    MODEL = "model"
    DATASET = "dataset"
    CHECKPOINT = "checkpoint"
    PLOT = "plot"
    METRIC = "metric"
    CONFIG = "config"
    CODE = "code"
    OTHER = "other"


@json_schema_type
class Artifact(BaseModel):
    id: str
    name: str
    type: ArtifactType
    size: int
    created_at: datetime
    metadata: Dict[str, Any]
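These tracking records are plain pydantic models; for instance, a run-scoped metric might be constructed as follows (names and values illustrative):

from datetime import datetime

metric = Metric(
    name="train/loss",
    value=1.73,
    timestamp=datetime.utcnow(),
    run_id="run-001",
)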
@@ -4,5 +4,4 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from .datatypes import *  # noqa: F401 F403
-from .endpoints import *  # noqa: F401 F403
+from .api import *  # noqa: F401 F403
@@ -5,6 +5,7 @@
 # the root directory of this source tree.
 
 from datetime import datetime
+from enum import Enum
 
 from typing import Any, Dict, List, Optional, Protocol
 
@@ -13,9 +14,90 @@ from llama_models.schema_utils import json_schema_type, webmethod
from pydantic import BaseModel, Field

from llama_models.llama3.api.datatypes import *  # noqa: F403
from llama_toolchain.dataset.api.datatypes import *  # noqa: F403
from llama_toolchain.dataset.api import *  # noqa: F403
from llama_toolchain.common.training_types import *  # noqa: F403
from .datatypes import *  # noqa: F403


class OptimizerType(Enum):
    adam = "adam"
    adamw = "adamw"
    sgd = "sgd"


@json_schema_type
class OptimizerConfig(BaseModel):
    optimizer_type: OptimizerType
    lr: float
    lr_min: float
    weight_decay: float


@json_schema_type
class TrainingConfig(BaseModel):
    n_epochs: int
    batch_size: int
    shuffle: bool
    n_iters: int

    enable_activation_checkpointing: bool
    memory_efficient_fsdp_wrap: bool
    fsdp_cpu_offload: bool


@json_schema_type
class FinetuningAlgorithm(Enum):
    full = "full"
    lora = "lora"
    qlora = "qlora"
    dora = "dora"


@json_schema_type
class LoraFinetuningConfig(BaseModel):
    lora_attn_modules: List[str]
    apply_lora_to_mlp: bool
    apply_lora_to_output: bool
    rank: int
    alpha: int


@json_schema_type
class QLoraFinetuningConfig(LoraFinetuningConfig):
    pass


@json_schema_type
class DoraFinetuningConfig(LoraFinetuningConfig):
    pass


@json_schema_type
class PostTrainingJobLogStream(BaseModel):
    """Stream of logs from a finetuning job."""

    job_uuid: str
    log_lines: List[str]


@json_schema_type
class PostTrainingJobStatus(Enum):
    running = "running"
    completed = "completed"
    failed = "failed"
    scheduled = "scheduled"


@json_schema_type
class RLHFAlgorithm(Enum):
    dpo = "dpo"


@json_schema_type
class DPOAlignmentConfig(BaseModel):
    reward_scale: float
    reward_clip: float
    epsilon: float
    gamma: float


@json_schema_type
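To make the finetuning schema concrete, a typical LoRA setup under these models could look like the following; the attention module names and the rank/alpha values are illustrative, not defaults from this API:

lora = LoraFinetuningConfig(
    lora_attn_modules=["q_proj", "v_proj"],  # illustrative module names
    apply_lora_to_mlp=False,
    apply_lora_to_output=False,
    rank=8,
    alpha=16,
)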
@@ -1,94 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from enum import Enum
from typing import List

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel


class OptimizerType(Enum):
    adam = "adam"
    adamw = "adamw"
    sgd = "sgd"


@json_schema_type
class OptimizerConfig(BaseModel):
    optimizer_type: OptimizerType
    lr: float
    lr_min: float
    weight_decay: float


@json_schema_type
class TrainingConfig(BaseModel):
    n_epochs: int
    batch_size: int
    shuffle: bool
    n_iters: int

    enable_activation_checkpointing: bool
    memory_efficient_fsdp_wrap: bool
    fsdp_cpu_offload: bool


@json_schema_type
class FinetuningAlgorithm(Enum):
    full = "full"
    lora = "lora"
    qlora = "qlora"
    dora = "dora"


@json_schema_type
class LoraFinetuningConfig(BaseModel):
    lora_attn_modules: List[str]
    apply_lora_to_mlp: bool
    apply_lora_to_output: bool
    rank: int
    alpha: int


@json_schema_type
class QLoraFinetuningConfig(LoraFinetuningConfig):
    pass


@json_schema_type
class DoraFinetuningConfig(LoraFinetuningConfig):
    pass


@json_schema_type
class PostTrainingJobLogStream(BaseModel):
    """Stream of logs from a finetuning job."""

    job_uuid: str
    log_lines: List[str]


@json_schema_type
class PostTrainingJobStatus(Enum):
    running = "running"
    completed = "completed"
    failed = "failed"
    scheduled = "scheduled"


@json_schema_type
class RLHFAlgorithm(Enum):
    dpo = "dpo"


@json_schema_type
class DPOAlignmentConfig(BaseModel):
    reward_scale: float
    reward_clip: float
    epsilon: float
    gamma: float
@@ -4,5 +4,4 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from .datatypes import *  # noqa: F401 F403
-from .endpoints import *  # noqa: F401 F403
+from .api import *  # noqa: F401 F403
@@ -5,9 +5,30 @@
# the root directory of this source tree.

from typing import List, Protocol, Union
from .datatypes import *  # noqa: F403

from llama_models.schema_utils import webmethod
from llama_models.schema_utils import json_schema_type, webmethod

from pydantic import BaseModel

from llama_models.llama3.api.datatypes import *  # noqa: F403


@json_schema_type
class ScoredMessage(BaseModel):
    message: Message
    score: float


@json_schema_type
class DialogGenerations(BaseModel):
    dialog: List[Message]
    sampled_generations: List[Message]


@json_schema_type
class ScoredDialogGenerations(BaseModel):
    dialog: List[Message]
    scored_generations: List[ScoredMessage]


@json_schema_type
@@ -1,31 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from typing import List

from llama_models.schema_utils import json_schema_type

from pydantic import BaseModel

from llama_models.llama3.api.datatypes import *  # noqa: F403


@json_schema_type
class ScoredMessage(BaseModel):
    message: Message
    score: float


@json_schema_type
class DialogGenerations(BaseModel):
    dialog: List[Message]
    sampled_generations: List[Message]


@json_schema_type
class ScoredDialogGenerations(BaseModel):
    dialog: List[Message]
    scored_generations: List[ScoredMessage]
@@ -4,5 +4,4 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from .datatypes import *  # noqa
-from .endpoints import *  # noqa
+from .api import *  # noqa: F401 F403
@@ -5,13 +5,12 @@
 # the root directory of this source tree.
 
 from enum import Enum
-from typing import Dict, Optional, Union
-
-from llama_models.llama3.api.datatypes import ToolParamDefinition
-from llama_models.schema_utils import json_schema_type
+from typing import Dict, List, Optional, Protocol, Union
 
+from llama_models.schema_utils import json_schema_type, webmethod
 from pydantic import BaseModel, validator
 
+from llama_models.llama3.api.datatypes import *  # noqa: F403
 from llama_toolchain.common.deployment_types import RestAPIExecutionConfig
 
 
@@ -70,3 +69,22 @@ class ShieldResponse(BaseModel):
         except ValueError:
             return v
         return v
+
+
+@json_schema_type
+class RunShieldRequest(BaseModel):
+    messages: List[Message]
+    shields: List[ShieldDefinition]
+
+
+@json_schema_type
+class RunShieldResponse(BaseModel):
+    responses: List[ShieldResponse]
+
+
+class Safety(Protocol):
+    @webmethod(route="/safety/run_shields")
+    async def run_shields(
+        self,
+        request: RunShieldRequest,
+    ) -> RunShieldResponse: ...
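A minimal sketch of driving the new Safety protocol, assuming an `impl` object implementing it and assuming the module path `llama_toolchain.safety.api`, following the combined-api layout used above:

from typing import List

from llama_toolchain.safety.api import *  # noqa: F403


async def check(
    impl: Safety,
    messages: List[Message],
    shields: List[ShieldDefinition],
) -> RunShieldResponse:
    # one request runs every listed shield over the message history
    request = RunShieldRequest(messages=messages, shields=shields)
    return await impl.run_shields(request)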
@@ -1,32 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from .datatypes import *  # noqa: F403
from typing import List, Protocol

from llama_models.llama3.api.datatypes import Message

# this dependency is annoying and we need a forked up version anyway
from llama_models.schema_utils import webmethod


@json_schema_type
class RunShieldRequest(BaseModel):
    messages: List[Message]
    shields: List[ShieldDefinition]


@json_schema_type
class RunShieldResponse(BaseModel):
    responses: List[ShieldResponse]


class Safety(Protocol):
    @webmethod(route="/safety/run_shields")
    async def run_shields(
        self,
        request: RunShieldRequest,
    ) -> RunShieldResponse: ...
Some files were not shown because too many files have changed in this diff.