move docs -> source

2025-12-08 11:07:22 +00:00 · 2024-10-29 11:20:28 -07:00 · 2024-10-29 11:20:28 -07:00 · 044b13bd36
commit 044b13bd36
parent 4aa1bf6a60
7 changed files with 191 additions and 1105 deletions
--- a/docs/cli_reference.md
+++ b/docs/cli_reference.md
@ -1,485 +0,0 @@
 # Llama CLI Reference
 The `llama` CLI tool helps you set up and use the Llama Stack & agentic systems. It should be available on your path after installing the `llama-stack` package.
 ### Subcommands
 1. `download`: Downloads models from Meta or Hugging Face.
 2. `model`: Lists available models and their properties.
 3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](cli_reference.md#step-3-building-and-configuring-llama-stack-distributions).
 ### Sample Usage
 ```
 llama --help
 ```
 <pre style="font-family: monospace;">
 usage: llama [-h] {download,model,stack} ...
 Welcome to the Llama CLI
 options:
  -h, --help            show this help message and exit
 subcommands:
  {download,model,stack}
 </pre>
 ## Step 1. Get the models
 You first need to have models downloaded locally.
 To download any model you need the **Model Descriptor**.
 This can be obtained by running the command
 ```
 llama model list
 ```
 You should see a table like this:
 <pre style="font-family: monospace;">
 +----------------------------------+------------------------------------------+----------------+
 | Model Descriptor                 | Hugging Face Repo                        | Context Length |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-8B                      | meta-llama/Llama-3.1-8B                  | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-70B                     | meta-llama/Llama-3.1-70B                 | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-405B:bf16-mp8           | meta-llama/Llama-3.1-405B                | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-405B                    | meta-llama/Llama-3.1-405B-FP8            | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-405B:bf16-mp16          | meta-llama/Llama-3.1-405B                | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-8B-Instruct             | meta-llama/Llama-3.1-8B-Instruct         | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-70B-Instruct            | meta-llama/Llama-3.1-70B-Instruct        | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-405B-Instruct:bf16-mp8  | meta-llama/Llama-3.1-405B-Instruct       | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-405B-Instruct           | meta-llama/Llama-3.1-405B-Instruct-FP8   | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.1-405B-Instruct:bf16-mp16 | meta-llama/Llama-3.1-405B-Instruct       | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.2-1B                      | meta-llama/Llama-3.2-1B                  | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.2-3B                      | meta-llama/Llama-3.2-3B                  | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.2-11B-Vision              | meta-llama/Llama-3.2-11B-Vision          | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.2-90B-Vision              | meta-llama/Llama-3.2-90B-Vision          | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.2-1B-Instruct             | meta-llama/Llama-3.2-1B-Instruct         | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.2-3B-Instruct             | meta-llama/Llama-3.2-3B-Instruct         | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.2-11B-Vision-Instruct     | meta-llama/Llama-3.2-11B-Vision-Instruct | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama3.2-90B-Vision-Instruct     | meta-llama/Llama-3.2-90B-Vision-Instruct | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama-Guard-3-11B-Vision         | meta-llama/Llama-Guard-3-11B-Vision      | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama-Guard-3-1B:int4-mp1        | meta-llama/Llama-Guard-3-1B-INT4         | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama-Guard-3-1B                 | meta-llama/Llama-Guard-3-1B              | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama-Guard-3-8B                 | meta-llama/Llama-Guard-3-8B              | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama-Guard-3-8B:int8-mp1        | meta-llama/Llama-Guard-3-8B-INT8         | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Prompt-Guard-86M                 | meta-llama/Prompt-Guard-86M              | 128K           |
 +----------------------------------+------------------------------------------+----------------+
 | Llama-Guard-2-8B                 | meta-llama/Llama-Guard-2-8B              | 4K             |
 +----------------------------------+------------------------------------------+----------------+
 </pre>
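If you want to look up a model descriptor programmatically, the table can be parsed with a few lines of Python. This is a minimal sketch, assuming the output keeps the `+---+` / `| ... |` layout shown above; `parse_model_table` is a hypothetical helper, not part of the `llama` CLI, and the sample rows are copied from the table.

```python
def parse_model_table(text: str) -> list[dict[str, str]]:
    """Parse an ASCII table like the one printed by `llama model list`."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip border lines (+----+) and anything that is not a table row.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample = """
+----------------------+----------------------------------+----------------+
| Model Descriptor     | Hugging Face Repo                | Context Length |
+----------------------+----------------------------------+----------------+
| Llama3.2-3B-Instruct | meta-llama/Llama-3.2-3B-Instruct | 128K           |
+----------------------+----------------------------------+----------------+
| Llama-Guard-2-8B     | meta-llama/Llama-Guard-2-8B      | 4K             |
+----------------------+----------------------------------+----------------+
"""

# Index the rows by descriptor so a repo can be looked up by model name.
models = {row["Model Descriptor"]: row for row in parse_model_table(sample)}
print(models["Llama3.2-3B-Instruct"]["Hugging Face Repo"])
```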
 To download models, you can use the llama download command.
 #### Downloading from [Meta](https://llama.meta.com/llama-downloads/)
 Here are example download commands for the 3B-Instruct and 11B-Vision-Instruct models. You will need a `META_URL`, which can be obtained from [here](https://llama.meta.com/docs/getting_the_models/meta/).
 Download the required checkpoints using the following commands:
 ```bash
 # download the 3B Instruct model
 llama download --source meta --model-id Llama3.2-3B-Instruct --meta-url META_URL
 # you can also get the 11B Vision Instruct model
 llama download --source meta --model-id Llama3.2-11B-Vision-Instruct --meta-url META_URL
 # llama-agents have safety enabled by default. For this, you will need
 # safety models -- Llama-Guard and Prompt-Guard
 llama download --source meta --model-id Prompt-Guard-86M --meta-url META_URL
 llama download --source meta --model-id Llama-Guard-3-1B --meta-url META_URL
 ```
 #### Downloading from [Hugging Face](https://huggingface.co/meta-llama)
 Essentially, the same commands above work, just replace `--source meta` with `--source huggingface`.
 ```bash
 llama download --source huggingface --model-id Llama3.1-8B-Instruct --hf-token <HF_TOKEN>
 llama download --source huggingface --model-id Llama3.1-70B-Instruct --hf-token <HF_TOKEN>
 llama download --source huggingface --model-id Llama-Guard-3-1B --ignore-patterns *original*
 llama download --source huggingface --model-id Prompt-Guard-86M --ignore-patterns *original*
 ```
 **Important:** Set your environment variable `HF_TOKEN` or pass in `--hf-token` to the command to validate your access. You can find your token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
 > **Tip:** By default, `llama download` runs with `--ignore-patterns *.safetensors`, since we use the `.pth` files in the `original` folder. Llama Guard and Prompt Guard, however, need safetensors, so run with `--ignore-patterns original` to download the safetensors and skip the `.pth` files.
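When scripting several downloads, it can help to assemble the command from its parts. The sketch below is illustrative only: `llama_download_cmd` is a hypothetical helper that uses just the flags shown in this section (`--source`, `--model-id`, `--meta-url`, `--hf-token`, `--ignore-patterns`), and it assumes those flags behave as described above.

```python
import shlex
from typing import Optional


def llama_download_cmd(
    model_id: str,
    source: str = "meta",
    meta_url: Optional[str] = None,
    hf_token: Optional[str] = None,
    ignore_patterns: Optional[str] = None,
) -> list[str]:
    """Build the argument list for `llama download` (hypothetical helper)."""
    if source not in ("meta", "huggingface"):
        raise ValueError(f"unsupported source: {source}")
    cmd = ["llama", "download", "--source", source, "--model-id", model_id]
    if meta_url is not None:
        cmd += ["--meta-url", meta_url]
    if hf_token is not None:
        cmd += ["--hf-token", hf_token]
    if ignore_patterns is not None:
        cmd += ["--ignore-patterns", ignore_patterns]
    return cmd


guard_cmd = llama_download_cmd(
    "Llama-Guard-3-1B", source="huggingface", ignore_patterns="*original*"
)
# shlex.join quotes the glob so a shell would not expand it.
print(shlex.join(guard_cmd))
```

Passing the list to `subprocess.run(guard_cmd, check=True)` instead of a shell string avoids glob expansion of the pattern altogether.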
 #### Downloading via Ollama
 If you're already using ollama, we also have a supported Llama Stack distribution `local-ollama` and you can continue to use ollama for managing model downloads.
 ```
 ollama pull llama3.1:8b-instruct-fp16
 ollama pull llama3.1:70b-instruct-fp16
 ```
 > [!NOTE]
 > Only the above two models are currently supported by Ollama.
 ## Step 2: Understand the models
 The `llama model` command helps you explore the model’s interface.
 ### 2.1 Subcommands
 1. `download`: Download the model from different sources. (meta, huggingface)
 2. `list`: Lists all the models available for download with hardware requirements to deploy the models.
 3. `prompt-format`: Show llama model message formats.
 4. `describe`: Describes all the properties of the model.
 ### 2.2 Sample Usage
 `llama model <subcommand> <options>`
 ```
 llama model --help
 ```
 <pre style="font-family: monospace;">
 usage: llama model [-h] {download,list,prompt-format,describe} ...
 Work with llama models
 options:
  -h, --help            show this help message and exit
 model_subcommands:
  {download,list,prompt-format,describe}
 </pre>
 You can use the describe command to know more about a model:
 ```
 llama model describe -m Llama3.2-3B-Instruct
 ```
 ### 2.3 Describe
 <pre style="font-family: monospace;">
 +-----------------------------+----------------------------------+
 | Model                       | Llama3.2-3B-Instruct             |
 +-----------------------------+----------------------------------+
 | Hugging Face ID             | meta-llama/Llama-3.2-3B-Instruct |
 +-----------------------------+----------------------------------+
 | Description                 | Llama 3.2 3b instruct model      |
 +-----------------------------+----------------------------------+
 | Context Length              | 128K tokens                      |
 +-----------------------------+----------------------------------+
 | Weights format              | bf16                             |
 +-----------------------------+----------------------------------+
 | Model params.json           | {                                |
 |                             |     "dim": 3072,                 |
 |                             |     "n_layers": 28,              |
 |                             |     "n_heads": 24,               |
 |                             |     "n_kv_heads": 8,             |
 |                             |     "vocab_size": 128256,        |
 |                             |     "ffn_dim_multiplier": 1.0,   |
 |                             |     "multiple_of": 256,          |
 |                             |     "norm_eps": 1e-05,           |
 |                             |     "rope_theta": 500000.0,      |
 |                             |     "use_scaled_rope": true      |
 |                             | }                                |
 +-----------------------------+----------------------------------+
 | Recommended sampling params | {                                |
 |                             |     "strategy": "top_p",         |
 |                             |     "temperature": 1.0,          |
 |                             |     "top_p": 0.9,                |
 |                             |     "top_k": 0                   |
 |                             | }                                |
 +-----------------------------+----------------------------------+
 </pre>
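A few derived quantities can be read off the `params.json` values above. This is a small worked example, assuming the usual transformer convention that the model dimension is split evenly across attention heads; the variable names are illustrative.

```python
# Values copied from the "Model params.json" row above.
params = {
    "dim": 3072,
    "n_layers": 28,
    "n_heads": 24,
    "n_kv_heads": 8,
    "vocab_size": 128256,
}

# Assumption: each attention head gets an equal slice of the model dimension.
head_dim = params["dim"] // params["n_heads"]
# Number of query heads that share one key/value head (grouped-query attention).
queries_per_kv = params["n_heads"] // params["n_kv_heads"]
# Number of parameters in a vocab_size x dim token embedding table.
embedding_params = params["vocab_size"] * params["dim"]

print(head_dim, queries_per_kv, embedding_params)  # prints: 128 3 394002432
```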
 ### 2.4 Prompt Format
 You can also run `llama model prompt-format` to see all of the templates and their tokens:
 ```
 llama model prompt-format -m Llama3.2-3B-Instruct
 ```
 ![alt text](resources/prompt-format.png)
 You will be shown a Markdown formatted description of the model interface and how prompts / messages are formatted for various scenarios.
 **NOTE**: Outputs in terminal are color printed to show special tokens.
 ## Step 3: Building and Configuring Llama Stack Distributions
 - Please see our [Getting Started](getting_started.md) guide for more details on how to build and start a Llama Stack distribution.
 ### Step 3.1 Build
 In the following steps, imagine we'll be working with a `Llama3.1-8B-Instruct` model. We will name our build `tgi` to help us remember the config. We will start by building our distribution (in the form of a Conda environment or a Docker image). In this step, we will specify:
 - `name`: the name for our distribution (e.g. `tgi`)
 - `image_type`: our build image type (`conda | docker`)
 - `distribution_spec`: our distribution specs for specifying API providers
  - `description`: a short description of the configurations for the distribution
  - `providers`: specifies the underlying implementation for serving each API endpoint
  - `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of Docker image or Conda environment.
 When the build command finishes, a `<name>-build.yaml` file storing the build configuration is generated and saved at the output path printed at the end of the command.
 #### Building from scratch
 - For a new user, we could start off by running `llama stack build`, which launches an interactive wizard that prompts you for the build configuration.
 ```
 llama stack build
 ```
 Running the command above lets you fill in the configuration for your Llama Stack distribution. You will see output like the following:
 ```
 > Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-llama-stack
 > Enter the image type you want your distribution to be built with (docker or conda): conda
 Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
 > Enter the API provider for the inference API: (default=meta-reference): meta-reference
 > Enter the API provider for the safety API: (default=meta-reference): meta-reference
 > Enter the API provider for the agents API: (default=meta-reference): meta-reference
 > Enter the API provider for the memory API: (default=meta-reference): meta-reference
 > Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
 > (Optional) Enter a short description for your Llama Stack distribution:
 Build spec configuration saved at ~/.conda/envs/llamastack-my-local-llama-stack/my-local-llama-stack-build.yaml
 ```
 #### Building from templates
 - To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.
 The following command will allow you to see the available templates and their corresponding providers.
 ```
 llama stack build --list-templates
 ```
 ![alt text](resources/list-templates.png)
 You may then pick a template to build your distribution with providers fitted to your liking.
 ```
 llama stack build --template tgi --image-type conda
 ```
 ```
 $ llama stack build --template tgi --image-type conda
 ...
 ...
 Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml
 You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/envs/llamastack-tgi/tgi-build.yaml`
 ```
 #### Building from config file
 - In addition to templates, you may customize the build by editing a config file and building from it with the following command.
 - The config file has contents like the files in `llama_stack/templates/`.
 ```
 $ cat build.yaml
 name: ollama
 distribution_spec:
  description: Like local, but use ollama for running LLM inference
  providers:
    inference: remote::ollama
    memory: meta-reference
    safety: meta-reference
    agents: meta-reference
    telemetry: meta-reference
 image_type: conda
 ```
 ```
 llama stack build --config build.yaml
 ```
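If you generate several build specs, the file can be rendered from a small helper. This is a minimal sketch, assuming the build spec keeps the shape of the `build.yaml` shown above; `render_build_yaml` is a hypothetical function that uses plain string formatting and does no YAML escaping, so values containing `: ` or `#` would need quoting.

```python
def render_build_yaml(name, description, providers, image_type="conda"):
    """Render a build spec shaped like the `build.yaml` example (hypothetical helper)."""
    if image_type not in ("conda", "docker"):
        raise ValueError(f"image_type must be 'conda' or 'docker', got {image_type!r}")
    lines = [
        f"name: {name}",
        "distribution_spec:",
        f"  description: {description}",
        "  providers:",
    ]
    # One line per API, mapping it to the provider that implements it.
    for api, provider in providers.items():
        lines.append(f"    {api}: {provider}")
    lines.append(f"image_type: {image_type}")
    return "\n".join(lines) + "\n"


spec = render_build_yaml(
    name="ollama",
    description="Like local, but use ollama for running LLM inference",
    providers={
        "inference": "remote::ollama",
        "memory": "meta-reference",
        "safety": "meta-reference",
        "agents": "meta-reference",
        "telemetry": "meta-reference",
    },
)
print(spec)
```

Writing `spec` to `build.yaml` would give a file you could pass to `llama stack build --config build.yaml`.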
 #### How to build distribution with Docker image
 To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type.
 ```
 llama stack build --template tgi --image-type docker
 ```
 Alternatively, you may use a config file: set `image_type` to `docker` in your `<name>-build.yaml` file and run `llama stack build --config <name>-build.yaml`. The `<name>-build.yaml` file will have contents like:
 ```
 name: local-docker-example
 distribution_spec:
  description: Use code from `llama_stack` itself to serve all llama stack APIs
  docker_image: null
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
 image_type: docker
 ```
 The following command allows you to build a Docker image with the name `<name>`
 ```
 llama stack build --config <name>-build.yaml
 Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
 FROM python:3.10-slim
 WORKDIR /app
 ...
 ...
 You can run it with: podman run -p 8000:8000 llamastack-docker-local
 Build spec configuration saved at ~/.llama/distributions/docker/docker-local-build.yaml
 ```
 ### Step 3.2 Configure
 After our distribution is built (either as a Docker image or a Conda environment), we run the following command to configure it:
 ```
 llama stack configure [ <docker-image-name> | <path/to/name-build.yaml>]
 ```
 - For `conda` environments: `<path/to/name-build.yaml>` is the build spec generated in Step 3.1.
 - For `docker` images downloaded from Docker Hub, you can also use `<docker-image-name>` as the argument.
   - Run `docker images` to list the images available on your machine.
 ```
 $ llama stack configure ~/.llama/distributions/conda/tgi-build.yaml
 Configuring API: inference (meta-reference)
 Enter value for model (existing: Llama3.1-8B-Instruct) (required):
 Enter value for quantization (optional):
 Enter value for torch_seed (optional):
 Enter value for max_seq_len (existing: 4096) (required):
 Enter value for max_batch_size (existing: 1) (required):
 Configuring API: memory (meta-reference-faiss)
 Configuring API: safety (meta-reference)
 Do you want to configure llama_guard_shield? (y/n): y
 Entering sub-configuration for llama_guard_shield:
 Enter value for model (default: Llama-Guard-3-1B) (required):
 Enter value for excluded_categories (default: []) (required):
 Enter value for disable_input_check (default: False) (required):
 Enter value for disable_output_check (default: False) (required):
 Do you want to configure prompt_guard_shield? (y/n): y
 Entering sub-configuration for prompt_guard_shield:
 Enter value for model (default: Prompt-Guard-86M) (required):
 Configuring API: agentic_system (meta-reference)
 Enter value for brave_search_api_key (optional):
 Enter value for bing_search_api_key (optional):
 Enter value for wolfram_api_key (optional):
 Configuring API: telemetry (console)
 YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml
 ```
 After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/8b-instruct-run.yaml`. You may edit this file to change the settings.
 As you can see, we did basic configuration above and configured:
 - inference to run on model `Llama3.1-8B-Instruct` (obtained from `llama model list`)
 - Llama Guard safety shield with model `Llama-Guard-3-1B`
 - Prompt Guard safety shield with model `Prompt-Guard-86M`
 To see how these configurations are stored as YAML, check out the file printed at the end of the configuration step.
 Note that all configurations as well as models are stored in `~/.llama`.
 ### Step 3.3 Run
 Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack configure` step.
 ```
 llama stack run ~/.llama/builds/conda/tgi-run.yaml
 ```
 You should see the Llama Stack server start and print the APIs that it supports:
 ```
 $ llama stack run ~/.llama/builds/local/conda/tgi-run.yaml
 > initializing model parallel with size 1
 > initializing ddp with size 1
 > initializing pipeline with size 1
 Loaded in 19.28 seconds
 NCCL version 2.20.5+cuda12.4
 Finished model load YES READY
 Serving POST /inference/batch_chat_completion
 Serving POST /inference/batch_completion
 Serving POST /inference/chat_completion
 Serving POST /inference/completion
 Serving POST /safety/run_shield
 Serving POST /agentic_system/memory_bank/attach
 Serving POST /agentic_system/create
 Serving POST /agentic_system/session/create
 Serving POST /agentic_system/turn/create
 Serving POST /agentic_system/delete
 Serving POST /agentic_system/session/delete
 Serving POST /agentic_system/memory_bank/detach
 Serving POST /agentic_system/session/get
 Serving POST /agentic_system/step/get
 Serving POST /agentic_system/turn/get
 Listening on :::5000
 INFO:     Started server process [453333]
 INFO:     Waiting for application startup.
 INFO:     Application startup complete.
 INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
 ```
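The `Serving <METHOD> <path>` lines in the startup log above can be collected to check which APIs a server exposes. This is a small sketch, assuming the log keeps that line format; `parse_routes` is a hypothetical helper and the sample log is a shortened copy of the output above.

```python
import re

# Matches log lines such as "Serving POST /inference/chat_completion".
SERVING_RE = re.compile(r"^Serving\s+(GET|POST|DELETE|PUT|PATCH)\s+(\S+)$")


def parse_routes(log_text: str) -> list[tuple[str, str]]:
    """Return (method, path) pairs found in the server startup log."""
    routes = []
    for line in log_text.splitlines():
        match = SERVING_RE.match(line.strip())
        if match:
            routes.append((match.group(1), match.group(2)))
    return routes


log = """\
Finished model load YES READY
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /safety/run_shield
Listening on :::5000
"""

routes = parse_routes(log)
# The first path segment names the API (inference, safety, ...).
apis = sorted({path.split("/")[1] for _, path in routes})
print(apis)
```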
 > [!NOTE]
 > Configuration is in `~/.llama/builds/local/conda/tgi-run.yaml`. Feel free to increase `max_seq_len`.
 > [!IMPORTANT]
 > The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
 > [!TIP]
 > You might need to use the flag `--disable-ipv6` to disable IPv6 support.
 This server is running a Llama model locally.
 ### Step 3.4 Test with Client
 Once the server is set up, we can test it with a client to see the example outputs.
 ```
 cd /path/to/llama-stack
 conda activate <env>  # any environment containing the llama-stack pip package will work
 python -m llama_stack.apis.inference.client localhost 5000
 ```
 This will run the chat completion client and query the distribution’s `/inference/chat_completion` API.
 Here is an example output:
 ```
 User>hello world, write me a 2 sentence poem about the moon
 Assistant> Here's a 2-sentence poem about the moon:
 The moon glows softly in the midnight sky,
 A beacon of wonder, as it passes by.
 ```
 Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by:
 ```
 python -m llama_stack.apis.safety.client localhost 5000
 ```
 You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@ -1,230 +0,0 @@
 # Getting Started with Llama Stack
 This guide will walk you through the steps to get started with the end-to-end flow for Llama Stack. It mainly focuses on building a Llama Stack distribution and starting up a Llama Stack server. Please see our [documentation](../README.md) for what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps built with Llama Stack.
 ## Installation
 The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package.
 You have two ways to install this repository:
 1. **Install as a package**:
   You can install the repository directly from [PyPI](https://pypi.org/project/llama-stack/) by running the following command:
   ```bash
   pip install llama-stack
   ```
 2. **Install from source**:
   If you prefer to install from the source code, follow these steps:
   ```bash
    mkdir -p ~/local
    cd ~/local
    git clone git@github.com:meta-llama/llama-stack.git
    conda create -n stack python=3.10
    conda activate stack
    cd llama-stack
    $CONDA_PREFIX/bin/pip install -e .
   ```
 For what you can do with the Llama CLI, please refer to [CLI Reference](./cli_reference.md).
 ## Starting Up Llama Stack Server
 You have two ways to start up a Llama Stack server:
 1. **Starting up server via docker**:
 We provide pre-built Docker images of Llama Stack distributions, which can be found via the links in the [distributions](../distributions/) folder.
 > [!NOTE]
 > For GPU inference, you need to set the following environment variable to specify the local directory containing your model checkpoints, and enable GPU inference when starting the Docker container.
 ```
 export LLAMA_CHECKPOINT_DIR=~/.llama
 ```
 > [!NOTE]
 > `~/.llama` should be the path containing downloaded weights of Llama models.
 To download llama models, use
 ```
 llama download --model-id Llama3.1-8B-Instruct
 ```
 To download and start running a pre-built docker container, you may use the following commands:
 ```
 cd llama-stack/distributions/meta-reference-gpu
 docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
 ```
 > [!TIP]
 > Pro Tip: You can use `docker compose up` to start a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). Check out [these scripts](../distributions/) to help you get started.
 2. **Build->Configure->Run Llama Stack server via conda**:
 	You may also build a LlamaStack distribution from scratch, configure it, and start running the distribution. This is useful for developing on LlamaStack.
 	**`llama stack build`**
 	- You'll be prompted to enter build information interactively.
 	```
 	llama stack build
 	> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack
 	> Enter the image type you want your distribution to be built with (docker or conda): conda
 	Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
 	> Enter the API provider for the inference API: (default=meta-reference): meta-reference
 	> Enter the API provider for the safety API: (default=meta-reference): meta-reference
 	> Enter the API provider for the agents API: (default=meta-reference): meta-reference
 	> Enter the API provider for the memory API: (default=meta-reference): meta-reference
 	> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
 	> (Optional) Enter a short description for your Llama Stack distribution:
 	Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml
 	You can now run `llama stack configure my-local-stack`
 	```
 	**`llama stack configure`**
 	- Run `llama stack configure <name>` with the name you have previously defined in `build` step.
 	```
 	llama stack configure <name>
 	```
 	- You will be prompted to enter configurations for your Llama Stack
 	```
 	$ llama stack configure my-local-stack
 	Configuring API `inference`...
 	=== Configuring provider `meta-reference` for API inference...
 	Enter value for model (default: Llama3.1-8B-Instruct) (required):
 	Do you want to configure quantization? (y/n): n
 	Enter value for torch_seed (optional):
 	Enter value for max_seq_len (default: 4096) (required):
 	Enter value for max_batch_size (default: 1) (required):
 	Configuring API `safety`...
 	=== Configuring provider `meta-reference` for API safety...
 	Do you want to configure llama_guard_shield? (y/n): n
 	Do you want to configure prompt_guard_shield? (y/n): n
 	Configuring API `agents`...
 	=== Configuring provider `meta-reference` for API agents...
 	Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite):
 	Configuring SqliteKVStoreConfig:
 	Enter value for namespace (optional):
 	Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required):
 	Configuring API `memory`...
 	=== Configuring provider `meta-reference` for API memory...
 	> Please enter the supported memory bank type your provider has for memory: vector
 	Configuring API `telemetry`...
 	=== Configuring provider `meta-reference` for API telemetry...
 	> YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml.
 	You can now run `llama stack run my-local-stack --port PORT`
 	```
 	**`llama stack run`**
 	- Run `llama stack run <name>` with the name you have previously defined.
 	```
 	llama stack run my-local-stack
 	...
 	> initializing model parallel with size 1
 	> initializing ddp with size 1
 	> initializing pipeline with size 1
 	...
 	Finished model load YES READY
 	Serving POST /inference/chat_completion
 	Serving POST /inference/completion
 	Serving POST /inference/embeddings
 	Serving POST /memory_banks/create
 	Serving DELETE /memory_bank/documents/delete
 	Serving DELETE /memory_banks/drop
 	Serving GET /memory_bank/documents/get
 	Serving GET /memory_banks/get
 	Serving POST /memory_bank/insert
 	Serving GET /memory_banks/list
 	Serving POST /memory_bank/query
 	Serving POST /memory_bank/update
 	Serving POST /safety/run_shield
 	Serving POST /agentic_system/create
 	Serving POST /agentic_system/session/create
 	Serving POST /agentic_system/turn/create
 	Serving POST /agentic_system/delete
 	Serving POST /agentic_system/session/delete
 	Serving POST /agentic_system/session/get
 	Serving POST /agentic_system/step/get
 	Serving POST /agentic_system/turn/get
 	Serving GET /telemetry/get_trace
 	Serving POST /telemetry/log_event
 	Listening on :::5000
 	INFO:     Started server process [587053]
 	INFO:     Waiting for application startup.
 	INFO:     Application startup complete.
 	INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
 	```
 ## Testing with client
 Once the server is set up, we can test it with a client to see the example outputs.
 ```
 cd /path/to/llama-stack
 conda activate <env>  # any environment containing the llama-stack pip package will work
 python -m llama_stack.apis.inference.client localhost 5000
 ```
 This will run the chat completion client and query the distribution’s `/inference/chat_completion` API.
 Here is an example output:
 ```
 User>hello world, write me a 2 sentence poem about the moon
 Assistant> Here's a 2-sentence poem about the moon:
 The moon glows softly in the midnight sky,
 A beacon of wonder, as it passes by.
 ```
 You may also send a POST request to the server:
 ```
 curl http://localhost:5000/inference/chat_completion \
 -H "Content-Type: application/json" \
 -d '{
 	"model": "Llama3.1-8B-Instruct",
 	"messages": [
 		{"role": "system", "content": "You are a helpful assistant."},
 		{"role": "user", "content": "Write me a 2 sentence poem about the moon"}
 	],
 	"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
 }'
 Output:
 {'completion_message': {'role': 'assistant',
  'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.',
  'stop_reason': 'out_of_tokens',
  'tool_calls': []},
 'logprobs': null}
 ```
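The same request can be built from Python with only the standard library. This is a minimal sketch mirroring the `curl` call above; `build_chat_request` is a hypothetical helper, the host, port, and payload are copied from the example, and the send step is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_chat_request(host: str, port: int, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for the /inference/chat_completion endpoint."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
    }
    return urllib.request.Request(
        url=f"http://{host}:{port}/inference/chat_completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "localhost", 5000, "Llama3.1-8B-Instruct", "Write me a 2 sentence poem about the moon"
)
print(request.get_method(), request.full_url)

# Uncomment to send the request to a running server:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```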
 Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by:
 ```
 python -m llama_stack.apis.safety.client localhost 5000
 ```
 Check out our client SDKs for connecting to a Llama Stack server in your preferred language. You can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
 You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
 ## Advanced Guides
 Please see our [Building a Llama Stack Distribution](./building_distro.md) guide for more details on how to assemble your own Llama Stack Distribution.
--- a/docs/source/building_distro.md
+++ b/docs/source/building_distro.md
--- a/docs/source/cli_reference.md
+++ b/docs/source/cli_reference.md
@ -2,12 +2,12 @@
 The `llama` CLI tool helps you set up and use the Llama Stack & agentic systems. It should be available on your path after installing the `llama-stack` package.
-## Subcommands
+### Subcommands
 1. `download`: The `llama` CLI supports downloading models from Meta or Hugging Face.
 2. `model`: Lists available models and their properties.
-3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this in Step 3 below.
+3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](cli_reference.md#step-3-building-and-configuring-llama-stack-distributions).
-## Sample Usage
+### Sample Usage
 ```
 llama --help
@ -94,7 +94,7 @@ You should see a table like this:
 To download models, you can use the llama download command.
-### Downloading from [Meta](https://llama.meta.com/llama-downloads/)
+#### Downloading from [Meta](https://llama.meta.com/llama-downloads/)
 Here is an example download command to get the 3B-Instruct/11B-Vision-Instruct model. You will need META_URL which can be obtained from [here](https://llama.meta.com/docs/getting_the_models/meta/)
@ -112,7 +112,7 @@ llama download --source meta --model-id Prompt-Guard-86M --meta-url META_URL
 llama download --source meta --model-id Llama-Guard-3-1B --meta-url META_URL
 ```
-### Downloading from [Hugging Face](https://huggingface.co/meta-llama)
+#### Downloading from [Hugging Face](https://huggingface.co/meta-llama)
 Essentially, the same commands above work, just replace `--source meta` with `--source huggingface`.
@ -129,7 +129,7 @@ llama download --source huggingface --model-id Prompt-Guard-86M --ignore-pattern
 > **Tip:** Default for `llama download` is to run with `--ignore-patterns *.safetensors` since we use the `.pth` files in the `original` folder. For Llama Guard and Prompt Guard, however, we need safetensors. Hence, please run with `--ignore-patterns original` so that safetensors are downloaded and `.pth` files are ignored.
-### Downloading via Ollama
+#### Downloading via Ollama
 If you're already using ollama, we also have a supported Llama Stack distribution `local-ollama` and you can continue to use ollama for managing model downloads.
@ -215,7 +215,7 @@ You can even run `llama model prompt-format` see all of the templates and their
 ```
 llama model prompt-format -m Llama3.2-3B-Instruct
 ```
-![alt text](https://github.com/meta-llama/llama-stack/docs/resources/prompt-format.png)
+![alt text](resources/prompt-format.png)
@ -229,8 +229,8 @@ You will be shown a Markdown formatted description of the model interface and ho
 - Please see our [Getting Started](getting_started.md) guide for more details on how to build and start a Llama Stack distribution.
 ### Step 3.1 Build
-In the following steps, imagine we'll be working with a `Llama3.1-8B-Instruct` model. We will name our build `8b-instruct` to help us remember the config. We will start build our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify:
+In the following steps, imagine we'll be working with a `Llama3.1-8B-Instruct` model. We will name our build `tgi` to help us remember the config. We will start building our distribution (in the form of a Conda environment or Docker image). In this step, we will specify:
- `name`: the name for our distribution (e.g. `8b-instruct`)
+- `name`: the name for our distribution (e.g. `tgi`)
 - `image_type`: our build image type (`conda | docker`)
 - `distribution_spec`: our distribution specs for specifying API providers
  - `description`: a short description of the configurations for the distribution
@ -274,16 +274,16 @@ The following command will allow you to see the available templates and their co
 llama stack build --list-templates
 ```
-![alt text](https://github.com/meta-llama/llama-stack/docs/resources/list-templates.png)
+![alt text](resources/list-templates.png)
 You may then pick a template to build your distribution with providers fitted to your liking.
 ```
-llama stack build --template tgi
+llama stack build --template tgi --image-type conda
 ```
 ```
-$ llama stack build --template tgi
+$ llama stack build --template tgi --image-type conda
 ...
 ...
 Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml
@ -293,10 +293,10 @@ You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/e
 #### Building from config file
 - In addition to templates, you may customize the build to your liking through editing config files and build from config files with the following command.
- The config file will be of contents like the ones in `llama_stack/distributions/templates/`.
+- The config file will be of contents like the ones in `llama_stack/templates/`.
 ```
-$ cat llama_stack/templates/ollama/build.yaml
+$ cat build.yaml
 name: ollama
 distribution_spec:
@ -311,7 +311,7 @@ image_type: conda
 ```
 ```
-llama stack build --config llama_stack/templates/ollama/build.yaml
+llama stack build --config build.yaml
 ```
 #### How to build distribution with Docker image
@ -319,7 +319,7 @@ llama stack build --config llama_stack/templates/ollama/build.yaml
 To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type.
 ```
-llama stack build --template local --image-type docker
+llama stack build --template tgi --image-type docker
 ```
 Alternatively, you may use a config file: set `image_type` to `docker` in the `<name>-build.yaml` file and run `llama stack build --config <name>-build.yaml`. The `<name>-build.yaml` file will have contents like:
@ -354,7 +354,7 @@ Build spec configuration saved at ~/.llama/distributions/docker/docker-local-bui
 ### Step 3.2 Configure
 After our distribution is built (either as a Docker image or a conda environment), we will run the following command to configure it:
 ```
-llama stack configure [ <docker-image-name> | <path/to/name.build.yaml>]
+llama stack configure [ <docker-image-name> | <path/to/name-build.yaml>]
 ```
 - For `conda` environments: `<path/to/name-build.yaml>` would be the generated build spec saved from the build step.
 - For `docker` images downloaded from Docker Hub, you could also use `<docker-image-name>` as the argument.
@ -390,10 +390,10 @@ Enter value for wolfram_api_key (optional):
 Configuring API: telemetry (console)
-YAML configuration has been written to ~/.llama/builds/conda/tgi-run.yaml
+YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml
 ```
-After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/tgi-run.yaml` with the following contents. You may edit this file to change the settings.
+After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/8b-instruct-run.yaml` with the following contents. You may edit this file to change the settings.
 As you can see, we did basic configuration above and configured:
 - inference to run on model `Llama3.1-8B-Instruct` (obtained from `llama model list`)
@ -415,7 +415,7 @@ llama stack run ~/.llama/builds/conda/tgi-run.yaml
 You should see the Llama Stack server start and print the APIs that it is supporting
 ```
-$ llama stack run ~/.llama/builds/conda/tgi-run.yaml
+$ llama stack run ~/.llama/builds/local/conda/tgi-run.yaml
 > initializing model parallel with size 1
 > initializing ddp with size 1
--- a/docs/source/developer_cookbook.md
+++ b/docs/source/developer_cookbook.md
--- a/docs/source/getting_started.md
+++ b/docs/source/getting_started.md
@ -1,37 +1,41 @@
-# Getting Started
+# Getting Started with Llama Stack
-This guide will walk you though the steps to get started on end-to-end flow for LlamaStack. This guide mainly focuses on getting started with building a LlamaStack distribution, and starting up a LlamaStack server. Please see our [documentations](https://github.com/meta-llama/llama-stack/README.md) on what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) on examples apps built with Llama Stack.
+This guide will walk you through the steps to get started with the end-to-end flow for Llama Stack. It mainly focuses on building a Llama Stack distribution and starting up a Llama Stack server. Please see our [documentation](../README.md) on what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps built with Llama Stack.
 ## Installation
 The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package.
-You can install this repository as a [package](https://pypi.org/project/llama-stack/) with `pip install llama-stack`
+You have two ways to install this repository:
-If you want to install from source:
+1. **Install as a package**:
   You can install the repository directly from [PyPI](https://pypi.org/project/llama-stack/) by running the following command:
   ```bash
   pip install llama-stack
   ```
-```bash
+2. **Install from source**:
-mkdir -p ~/local
+   If you prefer to install from the source code, follow these steps:
-cd ~/local
+   ```bash
-git clone git@github.com:meta-llama/llama-stack.git
+    mkdir -p ~/local
    cd ~/local
    git clone git@github.com:meta-llama/llama-stack.git
-conda create -n stack python=3.10
+    conda create -n stack python=3.10
-conda activate stack
+    conda activate stack
-cd llama-stack
+    cd llama-stack
-$CONDA_PREFIX/bin/pip install -e .
+    $CONDA_PREFIX/bin/pip install -e .
-```
+   ```
 For what you can do with the Llama CLI, please refer to [CLI Reference](./cli_reference.md).
-## Quick Starting Llama Stack Server
+## Starting Up Llama Stack Server
-### Starting up server via docker
+You have two ways to start up the Llama Stack server:
-We provide 2 pre-built Docker image of Llama Stack distribution, which can be found in the following links.
+1. **Starting up server via docker**:
- [llamastack-local-gpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-gpu/general)
+
-  - This is a packaged version with our local meta-reference implementations, where you will be running inference locally with downloaded Llama model checkpoints.
+We provide pre-built Docker images of Llama Stack distributions, which can be found in the [distributions](../distributions/) folder.
 - [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general)
   - This is a lite version with remote inference where you can hook up to your favourite remote inference framework (e.g. ollama, fireworks, together, tgi) for running inference without GPU.
 > [!NOTE]
 > For GPU inference, you need to set these environment variables to specify the local directory containing your model checkpoints and to enable GPU inference when starting the Docker container.
@ -42,362 +46,132 @@ export LLAMA_CHECKPOINT_DIR=~/.llama
 > [!NOTE]
 > `~/.llama` should be the path containing downloaded weights of Llama models.
 To download llama models, use
 ```
 llama download --model-id Llama3.1-8B-Instruct
 ```
 To download and start running a pre-built docker container, you may use the following commands:
 ```
-docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack/llamastack-local-gpu
+cd llama-stack/distributions/meta-reference-gpu
 docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
 ```
 > [!TIP]
-> Pro Tip: We may use `docker compose up` for starting up a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). You can checkout [these scripts](https://github.com/meta-llama/llama-stack/llama_stack/distribution/docker/README.md) to help you get started.
+> Pro Tip: We may use `docker compose up` to start a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). You can check out [these scripts](../distributions/) to help you get started.
-
+
-### Build->Configure->Run Llama Stack server via conda
+
-You may also build a LlamaStack distribution from scratch, configure it, and start running the distribution. This is useful for developing on LlamaStack.
+2. **Build->Configure->Run Llama Stack server via conda**:
-
+
-**`llama stack build`**
+	You may also build a LlamaStack distribution from scratch, configure it, and start running the distribution. This is useful for developing on LlamaStack.
- You'll be prompted to enter build information interactively.
+
-```
+	**`llama stack build`**
-llama stack build
+	- You'll be prompted to enter build information interactively.
-
+	```
-> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack
+	llama stack build
-> Enter the image type you want your distribution to be built with (docker or conda): conda
+
-
+	> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack
- Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
+	> Enter the image type you want your distribution to be built with (docker or conda): conda
-> Enter the API provider for the inference API: (default=meta-reference): meta-reference
+
-> Enter the API provider for the safety API: (default=meta-reference): meta-reference
+	Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
-> Enter the API provider for the agents API: (default=meta-reference): meta-reference
+	> Enter the API provider for the inference API: (default=meta-reference): meta-reference
-> Enter the API provider for the memory API: (default=meta-reference): meta-reference
+	> Enter the API provider for the safety API: (default=meta-reference): meta-reference
-> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
+	> Enter the API provider for the agents API: (default=meta-reference): meta-reference
-
+	> Enter the API provider for the memory API: (default=meta-reference): meta-reference
- > (Optional) Enter a short description for your Llama Stack distribution:
+	> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
-
+
-Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml
+	> (Optional) Enter a short description for your Llama Stack distribution:
-You can now run `llama stack configure my-local-stack`
+
-```
+	Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml
-
+	You can now run `llama stack configure my-local-stack`
-**`llama stack configure`**
+	```
- Run `llama stack configure <name>` with the name you have previously defined in `build` step.
+
-```
+	**`llama stack configure`**
-llama stack configure <name>
+	- Run `llama stack configure <name>` with the name you previously defined in the `build` step.
-```
+	```
- You will be prompted to enter configurations for your Llama Stack
+	llama stack configure <name>
-
+	```
-```
+	- You will be prompted to enter configurations for your Llama Stack
-$ llama stack configure my-local-stack
+
-
+	```
-Configuring API `inference`...
+	$ llama stack configure my-local-stack
-=== Configuring provider `meta-reference` for API inference...
+
-Enter value for model (default: Llama3.1-8B-Instruct) (required):
+	Configuring API `inference`...
-Do you want to configure quantization? (y/n): n
+	=== Configuring provider `meta-reference` for API inference...
-Enter value for torch_seed (optional):
+	Enter value for model (default: Llama3.1-8B-Instruct) (required):
-Enter value for max_seq_len (default: 4096) (required):
+	Do you want to configure quantization? (y/n): n
-Enter value for max_batch_size (default: 1) (required):
+	Enter value for torch_seed (optional):
-
+	Enter value for max_seq_len (default: 4096) (required):
-Configuring API `safety`...
+	Enter value for max_batch_size (default: 1) (required):
-=== Configuring provider `meta-reference` for API safety...
+
-Do you want to configure llama_guard_shield? (y/n): n
+	Configuring API `safety`...
-Do you want to configure prompt_guard_shield? (y/n): n
+	=== Configuring provider `meta-reference` for API safety...
-
+	Do you want to configure llama_guard_shield? (y/n): n
-Configuring API `agents`...
+	Do you want to configure prompt_guard_shield? (y/n): n
-=== Configuring provider `meta-reference` for API agents...
+
-Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite):
+	Configuring API `agents`...
-
+	=== Configuring provider `meta-reference` for API agents...
-Configuring SqliteKVStoreConfig:
+	Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite):
-Enter value for namespace (optional):
+
-Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required):
+	Configuring SqliteKVStoreConfig:
-
+	Enter value for namespace (optional):
-Configuring API `memory`...
+	Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required):
-=== Configuring provider `meta-reference` for API memory...
+
-> Please enter the supported memory bank type your provider has for memory: vector
+	Configuring API `memory`...
-
+	=== Configuring provider `meta-reference` for API memory...
-Configuring API `telemetry`...
+	> Please enter the supported memory bank type your provider has for memory: vector
-=== Configuring provider `meta-reference` for API telemetry...
+
-
+	Configuring API `telemetry`...
-> YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml.
+	=== Configuring provider `meta-reference` for API telemetry...
-You can now run `llama stack run my-local-stack --port PORT`
+
-```
+	> YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml.
-
+	You can now run `llama stack run my-local-stack --port PORT`
-**`llama stack run`**
+	```
- Run `llama stack run <name>` with the name you have previously defined.
+
-```
+	**`llama stack run`**
-llama stack run my-local-stack
+	- Run `llama stack run <name>` with the name you have previously defined.
-
+	```
-...
+	llama stack run my-local-stack
-> initializing model parallel with size 1
+
-> initializing ddp with size 1
+	...
-> initializing pipeline with size 1
+	> initializing model parallel with size 1
-...
+	> initializing ddp with size 1
-Finished model load YES READY
+	> initializing pipeline with size 1
-Serving POST /inference/chat_completion
+	...
-Serving POST /inference/completion
+	Finished model load YES READY
-Serving POST /inference/embeddings
+	Serving POST /inference/chat_completion
-Serving POST /memory_banks/create
+	Serving POST /inference/completion
-Serving DELETE /memory_bank/documents/delete
+	Serving POST /inference/embeddings
-Serving DELETE /memory_banks/drop
+	Serving POST /memory_banks/create
-Serving GET /memory_bank/documents/get
+	Serving DELETE /memory_bank/documents/delete
-Serving GET /memory_banks/get
+	Serving DELETE /memory_banks/drop
-Serving POST /memory_bank/insert
+	Serving GET /memory_bank/documents/get
-Serving GET /memory_banks/list
+	Serving GET /memory_banks/get
-Serving POST /memory_bank/query
+	Serving POST /memory_bank/insert
-Serving POST /memory_bank/update
+	Serving GET /memory_banks/list
-Serving POST /safety/run_shield
+	Serving POST /memory_bank/query
-Serving POST /agentic_system/create
+	Serving POST /memory_bank/update
-Serving POST /agentic_system/session/create
+	Serving POST /safety/run_shield
-Serving POST /agentic_system/turn/create
+	Serving POST /agentic_system/create
-Serving POST /agentic_system/delete
+	Serving POST /agentic_system/session/create
-Serving POST /agentic_system/session/delete
+	Serving POST /agentic_system/turn/create
-Serving POST /agentic_system/session/get
+	Serving POST /agentic_system/delete
-Serving POST /agentic_system/step/get
+	Serving POST /agentic_system/session/delete
-Serving POST /agentic_system/turn/get
+	Serving POST /agentic_system/session/get
-Serving GET /telemetry/get_trace
+	Serving POST /agentic_system/step/get
-Serving POST /telemetry/log_event
+	Serving POST /agentic_system/turn/get
-Listening on :::5000
+	Serving GET /telemetry/get_trace
-INFO:     Started server process [587053]
+	Serving POST /telemetry/log_event
-INFO:     Waiting for application startup.
+	Listening on :::5000
-INFO:     Application startup complete.
+	INFO:     Started server process [587053]
-INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
+	INFO:     Waiting for application startup.
-```
+	INFO:     Application startup complete.
-
+	INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-### End-to-end flow of building, configuring, running, and testing a Distribution
+	```
-
+
-#### Step 1. Build
+
-In the following steps, imagine we'll be working with a `Meta-Llama3.1-8B-Instruct` model. We will name our build `8b-instruct` to help us remember the config. We will start build our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify:
+## Testing with client
 - `name`: the name for our distribution (e.g. `8b-instruct`)
 - `image_type`: our build image type (`conda | docker`)
 - `distribution_spec`: our distribution specs for specifying API providers
  - `description`: a short description of the configurations for the distribution
  - `providers`: specifies the underlying implementation for serving each API endpoint
  - `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of Docker image or Conda environment.
 At the end of the build command, a `<name>-build.yaml` file storing the build configuration is generated and saved at the output path printed at the end of the command.
 #### Building from scratch
 - For a new user, we could start off by running `llama stack build`, which launches an interactive wizard that prompts you for the build configuration.
 ```
 llama stack build
 ```
 Running the command above lets you fill in the configuration used to build your Llama Stack distribution. You will see the following output.
 ```
 > Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): 8b-instruct
 > Enter the image type you want your distribution to be built with (docker or conda): conda
 Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
 > Enter the API provider for the inference API: (default=meta-reference): meta-reference
 > Enter the API provider for the safety API: (default=meta-reference): meta-reference
 > Enter the API provider for the agents API: (default=meta-reference): meta-reference
 > Enter the API provider for the memory API: (default=meta-reference): meta-reference
 > Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
 > (Optional) Enter a short description for your Llama Stack distribution:
 Build spec configuration saved at ~/.conda/envs/llamastack-my-local-llama-stack/8b-instruct-build.yaml
 ```
 **Ollama (optional)**
 If you plan to use Ollama for inference, you'll need to install the server [via these instructions](https://ollama.com/download).
 #### Building from templates
 - To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.
 The following command will allow you to see the available templates and their corresponding providers.
 ```
 llama stack build --list-templates
 ```
 ![alt text](https://github.com/meta-llama/llama-stack/docs/resources/list-templates.png)
 You may then pick a template to build your distribution with providers fitted to your liking.
 ```
 llama stack build --template tgi
 ```
 ```
 $ llama stack build --template tgi
 ...
 ...
 Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml
 You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/envs/llamastack-tgi/tgi-build.yaml`
 ```
 #### Building from config file
 - In addition to templates, you may customize the build to your liking through editing config files and build from config files with the following command.
 - The config file will have contents like the ones in `llama_stack/templates/`.
 ```
 $ cat llama_stack/templates/ollama/build.yaml
 name: ollama
 distribution_spec:
  description: Like local, but use ollama for running LLM inference
  providers:
    inference: remote::ollama
    memory: meta-reference
    safety: meta-reference
    agents: meta-reference
    telemetry: meta-reference
 image_type: conda
 ```
 ```
 llama stack build --config llama_stack/templates/ollama/build.yaml
 ```
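As a rough illustration of the fields listed in the build step, the sketch below checks that a build spec (already loaded into a Python dict, for example with a YAML parser) has the expected top-level keys. The function `check_build_spec` and its messages are hypothetical; this is not the validation that `llama stack build` itself performs.

```python
# Hypothetical helper: a sanity check for a build spec loaded as a dict.
REQUIRED_KEYS = ("name", "distribution_spec", "image_type")
IMAGE_TYPES = ("conda", "docker")


def check_build_spec(spec):
    """Return a list of problems found in the spec (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in spec]
    if "image_type" in spec and spec["image_type"] not in IMAGE_TYPES:
        problems.append(f"image_type must be one of {IMAGE_TYPES}")
    providers = spec.get("distribution_spec", {}).get("providers")
    if "distribution_spec" in spec and not providers:
        problems.append("distribution_spec.providers is empty")
    return problems


# The ollama example, written out as a dict.
ollama_spec = {
    "name": "ollama",
    "distribution_spec": {
        "description": "Like local, but use ollama for running LLM inference",
        "providers": {
            "inference": "remote::ollama",
            "memory": "meta-reference",
            "safety": "meta-reference",
            "agents": "meta-reference",
            "telemetry": "meta-reference",
        },
    },
    "image_type": "conda",
}

print(check_build_spec(ollama_spec))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a hand-edited config in a single pass.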
 #### How to build distribution with Docker image
 > [!TIP]
 > Podman is supported as an alternative to Docker. Set `DOCKER_BINARY` to `podman` in your environment to use Podman.
 To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type.
 ```
 llama stack build --template tgi --image-type docker
 ```
 Alternatively, you may use a config file: set `image_type` to `docker` in the `<name>-build.yaml` file and run `llama stack build --config <name>-build.yaml`. The `<name>-build.yaml` file will have contents like:
 ```
 name: local-docker-example
 distribution_spec:
  description: Use code from `llama_stack` itself to serve all llama stack APIs
  docker_image: null
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
 image_type: docker
 ```
 The following command allows you to build a Docker image with the name `<name>`
 ```
 llama stack build --config <name>-build.yaml
 Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
 FROM python:3.10-slim
 WORKDIR /app
 ...
 ...
 You can run it with: podman run -p 8000:8000 llamastack-docker-local
 Build spec configuration saved at ~/.llama/distributions/docker/docker-local-build.yaml
 ```
 ### Step 2. Configure
 After our distribution is built (either as a Docker image or a conda environment), we will run the following command to configure it:
 ```
 llama stack configure [ <docker-image-name> | <path/to/name.build.yaml>]
 ```
 - For `conda` environments: `<path/to/name-build.yaml>` would be the generated build spec saved from Step 1.
 - For `docker` images downloaded from Docker Hub, you could also use `<docker-image-name>` as the argument.
   - Run `docker images` to check list of available images on your machine.
 ```
 $ llama stack configure tgi
 Configuring API: inference (meta-reference)
 Enter value for model (existing: Meta-Llama3.1-8B-Instruct) (required):
 Enter value for quantization (optional):
 Enter value for torch_seed (optional):
 Enter value for max_seq_len (existing: 4096) (required):
 Enter value for max_batch_size (existing: 1) (required):
 Configuring API: memory (meta-reference-faiss)
 Configuring API: safety (meta-reference)
 Do you want to configure llama_guard_shield? (y/n): y
 Entering sub-configuration for llama_guard_shield:
 Enter value for model (default: Llama-Guard-3-1B) (required):
 Enter value for excluded_categories (default: []) (required):
 Enter value for disable_input_check (default: False) (required):
 Enter value for disable_output_check (default: False) (required):
 Do you want to configure prompt_guard_shield? (y/n): y
 Entering sub-configuration for prompt_guard_shield:
 Enter value for model (default: Prompt-Guard-86M) (required):
 Configuring API: agentic_system (meta-reference)
 Enter value for brave_search_api_key (optional):
 Enter value for bing_search_api_key (optional):
 Enter value for wolfram_api_key (optional):
 Configuring API: telemetry (console)
 YAML configuration has been written to ~/.llama/builds/conda/tgi-run.yaml
 ```
 After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/tgi-run.yaml` with the following contents. You may edit this file to change the settings.
 As you can see, we did basic configuration above and configured:
 - inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`)
 - Llama Guard safety shield with model `Llama-Guard-3-1B`
 - Prompt Guard safety shield with model `Prompt-Guard-86M`
 To see how these configurations are stored as YAML, check out the file printed at the end of the configuration.
 Note that all configurations as well as models are stored in `~/.llama`
 ### Step 3. Run
 Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack configure` step.
 ```
 llama stack run tgi
 ```
 You should see the Llama Stack server start and print the APIs that it is supporting
 ```
 $ llama stack run tgi
 > initializing model parallel with size 1
 > initializing ddp with size 1
 > initializing pipeline with size 1
 Loaded in 19.28 seconds
 NCCL version 2.20.5+cuda12.4
 Finished model load YES READY
 Serving POST /inference/batch_chat_completion
 Serving POST /inference/batch_completion
 Serving POST /inference/chat_completion
 Serving POST /inference/completion
 Serving POST /safety/run_shield
 Serving POST /agentic_system/memory_bank/attach
 Serving POST /agentic_system/create
 Serving POST /agentic_system/session/create
 Serving POST /agentic_system/turn/create
 Serving POST /agentic_system/delete
 Serving POST /agentic_system/session/delete
 Serving POST /agentic_system/memory_bank/detach
 Serving POST /agentic_system/session/get
 Serving POST /agentic_system/step/get
 Serving POST /agentic_system/turn/get
 Listening on :::5000
 INFO:     Started server process [453333]
 INFO:     Waiting for application startup.
 INFO:     Application startup complete.
 INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
 ```
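The startup log lists every route the server exposes. As a small sketch, the routes can be extracted from captured output as shown below. The helper `parse_routes` is hypothetical and assumes the `Serving <METHOD> <path>` line format seen in the log output of the run step.

```python
def parse_routes(log_text):
    """Extract (method, path) pairs from 'Serving METHOD /path' log lines."""
    routes = []
    for line in log_text.splitlines():
        parts = line.split()
        # Keep only lines of the exact form: Serving <METHOD> <path>
        if len(parts) == 3 and parts[0] == "Serving":
            routes.append((parts[1], parts[2]))
    return routes


# A short excerpt in the same format as the server's startup output.
sample = """\
Finished model load YES READY
Serving POST /inference/chat_completion
Serving POST /safety/run_shield
Listening on :::5000
"""

for method, path in parse_routes(sample):
    print(method, path)
```

This can be handy for confirming that an API you configured, such as safety, is actually being served before pointing a client at it.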
 > [!NOTE]
 > Configuration is in `~/.llama/builds/local/conda/8b-instruct-run.yaml`. Feel free to increase `max_seq_len`.
 > [!IMPORTANT]
 > The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
 > [!TIP]
 > You might need to use the flag `--disable-ipv6` to disable IPv6 support.
 This server is running a Llama model locally.
 ### Step 4. Test with Client
 Once the server is set up, we can test it with a client to see example outputs.
 ```
 cd /path/to/llama-stack
@ -406,7 +180,7 @@ conda activate <env>  # any environment containing the llama-stack pip package w
 python -m llama_stack.apis.inference.client localhost 5000
 ```
-This will run the chat completion client and query the distribution’s /inference/chat_completion API.
+This will run the chat completion client and query the distribution’s `/inference/chat_completion` API.
 Here is an example output:
 ```
@ -417,6 +191,29 @@ The moon glows softly in the midnight sky,
 A beacon of wonder, as it passes by.
 ```
 You may also send a POST request to the server:
 ```
 curl http://localhost:5000/inference/chat_completion \
 -H "Content-Type: application/json" \
 -d '{
 	"model": "Llama3.1-8B-Instruct",
 	"messages": [
 		{"role": "system", "content": "You are a helpful assistant."},
 		{"role": "user", "content": "Write me a 2 sentence poem about the moon"}
 	],
 	"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
 }'
 Output:
 {'completion_message': {'role': 'assistant',
  'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.',
  'stop_reason': 'out_of_tokens',
  'tool_calls': []},
 'logprobs': null}
 ```
 Similarly, you can test safety (if you configured llama-guard and/or prompt-guard shields) by running:
 ```
@ -427,3 +224,7 @@ python -m llama_stack.apis.safety.client localhost 5000
 Check out our client SDKs for connecting to a Llama Stack server in your preferred language. You can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
 You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
 ## Advanced Guides
 Please see our [Building a Llama Stack Distribution](./building_distro.md) guide for more details on how to assemble your own Llama Stack Distribution.
--- a/docs/source/providers_dev.md
+++ b/docs/source/providers_dev.md