diff --git a/docs/cli_reference.md b/docs/cli_reference.md
deleted file mode 100644
index 39ac99615..000000000
--- a/docs/cli_reference.md
+++ /dev/null
@@ -1,485 +0,0 @@
-# Llama CLI Reference
-
-The `llama` CLI tool helps you set up and use the Llama Stack & agentic systems. It should be available on your path after installing the `llama-stack` package.
-
-### Subcommands
-1. `download`: Supports downloading models from Meta or Hugging Face.
-2. `model`: Lists available models and their properties.
-3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](cli_reference.md#step-3-building-and-configuring-llama-stack-distributions).
-
-### Sample Usage
-
-```
-llama --help
-```
-
-usage: llama [-h] {download,model,stack} ...
-
-Welcome to the Llama CLI
-
-options:
-  -h, --help            show this help message and exit
-
-subcommands:
-  {download,model,stack}
-
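If you prefer to drive these commands from a script rather than an interactive shell, the CLI can also be invoked as a subprocess. The snippet below is a minimal sketch, assuming `llama` is on your `PATH` after installing the `llama-stack` package; the filtering shown is purely illustrative.

```python
import subprocess

# Run `llama model list` (used in Step 1 below) and capture its output.
# Assumes the `llama` CLI is on PATH after `pip install llama-stack`.
result = subprocess.run(
    ["llama", "model", "list"],
    capture_output=True,
    text=True,
    check=True,
)

# Print only the table rows that mention instruct models, as a quick filter.
for line in result.stdout.splitlines():
    if "Instruct" in line:
        print(line)
```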
-
-## Step 1. Get the models
-
-You first need to have models downloaded locally.
-
-To download any model you need the **Model Descriptor**.
-This can be obtained by running the command
-```
-llama model list
-```
-
-You should see a table like this:
-
-+----------------------------------+------------------------------------------+----------------+
-| Model Descriptor                 | Hugging Face Repo                        | Context Length |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-8B                      | meta-llama/Llama-3.1-8B                  | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-70B                     | meta-llama/Llama-3.1-70B                 | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B:bf16-mp8           | meta-llama/Llama-3.1-405B                | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B                    | meta-llama/Llama-3.1-405B-FP8            | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B:bf16-mp16          | meta-llama/Llama-3.1-405B                | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-8B-Instruct             | meta-llama/Llama-3.1-8B-Instruct         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-70B-Instruct            | meta-llama/Llama-3.1-70B-Instruct        | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B-Instruct:bf16-mp8  | meta-llama/Llama-3.1-405B-Instruct       | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B-Instruct           | meta-llama/Llama-3.1-405B-Instruct-FP8   | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.1-405B-Instruct:bf16-mp16 | meta-llama/Llama-3.1-405B-Instruct       | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-1B                      | meta-llama/Llama-3.2-1B                  | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-3B                      | meta-llama/Llama-3.2-3B                  | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-11B-Vision              | meta-llama/Llama-3.2-11B-Vision          | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-90B-Vision              | meta-llama/Llama-3.2-90B-Vision          | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-1B-Instruct             | meta-llama/Llama-3.2-1B-Instruct         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-3B-Instruct             | meta-llama/Llama-3.2-3B-Instruct         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-11B-Vision-Instruct     | meta-llama/Llama-3.2-11B-Vision-Instruct | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama3.2-90B-Vision-Instruct     | meta-llama/Llama-3.2-90B-Vision-Instruct | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-11B-Vision         | meta-llama/Llama-Guard-3-11B-Vision      | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-1B:int4-mp1        | meta-llama/Llama-Guard-3-1B-INT4         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-1B                 | meta-llama/Llama-Guard-3-1B              | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-8B                 | meta-llama/Llama-Guard-3-8B              | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-3-8B:int8-mp1        | meta-llama/Llama-Guard-3-8B-INT8         | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Prompt-Guard-86M                 | meta-llama/Prompt-Guard-86M              | 128K           |
-+----------------------------------+------------------------------------------+----------------+
-| Llama-Guard-2-8B                 | meta-llama/Llama-Guard-2-8B              | 4K             |
-+----------------------------------+------------------------------------------+----------------+
-
- -To download models, you can use the llama download command. - -#### Downloading from [Meta](https://llama.meta.com/llama-downloads/) - -Here is an example download command to get the 3B-Instruct/11B-Vision-Instruct model. You will need META_URL which can be obtained from [here](https://llama.meta.com/docs/getting_the_models/meta/) - -Download the required checkpoints using the following commands: -```bash -# download the 8B model, this can be run on a single GPU -llama download --source meta --model-id Llama3.2-3B-Instruct --meta-url META_URL - -# you can also get the 70B model, this will require 8 GPUs however -llama download --source meta --model-id Llama3.2-11B-Vision-Instruct --meta-url META_URL - -# llama-agents have safety enabled by default. For this, you will need -# safety models -- Llama-Guard and Prompt-Guard -llama download --source meta --model-id Prompt-Guard-86M --meta-url META_URL -llama download --source meta --model-id Llama-Guard-3-1B --meta-url META_URL -``` - -#### Downloading from [Hugging Face](https://huggingface.co/meta-llama) - -Essentially, the same commands above work, just replace `--source meta` with `--source huggingface`. - -```bash -llama download --source huggingface --model-id Llama3.1-8B-Instruct --hf-token - -llama download --source huggingface --model-id Llama3.1-70B-Instruct --hf-token - -llama download --source huggingface --model-id Llama-Guard-3-1B --ignore-patterns *original* -llama download --source huggingface --model-id Prompt-Guard-86M --ignore-patterns *original* -``` - -**Important:** Set your environment variable `HF_TOKEN` or pass in `--hf-token` to the command to validate your access. You can find your token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens). - -> **Tip:** Default for `llama download` is to run with `--ignore-patterns *.safetensors` since we use the `.pth` files in the `original` folder. For Llama Guard and Prompt Guard, however, we need safetensors. Hence, please run with `--ignore-patterns original` so that safetensors are downloaded and `.pth` files are ignored. - -#### Downloading via Ollama - -If you're already using ollama, we also have a supported Llama Stack distribution `local-ollama` and you can continue to use ollama for managing model downloads. - -``` -ollama pull llama3.1:8b-instruct-fp16 -ollama pull llama3.1:70b-instruct-fp16 -``` - -> [!NOTE] -> Only the above two models are currently supported by Ollama. - - -## Step 2: Understand the models -The `llama model` command helps you explore the model’s interface. - -### 2.1 Subcommands -1. `download`: Download the model from different sources. (meta, huggingface) -2. `list`: Lists all the models available for download with hardware requirements to deploy the models. -3. `prompt-format`: Show llama model message formats. -4. `describe`: Describes all the properties of the model. - -### 2.2 Sample Usage - -`llama model ` - -``` -llama model --help -``` -
-usage: llama model [-h] {download,list,prompt-format,describe} ...
-
-Work with llama models
-
-options:
-  -h, --help            show this help message and exit
-
-model_subcommands:
-  {download,list,prompt-format,describe}
-
-
-You can use the describe command to learn more about a model:
-```
-llama model describe -m Llama3.2-3B-Instruct
-```
-### 2.3 Describe
-
-+-----------------------------+----------------------------------+
-| Model                       | Llama3.2-3B-Instruct             |
-+-----------------------------+----------------------------------+
-| Hugging Face ID             | meta-llama/Llama-3.2-3B-Instruct |
-+-----------------------------+----------------------------------+
-| Description                 | Llama 3.2 3b instruct model      |
-+-----------------------------+----------------------------------+
-| Context Length              | 128K tokens                      |
-+-----------------------------+----------------------------------+
-| Weights format              | bf16                             |
-+-----------------------------+----------------------------------+
-| Model params.json           | {                                |
-|                             |     "dim": 3072,                 |
-|                             |     "n_layers": 28,              |
-|                             |     "n_heads": 24,               |
-|                             |     "n_kv_heads": 8,             |
-|                             |     "vocab_size": 128256,        |
-|                             |     "ffn_dim_multiplier": 1.0,   |
-|                             |     "multiple_of": 256,          |
-|                             |     "norm_eps": 1e-05,           |
-|                             |     "rope_theta": 500000.0,      |
-|                             |     "use_scaled_rope": true      |
-|                             | }                                |
-+-----------------------------+----------------------------------+
-| Recommended sampling params | {                                |
-|                             |     "strategy": "top_p",         |
-|                             |     "temperature": 1.0,          |
-|                             |     "top_p": 0.9,                |
-|                             |     "top_k": 0                   |
-|                             | }                                |
-+-----------------------------+----------------------------------+
-
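The "Recommended sampling params" shown by `llama model describe` can be carried over into requests against a running Llama Stack server. The snippet below is a minimal, hypothetical sketch: it assumes a server is already listening on `localhost:5000` (see Step 3), and that the `sampling_params` field of `/inference/chat_completion` accepts the keys shown by describe — the curl example later in this guide confirms at least `temperature`, `seed`, and `max_tokens`.

```python
import requests

# Values copied from the `llama model describe` output above; whether the
# server accepts every key as-is is an assumption of this sketch.
sampling_params = {
    "strategy": "top_p",
    "temperature": 1.0,
    "top_p": 0.9,
    "top_k": 0,
}

payload = {
    "model": "Llama3.2-3B-Instruct",
    "messages": [
        {"role": "user", "content": "Write me a 2 sentence poem about the moon"},
    ],
    "sampling_params": sampling_params,
}

# Assumes a Llama Stack server is running locally (see Step 3.3).
resp = requests.post("http://localhost:5000/inference/chat_completion", json=payload)
print(resp.json())
```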
-### 2.4 Prompt Format
-You can even run `llama model prompt-format` to see all of the templates and their tokens:
-
-```
-llama model prompt-format -m Llama3.2-3B-Instruct
-```
-![alt text](resources/prompt-format.png)
-
-
-
-You will be shown a Markdown formatted description of the model interface and how prompts / messages are formatted for various scenarios.
-
-**NOTE**: Outputs in terminal are color printed to show special tokens.
-
-
-## Step 3: Building and Configuring Llama Stack Distributions
-
-- Please see our [Getting Started](getting_started.md) guide for more details on how to build and start a Llama Stack distribution.
-
-### Step 3.1 Build
-In the following steps, imagine we'll be working with a `Llama3.1-8B-Instruct` model. We will name our build `tgi` to help us remember the config. We will start building our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify:
-- `name`: the name for our distribution (e.g. `tgi`)
-- `image_type`: our build image type (`conda | docker`)
-- `distribution_spec`: our distribution specs for specifying API providers
-  - `description`: a short description of the configurations for the distribution
-  - `providers`: specifies the underlying implementation for serving each API endpoint
-  - `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of a Docker image or Conda environment.
-
-
-At the end of the build command, we will generate a `<name>-build.yaml` file storing the build configurations.
-
-After this step is complete, a file named `<name>-build.yaml` will be generated and saved at the output file path specified at the end of the command.
-
-#### Building from scratch
-- For a new user, we could start off with running `llama stack build`, which will launch an interactive wizard where you will be prompted to enter build configurations.
-```
-llama stack build
-```
-
-Running the command above will allow you to fill in the configuration to build your Llama Stack distribution; you will see the following outputs.
-
-```
-> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-llama-stack
-> Enter the image type you want your distribution to be built with (docker or conda): conda
-
- Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
-> Enter the API provider for the inference API: (default=meta-reference): meta-reference
-> Enter the API provider for the safety API: (default=meta-reference): meta-reference
-> Enter the API provider for the agents API: (default=meta-reference): meta-reference
-> Enter the API provider for the memory API: (default=meta-reference): meta-reference
-> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
-
- > (Optional) Enter a short description for your Llama Stack distribution:
-
-Build spec configuration saved at ~/.conda/envs/llamastack-my-local-llama-stack/my-local-llama-stack-build.yaml
-```
-
-#### Building from templates
-- To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.
-
-The following command will allow you to see the available templates and their corresponding providers.
-```
-llama stack build --list-templates
-```
-
-![alt text](resources/list-templates.png)
-
-You may then pick a template to build your distribution with providers fitted to your liking.
-
-```
-llama stack build --template tgi --image-type conda
-```
-
-```
-$ llama stack build --template tgi --image-type conda
-...
-...
-Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml
-You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/envs/llamastack-tgi/tgi-build.yaml`
-```
-
-#### Building from config file
-- In addition to templates, you may customize the build to your liking by editing config files and building from a config file with the following command.
-
-- The config file will have contents like the ones in `llama_stack/templates/`.
-
-```
-$ cat build.yaml
-
-name: ollama
-distribution_spec:
-  description: Like local, but use ollama for running LLM inference
-  providers:
-    inference: remote::ollama
-    memory: meta-reference
-    safety: meta-reference
-    agents: meta-reference
-    telemetry: meta-reference
-image_type: conda
-```
-
-```
-llama stack build --config build.yaml
-```
-
-#### How to build a distribution with a Docker image
-
-To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type.
-
-```
-llama stack build --template tgi --image-type docker
-```
-
-Alternatively, you may use a config file: set `image_type` to `docker` in your `<name>-build.yaml` file, and run `llama stack build --config <name>-build.yaml`. The `<name>-build.yaml` will have contents like:
-
-```
-name: local-docker-example
-distribution_spec:
-  description: Use code from `llama_stack` itself to serve all llama stack APIs
-  docker_image: null
-  providers:
-    inference: meta-reference
-    memory: meta-reference-faiss
-    safety: meta-reference
-    agentic_system: meta-reference
-    telemetry: console
-image_type: docker
-```
-
-The following command allows you to build a Docker image with the name `<name>`:
-```
-llama stack build --config <name>-build.yaml
-
-Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
-FROM python:3.10-slim
-WORKDIR /app
-...
-...
-You can run it with: podman run -p 8000:8000 llamastack-docker-local
-Build spec configuration saved at ~/.llama/distributions/docker/docker-local-build.yaml
-```
-
-
-### Step 3.2 Configure
-After our distribution is built (either as a Docker image or a Conda environment), we will run the following command to configure it:
-```
-llama stack configure [ <name> | <docker-image-name> ]
-```
-- For `conda` environments: `<name>` would be the generated build spec saved from Step 1.
-- For `docker` images downloaded from Dockerhub, you could also use `<docker-image-name>` as the argument.
-   - Run `docker images` to check the list of available images on your machine.
-
-```
-$ llama stack configure ~/.llama/distributions/conda/tgi-build.yaml
-
-Configuring API: inference (meta-reference)
-Enter value for model (existing: Llama3.1-8B-Instruct) (required):
-Enter value for quantization (optional):
-Enter value for torch_seed (optional):
-Enter value for max_seq_len (existing: 4096) (required):
-Enter value for max_batch_size (existing: 1) (required):
-
-Configuring API: memory (meta-reference-faiss)
-
-Configuring API: safety (meta-reference)
-Do you want to configure llama_guard_shield? (y/n): y
-Entering sub-configuration for llama_guard_shield:
-Enter value for model (default: Llama-Guard-3-1B) (required):
-Enter value for excluded_categories (default: []) (required):
-Enter value for disable_input_check (default: False) (required):
-Enter value for disable_output_check (default: False) (required):
-Do you want to configure prompt_guard_shield? (y/n): y
-Entering sub-configuration for prompt_guard_shield:
-Enter value for model (default: Prompt-Guard-86M) (required):
-
-Configuring API: agentic_system (meta-reference)
-Enter value for brave_search_api_key (optional):
-Enter value for bing_search_api_key (optional):
-Enter value for wolfram_api_key (optional):
-
-Configuring API: telemetry (console)
-
-YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml
-```
-
-After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/8b-instruct-run.yaml` with the following contents. You may edit this file to change the settings.
-
-As you can see, we did basic configuration above and configured:
-- inference to run on model `Llama3.1-8B-Instruct` (obtained from `llama model list`)
-- Llama Guard safety shield with model `Llama-Guard-3-1B`
-- Prompt Guard safety shield with model `Prompt-Guard-86M`
-
-For how these configurations are stored as yaml, check out the file printed at the end of the configuration.
-
-Note that all configurations as well as models are stored in `~/.llama`.
-
-
-### Step 3.3 Run
-Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end of the `llama stack configure` step.
-
-```
-llama stack run ~/.llama/builds/conda/tgi-run.yaml
-```
-
-You should see the Llama Stack server start and print the APIs that it is supporting:
-
-```
-$ llama stack run ~/.llama/builds/local/conda/tgi-run.yaml
-
-> initializing model parallel with size 1
-> initializing ddp with size 1
-> initializing pipeline with size 1
-Loaded in 19.28 seconds
-NCCL version 2.20.5+cuda12.4
-Finished model load YES READY
-Serving POST /inference/batch_chat_completion
-Serving POST /inference/batch_completion
-Serving POST /inference/chat_completion
-Serving POST /inference/completion
-Serving POST /safety/run_shield
-Serving POST /agentic_system/memory_bank/attach
-Serving POST /agentic_system/create
-Serving POST /agentic_system/session/create
-Serving POST /agentic_system/turn/create
-Serving POST /agentic_system/delete
-Serving POST /agentic_system/session/delete
-Serving POST /agentic_system/memory_bank/detach
-Serving POST /agentic_system/session/get
-Serving POST /agentic_system/step/get
-Serving POST /agentic_system/turn/get
-Listening on :::5000
-INFO: Started server process [453333]
-INFO: Waiting for application startup.
-INFO: Application startup complete.
-INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-```
-
-> [!NOTE]
-> Configuration is in `~/.llama/builds/local/conda/tgi-run.yaml`. Feel free to increase `max_seq_len`.
-
-> [!IMPORTANT]
-> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
-
-> [!TIP]
-> You might need to use the flag `--disable-ipv6` to disable IPv6 support.
-
-This server is running a Llama model locally.
-
-### Step 3.4 Test with Client
-Once the server is set up, we can test it with a client to see the example outputs.
-```
-cd /path/to/llama-stack
-conda activate <env>  # any environment containing the llama-stack pip package will work
-
-python -m llama_stack.apis.inference.client localhost 5000
-```
-
-This will run the chat completion client and query the distribution’s /inference/chat_completion API.
- -Here is an example output: -``` -User>hello world, write me a 2 sentence poem about the moon -Assistant> Here's a 2-sentence poem about the moon: - -The moon glows softly in the midnight sky, -A beacon of wonder, as it passes by. -``` - -Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by: - -``` -python -m llama_stack.apis.safety.client localhost 5000 -``` - -You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo. diff --git a/docs/getting_started.md b/docs/getting_started.md deleted file mode 100644 index 49c7cd5a0..000000000 --- a/docs/getting_started.md +++ /dev/null @@ -1,230 +0,0 @@ -# Getting Started with Llama Stack - -This guide will walk you though the steps to get started on end-to-end flow for LlamaStack. This guide mainly focuses on getting started with building a LlamaStack distribution, and starting up a LlamaStack server. Please see our [documentations](../README.md) on what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) on examples apps built with Llama Stack. - -## Installation -The `llama` CLI tool helps you setup and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package. - -You have two ways to install this repository: - -1. **Install as a package**: - You can install the repository directly from [PyPI](https://pypi.org/project/llama-stack/) by running the following command: - ```bash - pip install llama-stack - ``` - -2. **Install from source**: - If you prefer to install from the source code, follow these steps: - ```bash - mkdir -p ~/local - cd ~/local - git clone git@github.com:meta-llama/llama-stack.git - - conda create -n stack python=3.10 - conda activate stack - - cd llama-stack - $CONDA_PREFIX/bin/pip install -e . - ``` - -For what you can do with the Llama CLI, please refer to [CLI Reference](./cli_reference.md). - -## Starting Up Llama Stack Server - -You have two ways to start up Llama stack server: - -1. **Starting up server via docker**: - -We provide pre-built Docker image of Llama Stack distribution, which can be found in the following links in the [distributions](../distributions/) folder. - -> [!NOTE] -> For GPU inference, you need to set these environment variables for specifying local directory containing your model checkpoints, and enable GPU inference to start running docker container. -``` -export LLAMA_CHECKPOINT_DIR=~/.llama -``` - -> [!NOTE] -> `~/.llama` should be the path containing downloaded weights of Llama models. - -To download llama models, use -``` -llama download --model-id Llama3.1-8B-Instruct -``` - -To download and start running a pre-built docker container, you may use the following commands: - -``` -cd llama-stack/distributions/meta-reference-gpu -docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml -``` - -> [!TIP] -> Pro Tip: We may use `docker compose up` for starting up a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). You can checkout [these scripts](../distributions/) to help you get started. - - -2. 
**Build->Configure->Run Llama Stack server via conda**: - - You may also build a LlamaStack distribution from scratch, configure it, and start running the distribution. This is useful for developing on LlamaStack. - - **`llama stack build`** - - You'll be prompted to enter build information interactively. - ``` - llama stack build - - > Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack - > Enter the image type you want your distribution to be built with (docker or conda): conda - - Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs. - > Enter the API provider for the inference API: (default=meta-reference): meta-reference - > Enter the API provider for the safety API: (default=meta-reference): meta-reference - > Enter the API provider for the agents API: (default=meta-reference): meta-reference - > Enter the API provider for the memory API: (default=meta-reference): meta-reference - > Enter the API provider for the telemetry API: (default=meta-reference): meta-reference - - > (Optional) Enter a short description for your Llama Stack distribution: - - Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml - You can now run `llama stack configure my-local-stack` - ``` - - **`llama stack configure`** - - Run `llama stack configure ` with the name you have previously defined in `build` step. - ``` - llama stack configure - ``` - - You will be prompted to enter configurations for your Llama Stack - - ``` - $ llama stack configure my-local-stack - - Configuring API `inference`... - === Configuring provider `meta-reference` for API inference... - Enter value for model (default: Llama3.1-8B-Instruct) (required): - Do you want to configure quantization? (y/n): n - Enter value for torch_seed (optional): - Enter value for max_seq_len (default: 4096) (required): - Enter value for max_batch_size (default: 1) (required): - - Configuring API `safety`... - === Configuring provider `meta-reference` for API safety... - Do you want to configure llama_guard_shield? (y/n): n - Do you want to configure prompt_guard_shield? (y/n): n - - Configuring API `agents`... - === Configuring provider `meta-reference` for API agents... - Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite): - - Configuring SqliteKVStoreConfig: - Enter value for namespace (optional): - Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required): - - Configuring API `memory`... - === Configuring provider `meta-reference` for API memory... - > Please enter the supported memory bank type your provider has for memory: vector - - Configuring API `telemetry`... - === Configuring provider `meta-reference` for API telemetry... - - > YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml. - You can now run `llama stack run my-local-stack --port PORT` - ``` - - **`llama stack run`** - - Run `llama stack run ` with the name you have previously defined. - ``` - llama stack run my-local-stack - - ... - > initializing model parallel with size 1 - > initializing ddp with size 1 - > initializing pipeline with size 1 - ... 
- Finished model load YES READY - Serving POST /inference/chat_completion - Serving POST /inference/completion - Serving POST /inference/embeddings - Serving POST /memory_banks/create - Serving DELETE /memory_bank/documents/delete - Serving DELETE /memory_banks/drop - Serving GET /memory_bank/documents/get - Serving GET /memory_banks/get - Serving POST /memory_bank/insert - Serving GET /memory_banks/list - Serving POST /memory_bank/query - Serving POST /memory_bank/update - Serving POST /safety/run_shield - Serving POST /agentic_system/create - Serving POST /agentic_system/session/create - Serving POST /agentic_system/turn/create - Serving POST /agentic_system/delete - Serving POST /agentic_system/session/delete - Serving POST /agentic_system/session/get - Serving POST /agentic_system/step/get - Serving POST /agentic_system/turn/get - Serving GET /telemetry/get_trace - Serving POST /telemetry/log_event - Listening on :::5000 - INFO: Started server process [587053] - INFO: Waiting for application startup. - INFO: Application startup complete. - INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit) - ``` - - -## Testing with client -Once the server is setup, we can test it with a client to see the example outputs. -``` -cd /path/to/llama-stack -conda activate # any environment containing the llama-stack pip package will work - -python -m llama_stack.apis.inference.client localhost 5000 -``` - -This will run the chat completion client and query the distribution’s `/inference/chat_completion` API. - -Here is an example output: -``` -User>hello world, write me a 2 sentence poem about the moon -Assistant> Here's a 2-sentence poem about the moon: - -The moon glows softly in the midnight sky, -A beacon of wonder, as it passes by. -``` - -You may also send a POST request to the server: -``` -curl http://localhost:5000/inference/chat_completion \ --H "Content-Type: application/json" \ --d '{ - "model": "Llama3.1-8B-Instruct", - "messages": [ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Write me a 2 sentence poem about the moon"} - ], - "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512} -}' - -Output: -{'completion_message': {'role': 'assistant', - 'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.', - 'stop_reason': 'out_of_tokens', - 'tool_calls': []}, - 'logprobs': null} - -``` - - -Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by: - -``` -python -m llama_stack.apis.safety.client localhost 5000 -``` - - -Check out our client SDKs for connecting to Llama Stack server in your preferred language, you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) programming languages to quickly build your applications. - -You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo. - - -## Advanced Guides -Please see our [Building a LLama Stack Distribution](./building_distro.md) guide for more details on how to assemble your own Llama Stack Distribution. 
diff --git a/docs/building_distro.md b/docs/source/building_distro.md similarity index 100% rename from docs/building_distro.md rename to docs/source/building_distro.md diff --git a/docs/source/cli_reference.md b/docs/source/cli_reference.md index 81da1a773..39ac99615 100644 --- a/docs/source/cli_reference.md +++ b/docs/source/cli_reference.md @@ -2,12 +2,12 @@ The `llama` CLI tool helps you setup and use the Llama Stack & agentic systems. It should be available on your path after installing the `llama-stack` package. -## Subcommands +### Subcommands 1. `download`: `llama` cli tools supports downloading the model from Meta or Hugging Face. 2. `model`: Lists available models and their properties. -3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this in Step 3 below. +3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](cli_reference.md#step-3-building-and-configuring-llama-stack-distributions). -## Sample Usage +### Sample Usage ``` llama --help @@ -94,7 +94,7 @@ You should see a table like this: To download models, you can use the llama download command. -### Downloading from [Meta](https://llama.meta.com/llama-downloads/) +#### Downloading from [Meta](https://llama.meta.com/llama-downloads/) Here is an example download command to get the 3B-Instruct/11B-Vision-Instruct model. You will need META_URL which can be obtained from [here](https://llama.meta.com/docs/getting_the_models/meta/) @@ -112,7 +112,7 @@ llama download --source meta --model-id Prompt-Guard-86M --meta-url META_URL llama download --source meta --model-id Llama-Guard-3-1B --meta-url META_URL ``` -### Downloading from [Hugging Face](https://huggingface.co/meta-llama) +#### Downloading from [Hugging Face](https://huggingface.co/meta-llama) Essentially, the same commands above work, just replace `--source meta` with `--source huggingface`. @@ -129,7 +129,7 @@ llama download --source huggingface --model-id Prompt-Guard-86M --ignore-pattern > **Tip:** Default for `llama download` is to run with `--ignore-patterns *.safetensors` since we use the `.pth` files in the `original` folder. For Llama Guard and Prompt Guard, however, we need safetensors. Hence, please run with `--ignore-patterns original` so that safetensors are downloaded and `.pth` files are ignored. -### Downloading via Ollama +#### Downloading via Ollama If you're already using ollama, we also have a supported Llama Stack distribution `local-ollama` and you can continue to use ollama for managing model downloads. @@ -215,7 +215,7 @@ You can even run `llama model prompt-format` see all of the templates and their ``` llama model prompt-format -m Llama3.2-3B-Instruct ``` -![alt text](https://github.com/meta-llama/llama-stack/docs/resources/prompt-format.png) +![alt text](resources/prompt-format.png) @@ -229,8 +229,8 @@ You will be shown a Markdown formatted description of the model interface and ho - Please see our [Getting Started](getting_started.md) guide for more details on how to build and start a Llama Stack distribution. ### Step 3.1 Build -In the following steps, imagine we'll be working with a `Llama3.1-8B-Instruct` model. We will name our build `8b-instruct` to help us remember the config. We will start build our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify: -- `name`: the name for our distribution (e.g. `8b-instruct`) +In the following steps, imagine we'll be working with a `Llama3.1-8B-Instruct` model. 
We will name our build `tgi` to help us remember the config. We will start build our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify: +- `name`: the name for our distribution (e.g. `tgi`) - `image_type`: our build image type (`conda | docker`) - `distribution_spec`: our distribution specs for specifying API providers - `description`: a short description of the configurations for the distribution @@ -274,16 +274,16 @@ The following command will allow you to see the available templates and their co llama stack build --list-templates ``` -![alt text](https://github.com/meta-llama/llama-stack/docs/resources/list-templates.png) +![alt text](resources/list-templates.png) You may then pick a template to build your distribution with providers fitted to your liking. ``` -llama stack build --template tgi +llama stack build --template tgi --image-type conda ``` ``` -$ llama stack build --template tgi +$ llama stack build --template tgi --image-type conda ... ... Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml @@ -293,10 +293,10 @@ You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/e #### Building from config file - In addition to templates, you may customize the build to your liking through editing config files and build from config files with the following command. -- The config file will be of contents like the ones in `llama_stack/distributions/templates/`. +- The config file will be of contents like the ones in `llama_stack/templates/`. ``` -$ cat llama_stack/templates/ollama/build.yaml +$ cat build.yaml name: ollama distribution_spec: @@ -311,7 +311,7 @@ image_type: conda ``` ``` -llama stack build --config llama_stack/templates/ollama/build.yaml +llama stack build --config build.yaml ``` #### How to build distribution with Docker image @@ -319,7 +319,7 @@ llama stack build --config llama_stack/templates/ollama/build.yaml To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type. ``` -llama stack build --template local --image-type docker +llama stack build --template tgi --image-type docker ``` Alternatively, you may use a config file and set `image_type` to `docker` in our `-build.yaml` file, and run `llama stack build -build.yaml`. The `-build.yaml` will be of contents like: @@ -354,7 +354,7 @@ Build spec configuration saved at ~/.llama/distributions/docker/docker-local-bui ### Step 3.2 Configure After our distribution is built (either in form of docker or conda environment), we will run the following command to ``` -llama stack configure [ | ] +llama stack configure [ | ] ``` - For `conda` environments: would be the generated build spec saved from Step 1. - For `docker` images downloaded from Dockerhub, you could also use as the argument. @@ -390,10 +390,10 @@ Enter value for wolfram_api_key (optional): Configuring API: telemetry (console) -YAML configuration has been written to ~/.llama/builds/conda/tgi-run.yaml +YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml ``` -After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/tgi-run.yaml` with the following contents. You may edit this file to change the settings. +After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/8b-instruct-run.yaml` with the following contents. You may edit this file to change the settings. 
As you can see, we did basic configuration above and configured: - inference to run on model `Llama3.1-8B-Instruct` (obtained from `llama model list`) @@ -415,7 +415,7 @@ llama stack run ~/.llama/builds/conda/tgi-run.yaml You should see the Llama Stack server start and print the APIs that it is supporting ``` -$ llama stack run ~/.llama/builds/conda/tgi-run.yaml +$ llama stack run ~/.llama/builds/local/conda/tgi-run.yaml > initializing model parallel with size 1 > initializing ddp with size 1 diff --git a/docs/developer_cookbook.md b/docs/source/developer_cookbook.md similarity index 100% rename from docs/developer_cookbook.md rename to docs/source/developer_cookbook.md diff --git a/docs/source/getting_started.md b/docs/source/getting_started.md index b1450cd42..49c7cd5a0 100644 --- a/docs/source/getting_started.md +++ b/docs/source/getting_started.md @@ -1,37 +1,41 @@ -# Getting Started +# Getting Started with Llama Stack -This guide will walk you though the steps to get started on end-to-end flow for LlamaStack. This guide mainly focuses on getting started with building a LlamaStack distribution, and starting up a LlamaStack server. Please see our [documentations](https://github.com/meta-llama/llama-stack/README.md) on what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) on examples apps built with Llama Stack. +This guide will walk you though the steps to get started on end-to-end flow for LlamaStack. This guide mainly focuses on getting started with building a LlamaStack distribution, and starting up a LlamaStack server. Please see our [documentations](../README.md) on what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) on examples apps built with Llama Stack. ## Installation The `llama` CLI tool helps you setup and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package. -You can install this repository as a [package](https://pypi.org/project/llama-stack/) with `pip install llama-stack` +You have two ways to install this repository: -If you want to install from source: +1. **Install as a package**: + You can install the repository directly from [PyPI](https://pypi.org/project/llama-stack/) by running the following command: + ```bash + pip install llama-stack + ``` -```bash -mkdir -p ~/local -cd ~/local -git clone git@github.com:meta-llama/llama-stack.git +2. **Install from source**: + If you prefer to install from the source code, follow these steps: + ```bash + mkdir -p ~/local + cd ~/local + git clone git@github.com:meta-llama/llama-stack.git -conda create -n stack python=3.10 -conda activate stack + conda create -n stack python=3.10 + conda activate stack -cd llama-stack -$CONDA_PREFIX/bin/pip install -e . -``` + cd llama-stack + $CONDA_PREFIX/bin/pip install -e . + ``` For what you can do with the Llama CLI, please refer to [CLI Reference](./cli_reference.md). -## Quick Starting Llama Stack Server +## Starting Up Llama Stack Server -### Starting up server via docker +You have two ways to start up Llama stack server: -We provide 2 pre-built Docker image of Llama Stack distribution, which can be found in the following links. -- [llamastack-local-gpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-gpu/general) - - This is a packaged version with our local meta-reference implementations, where you will be running inference locally with downloaded Llama model checkpoints. 
-- [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general) - - This is a lite version with remote inference where you can hook up to your favourite remote inference framework (e.g. ollama, fireworks, together, tgi) for running inference without GPU. +1. **Starting up server via docker**: + +We provide pre-built Docker image of Llama Stack distribution, which can be found in the following links in the [distributions](../distributions/) folder. > [!NOTE] > For GPU inference, you need to set these environment variables for specifying local directory containing your model checkpoints, and enable GPU inference to start running docker container. @@ -42,362 +46,132 @@ export LLAMA_CHECKPOINT_DIR=~/.llama > [!NOTE] > `~/.llama` should be the path containing downloaded weights of Llama models. +To download llama models, use +``` +llama download --model-id Llama3.1-8B-Instruct +``` To download and start running a pre-built docker container, you may use the following commands: ``` -docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack/llamastack-local-gpu +cd llama-stack/distributions/meta-reference-gpu +docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml ``` > [!TIP] -> Pro Tip: We may use `docker compose up` for starting up a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). You can checkout [these scripts](https://github.com/meta-llama/llama-stack/llama_stack/distribution/docker/README.md) to help you get started. - -### Build->Configure->Run Llama Stack server via conda -You may also build a LlamaStack distribution from scratch, configure it, and start running the distribution. This is useful for developing on LlamaStack. - -**`llama stack build`** -- You'll be prompted to enter build information interactively. -``` -llama stack build - -> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack -> Enter the image type you want your distribution to be built with (docker or conda): conda - - Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs. -> Enter the API provider for the inference API: (default=meta-reference): meta-reference -> Enter the API provider for the safety API: (default=meta-reference): meta-reference -> Enter the API provider for the agents API: (default=meta-reference): meta-reference -> Enter the API provider for the memory API: (default=meta-reference): meta-reference -> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference - - > (Optional) Enter a short description for your Llama Stack distribution: - -Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml -You can now run `llama stack configure my-local-stack` -``` - -**`llama stack configure`** -- Run `llama stack configure ` with the name you have previously defined in `build` step. -``` -llama stack configure -``` -- You will be prompted to enter configurations for your Llama Stack - -``` -$ llama stack configure my-local-stack - -Configuring API `inference`... -=== Configuring provider `meta-reference` for API inference... -Enter value for model (default: Llama3.1-8B-Instruct) (required): -Do you want to configure quantization? 
(y/n): n -Enter value for torch_seed (optional): -Enter value for max_seq_len (default: 4096) (required): -Enter value for max_batch_size (default: 1) (required): - -Configuring API `safety`... -=== Configuring provider `meta-reference` for API safety... -Do you want to configure llama_guard_shield? (y/n): n -Do you want to configure prompt_guard_shield? (y/n): n - -Configuring API `agents`... -=== Configuring provider `meta-reference` for API agents... -Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite): - -Configuring SqliteKVStoreConfig: -Enter value for namespace (optional): -Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required): - -Configuring API `memory`... -=== Configuring provider `meta-reference` for API memory... -> Please enter the supported memory bank type your provider has for memory: vector - -Configuring API `telemetry`... -=== Configuring provider `meta-reference` for API telemetry... - -> YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml. -You can now run `llama stack run my-local-stack --port PORT` -``` - -**`llama stack run`** -- Run `llama stack run ` with the name you have previously defined. -``` -llama stack run my-local-stack - -... -> initializing model parallel with size 1 -> initializing ddp with size 1 -> initializing pipeline with size 1 -... -Finished model load YES READY -Serving POST /inference/chat_completion -Serving POST /inference/completion -Serving POST /inference/embeddings -Serving POST /memory_banks/create -Serving DELETE /memory_bank/documents/delete -Serving DELETE /memory_banks/drop -Serving GET /memory_bank/documents/get -Serving GET /memory_banks/get -Serving POST /memory_bank/insert -Serving GET /memory_banks/list -Serving POST /memory_bank/query -Serving POST /memory_bank/update -Serving POST /safety/run_shield -Serving POST /agentic_system/create -Serving POST /agentic_system/session/create -Serving POST /agentic_system/turn/create -Serving POST /agentic_system/delete -Serving POST /agentic_system/session/delete -Serving POST /agentic_system/session/get -Serving POST /agentic_system/step/get -Serving POST /agentic_system/turn/get -Serving GET /telemetry/get_trace -Serving POST /telemetry/log_event -Listening on :::5000 -INFO: Started server process [587053] -INFO: Waiting for application startup. -INFO: Application startup complete. -INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit) -``` - -### End-to-end flow of building, configuring, running, and testing a Distribution - -#### Step 1. Build -In the following steps, imagine we'll be working with a `Meta-Llama3.1-8B-Instruct` model. We will name our build `8b-instruct` to help us remember the config. We will start build our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify: -- `name`: the name for our distribution (e.g. `8b-instruct`) -- `image_type`: our build image type (`conda | docker`) -- `distribution_spec`: our distribution specs for specifying API providers - - `description`: a short description of the configurations for the distribution - - `providers`: specifies the underlying implementation for serving each API endpoint - - `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of Docker image or Conda environment. - - -At the end of build command, we will generate `-build.yaml` file storing the build configurations. 
- -After this step is complete, a file named `-build.yaml` will be generated and saved at the output file path specified at the end of the command. - -#### Building from scratch -- For a new user, we could start off with running `llama stack build` which will allow you to a interactively enter wizard where you will be prompted to enter build configurations. -``` -llama stack build -``` - -Running the command above will allow you to fill in the configuration to build your Llama Stack distribution, you will see the following outputs. - -``` -> Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): 8b-instruct -> Enter the image type you want your distribution to be built with (docker or conda): conda - - Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs. -> Enter the API provider for the inference API: (default=meta-reference): meta-reference -> Enter the API provider for the safety API: (default=meta-reference): meta-reference -> Enter the API provider for the agents API: (default=meta-reference): meta-reference -> Enter the API provider for the memory API: (default=meta-reference): meta-reference -> Enter the API provider for the telemetry API: (default=meta-reference): meta-reference - - > (Optional) Enter a short description for your Llama Stack distribution: - -Build spec configuration saved at ~/.conda/envs/llamastack-my-local-llama-stack/8b-instruct-build.yaml -``` - -**Ollama (optional)** - -If you plan to use Ollama for inference, you'll need to install the server [via these instructions](https://ollama.com/download). - - -#### Building from templates -- To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers. - -The following command will allow you to see the available templates and their corresponding providers. -``` -llama stack build --list-templates -``` - -![alt text](https://github.com/meta-llama/llama-stack/docs/resources/list-templates.png) - -You may then pick a template to build your distribution with providers fitted to your liking. - -``` -llama stack build --template tgi -``` - -``` -$ llama stack build --template tgi -... -... -Build spec configuration saved at ~/.conda/envs/llamastack-tgi/tgi-build.yaml -You may now run `llama stack configure tgi` or `llama stack configure ~/.conda/envs/llamastack-tgi/tgi-build.yaml` -``` - -#### Building from config file -- In addition to templates, you may customize the build to your liking through editing config files and build from config files with the following command. - -- The config file will be of contents like the ones in `llama_stack/distributions/templates/`. - -``` -$ cat llama_stack/templates/ollama/build.yaml - -name: ollama -distribution_spec: - description: Like local, but use ollama for running LLM inference - providers: - inference: remote::ollama - memory: meta-reference - safety: meta-reference - agents: meta-reference - telemetry: meta-reference -image_type: conda -``` - -``` -llama stack build --config llama_stack/templates/ollama/build.yaml -``` - -#### How to build distribution with Docker image - -> [!TIP] -> Podman is supported as an alternative to Docker. Set `DOCKER_BINARY` to `podman` in your environment to use Podman. - -To build a docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type. 
- -``` -llama stack build --template tgi --image-type docker -``` - -Alternatively, you may use a config file and set `image_type` to `docker` in our `-build.yaml` file, and run `llama stack build -build.yaml`. The `-build.yaml` will be of contents like: - -``` -name: local-docker-example -distribution_spec: - description: Use code from `llama_stack` itself to serve all llama stack APIs - docker_image: null - providers: - inference: meta-reference - memory: meta-reference-faiss - safety: meta-reference - agentic_system: meta-reference - telemetry: console -image_type: docker -``` - -The following command allows you to build a Docker image with the name `` -``` -llama stack build --config -build.yaml - -Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/DockerfileFROM python:3.10-slim -WORKDIR /app -... -... -You can run it with: podman run -p 8000:8000 llamastack-docker-local -Build spec configuration saved at ~/.llama/distributions/docker/docker-local-build.yaml -``` - - -### Step 2. Configure -After our distribution is built (either in form of docker or conda environment), we will run the following command to -``` -llama stack configure [ | ] -``` -- For `conda` environments: would be the generated build spec saved from Step 1. -- For `docker` images downloaded from Dockerhub, you could also use as the argument. - - Run `docker images` to check list of available images on your machine. - -``` -$ llama stack configure tgi - -Configuring API: inference (meta-reference) -Enter value for model (existing: Meta-Llama3.1-8B-Instruct) (required): -Enter value for quantization (optional): -Enter value for torch_seed (optional): -Enter value for max_seq_len (existing: 4096) (required): -Enter value for max_batch_size (existing: 1) (required): - -Configuring API: memory (meta-reference-faiss) - -Configuring API: safety (meta-reference) -Do you want to configure llama_guard_shield? (y/n): y -Entering sub-configuration for llama_guard_shield: -Enter value for model (default: Llama-Guard-3-1B) (required): -Enter value for excluded_categories (default: []) (required): -Enter value for disable_input_check (default: False) (required): -Enter value for disable_output_check (default: False) (required): -Do you want to configure prompt_guard_shield? (y/n): y -Entering sub-configuration for prompt_guard_shield: -Enter value for model (default: Prompt-Guard-86M) (required): - -Configuring API: agentic_system (meta-reference) -Enter value for brave_search_api_key (optional): -Enter value for bing_search_api_key (optional): -Enter value for wolfram_api_key (optional): - -Configuring API: telemetry (console) - -YAML configuration has been written to ~/.llama/builds/conda/tgi-run.yaml -``` - -After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/tgi-run.yaml` with the following contents. You may edit this file to change the settings. - -As you can see, we did basic configuration above and configured: -- inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`) -- Llama Guard safety shield with model `Llama-Guard-3-1B` -- Prompt Guard safety shield with model `Prompt-Guard-86M` - -For how these configurations are stored as yaml, checkout the file printed at the end of the configuration. - -Note that all configurations as well as models are stored in `~/.llama` - - -### Step 3. Run -Now, let's start the Llama Stack Distribution Server. 
-
-
-### Step 3. Run
-Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end of the `llama stack configure` step.
-
-```
-llama stack run tgi
-```
-
-You should see the Llama Stack server start and print the APIs that it supports:
-
-```
-$ llama stack run tgi
-
-> initializing model parallel with size 1
-> initializing ddp with size 1
-> initializing pipeline with size 1
-Loaded in 19.28 seconds
-NCCL version 2.20.5+cuda12.4
-Finished model load YES READY
-Serving POST /inference/batch_chat_completion
-Serving POST /inference/batch_completion
-Serving POST /inference/chat_completion
-Serving POST /inference/completion
-Serving POST /safety/run_shield
-Serving POST /agentic_system/memory_bank/attach
-Serving POST /agentic_system/create
-Serving POST /agentic_system/session/create
-Serving POST /agentic_system/turn/create
-Serving POST /agentic_system/delete
-Serving POST /agentic_system/session/delete
-Serving POST /agentic_system/memory_bank/detach
-Serving POST /agentic_system/session/get
-Serving POST /agentic_system/step/get
-Serving POST /agentic_system/turn/get
-Listening on :::5000
-INFO:     Started server process [453333]
-INFO:     Waiting for application startup.
-INFO:     Application startup complete.
-INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-```
-
-> [!NOTE]
-> Configuration is in `~/.llama/builds/local/conda/8b-instruct-run.yaml`. Feel free to increase `max_seq_len`.
-
-> [!IMPORTANT]
-> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
-
-> [!TIP]
-> You might need to use the flag `--disable-ipv6` to disable IPv6 support.
-
-This server is running a Llama model locally.
-
-### Step 4. Test with Client
+> Pro Tip: You may use `docker compose up` to start a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). You can check out [these scripts](../distributions/) to help you get started.
+
+
+2. **Build->Configure->Run Llama Stack server via conda**:
+
+   You may also build a Llama Stack distribution from scratch, configure it, and start running it. This is useful for developing on Llama Stack.
+
+   **`llama stack build`**
+   - You'll be prompted to enter build information interactively.
+   ```
+   llama stack build
+
+   > Enter an unique name for identifying your Llama Stack build distribution (e.g. my-local-stack): my-local-stack
+   > Enter the image type you want your distribution to be built with (docker or conda): conda
+
+    Llama Stack is composed of several APIs working together. Let's configure the providers (implementations) you want to use for these APIs.
+   > Enter the API provider for the inference API: (default=meta-reference): meta-reference
+   > Enter the API provider for the safety API: (default=meta-reference): meta-reference
+   > Enter the API provider for the agents API: (default=meta-reference): meta-reference
+   > Enter the API provider for the memory API: (default=meta-reference): meta-reference
+   > Enter the API provider for the telemetry API: (default=meta-reference): meta-reference
+
+    > (Optional) Enter a short description for your Llama Stack distribution:
+
+   Build spec configuration saved at ~/.conda/envs/llamastack-my-local-stack/my-local-stack-build.yaml
+   You can now run `llama stack configure my-local-stack`
+   ```
+
+   **`llama stack configure`**
+   - Run `llama stack configure <name>` with the name you have previously defined in the `build` step.
+   ```
+   llama stack configure <name>
+   ```
+   - You will be prompted to enter configurations for your Llama Stack distribution.
+
+   ```
+   $ llama stack configure my-local-stack
+
+   Configuring API `inference`...
+   === Configuring provider `meta-reference` for API inference...
+   Enter value for model (default: Llama3.1-8B-Instruct) (required):
+   Do you want to configure quantization? (y/n): n
+   Enter value for torch_seed (optional):
+   Enter value for max_seq_len (default: 4096) (required):
+   Enter value for max_batch_size (default: 1) (required):
+
+   Configuring API `safety`...
+   === Configuring provider `meta-reference` for API safety...
+   Do you want to configure llama_guard_shield? (y/n): n
+   Do you want to configure prompt_guard_shield? (y/n): n
+
+   Configuring API `agents`...
+   === Configuring provider `meta-reference` for API agents...
+   Enter `type` for persistence_store (options: redis, sqlite, postgres) (default: sqlite):
+
+   Configuring SqliteKVStoreConfig:
+   Enter value for namespace (optional):
+   Enter value for db_path (default: /home/xiyan/.llama/runtime/kvstore.db) (required):
+
+   Configuring API `memory`...
+   === Configuring provider `meta-reference` for API memory...
+   > Please enter the supported memory bank type your provider has for memory: vector
+
+   Configuring API `telemetry`...
+   === Configuring provider `meta-reference` for API telemetry...
+
+   > YAML configuration has been written to ~/.llama/builds/conda/my-local-stack-run.yaml.
+   You can now run `llama stack run my-local-stack --port PORT`
+   ```
+
+   **`llama stack run`**
+   - Run `llama stack run <name>` with the name you have previously defined.
+   ```
+   llama stack run my-local-stack
+
+   ...
+   > initializing model parallel with size 1
+   > initializing ddp with size 1
+   > initializing pipeline with size 1
+   ...
+   Finished model load YES READY
+   Serving POST /inference/chat_completion
+   Serving POST /inference/completion
+   Serving POST /inference/embeddings
+   Serving POST /memory_banks/create
+   Serving DELETE /memory_bank/documents/delete
+   Serving DELETE /memory_banks/drop
+   Serving GET /memory_bank/documents/get
+   Serving GET /memory_banks/get
+   Serving POST /memory_bank/insert
+   Serving GET /memory_banks/list
+   Serving POST /memory_bank/query
+   Serving POST /memory_bank/update
+   Serving POST /safety/run_shield
+   Serving POST /agentic_system/create
+   Serving POST /agentic_system/session/create
+   Serving POST /agentic_system/turn/create
+   Serving POST /agentic_system/delete
+   Serving POST /agentic_system/session/delete
+   Serving POST /agentic_system/session/get
+   Serving POST /agentic_system/step/get
+   Serving POST /agentic_system/turn/get
+   Serving GET /telemetry/get_trace
+   Serving POST /telemetry/log_event
+   Listening on :::5000
+   INFO:     Started server process [587053]
+   INFO:     Waiting for application startup.
+   INFO:     Application startup complete.
+   INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
+   ```
+
+
+## Testing with client
 Once the server is set up, we can test it with a client to see the example outputs.
 ```
 cd /path/to/llama-stack
@@ -406,7 +180,7 @@
 conda activate <env>  # any environment containing the llama-stack pip package will work
 python -m llama_stack.apis.inference.client localhost 5000
 ```
-This will run the chat completion client and query the distribution’s /inference/chat_completion API.
+This will run the chat completion client and query the distribution’s `/inference/chat_completion` API.
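+
+If you would rather call the API from your own code than use the bundled client, the short Python sketch below sends the same request with the `requests` package (assumed to be installed), against a server on `localhost:5000` and the model name used in this walkthrough.
+
+```
+# Sketch: send a chat completion request to a running Llama Stack server.
+# Assumes `requests` is installed and the server from `llama stack run` is listening on localhost:5000.
+import requests
+
+response = requests.post(
+    "http://localhost:5000/inference/chat_completion",
+    json={
+        "model": "Llama3.1-8B-Instruct",
+        "messages": [
+            {"role": "system", "content": "You are a helpful assistant."},
+            {"role": "user", "content": "Write me a 2 sentence poem about the moon"},
+        ],
+        "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
+    },
+    timeout=60,
+)
+response.raise_for_status()
+# Print just the generated text from the completion message.
+print(response.json()["completion_message"]["content"])
+```
+
+The request body mirrors the `curl` example further below, and the completion you get back should look like the example output that follows.
+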
 Here is an example output:
 ```
@@ -417,6 +191,29 @@
 The moon glows softly in the midnight sky,
 A beacon of wonder, as it passes by.
 ```
+You may also send a POST request to the server:
+```
+curl http://localhost:5000/inference/chat_completion \
+-H "Content-Type: application/json" \
+-d '{
+    "model": "Llama3.1-8B-Instruct",
+    "messages": [
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "Write me a 2 sentence poem about the moon"}
+    ],
+    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
+}'
+
+Output:
+{'completion_message': {'role': 'assistant',
+  'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.',
+  'stop_reason': 'out_of_tokens',
+  'tool_calls': []},
+ 'logprobs': null}
+
+```
+
+
 Similarly, you can test safety (if you configured llama-guard and/or prompt-guard shields) by:
 ```
@@ -427,3 +224,7 @@
 python -m llama_stack.apis.safety.client localhost 5000
 ```
 Check out our client SDKs for connecting to the Llama Stack server in your preferred language; you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) programming languages to quickly build your applications.
 You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
+
+
+## Advanced Guides
+Please see our [Building a Llama Stack Distribution](./building_distro.md) guide for more details on how to assemble your own Llama Stack Distribution.
diff --git a/docs/new_api_provider.md b/docs/source/providers_dev.md
similarity index 100%
rename from docs/new_api_provider.md
rename to docs/source/providers_dev.md