diff --git a/docs/cli_reference.md b/docs/cli_reference.md
index 03b98a57d..41402d73e 100644
--- a/docs/cli_reference.md
+++ b/docs/cli_reference.md
@@ -2,10 +2,11 @@
 The `llama` CLI tool helps you setup and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
-### Subcommands
-1. `download`: `llama` cli tools supports downloading the model from Meta or HuggingFace.
+### Subcommands
+1. `download`: Supports downloading models from Meta or Hugging Face.
 2. `model`: Lists available models and their properties.
-3. `distribution`: A distribution is a set of REST APIs, this command allows you to manage (list, install, create, configure, start) distributions. You can read more about this [here](https://github.com/meta-llama/llama-stack/blob/main/docs/cli_reference.md#step-3-installing-and-configuring-distributions).
+3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](https://github.com/meta-llama/llama-stack/blob/main/docs/cli_reference.md#step-3-installing-and-configuring-distributions).
+4. `api`: Allows you to build and run individual API providers (pieces) from the Llama Stack.
 ### Sample Usage
@@ -13,7 +14,7 @@ The `llama` CLI tool helps you setup and use the Llama toolchain & agentic syste
 llama --help
 ```
-usage: llama [-h] {download,model,distribution} ...
+usage: llama [-h] {download,model,stack,api} ...
 Welcome to the Llama CLI
@@ -21,7 +22,7 @@ options:
   -h, --help            show this help message and exit
 subcommands:
-  {download,model,distribution}
+  {download,model,stack,api}
 ## Step 1. Get the models
@@ -101,9 +102,9 @@ The `llama model` command helps you explore the model’s interface.
 ### 2.1 Subcommands
 1. `download`: Download the model from different sources. (meta, huggingface)
-2. `list`: Lists all the models available for download with hardware requirements to deploy the models.
+2. `list`: Lists all the models available for download, along with the hardware requirements for deploying them.
 3. `template`:
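+For example, to see which models are available for download and what hardware they need, you would run something like the following (a representative invocation; check `llama model --help` for the exact flags and output):
+
+```
+llama model list
+```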
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
-| Spec ID      | ProviderSpecs                               | Description                                                          |
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
-| local        | {                                           | Use code from `llama_toolchain` itself to serve all llama stack APIs |
-|              |   "inference": "meta-reference",            |                                                                      |
-|              |   "safety": "meta-reference",               |                                                                      |
-|              |   "agentic_system": "meta-reference"        |                                                                      |
-|              | }                                           |                                                                      |
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
-| remote       | {                                           | Point to remote services for all llama stack APIs                    |
-|              |   "inference": "inference-remote",          |                                                                      |
-|              |   "safety": "safety-remote",                |                                                                      |
-|              |   "agentic_system": "agentic_system-remote" |                                                                      |
-|              | }                                           |                                                                      |
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
-| local-ollama | {                                           | Like local, but use ollama for running LLM inference                 |
-|              |   "inference": "meta-ollama",               |                                                                      |
-|              |   "safety": "meta-reference",               |                                                                      |
-|              |   "agentic_system": "meta-reference"        |                                                                      |
-|              | }                                           |                                                                      |
-+--------------+---------------------------------------------+----------------------------------------------------------------------+
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| Distribution ID                | Providers                             | Description                                                          |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| local                          | {                                     | Use code from `llama_toolchain` itself to serve all llama stack APIs |
+|                                |   "inference": "meta-reference",      |                                                                      |
+|                                |   "memory": "meta-reference-faiss",   |                                                                      |
+|                                |   "safety": "meta-reference",         |                                                                      |
+|                                |   "agentic_system": "meta-reference"  |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| remote                         | {                                     | Point to remote services for all llama stack APIs                    |
+|                                |   "inference": "remote",              |                                                                      |
+|                                |   "safety": "remote",                 |                                                                      |
+|                                |   "agentic_system": "remote",         |                                                                      |
+|                                |   "memory": "remote"                  |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| local-ollama                   | {                                     | Like local, but use ollama for running LLM inference                 |
+|                                |   "inference": "remote::ollama",      |                                                                      |
+|                                |   "safety": "meta-reference",         |                                                                      |
+|                                |   "agentic_system": "meta-reference", |                                                                      |
+|                                |   "memory": "meta-reference-faiss"    |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| local-plus-fireworks-inference | {                                     | Use Fireworks.ai for running LLM inference                           |
+|                                |   "inference": "remote::fireworks",   |                                                                      |
+|                                |   "safety": "meta-reference",         |                                                                      |
+|                                |   "agentic_system": "meta-reference", |                                                                      |
+|                                |   "memory": "meta-reference-faiss"    |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
+| local-plus-together-inference  | {                                     | Use Together.ai for running LLM inference                            |
+|                                |   "inference": "remote::together",    |                                                                      |
+|                                |   "safety": "meta-reference",         |                                                                      |
+|                                |   "agentic_system": "meta-reference", |                                                                      |
+|                                |   "memory": "meta-reference-faiss"    |                                                                      |
+|                                | }                                     |                                                                      |
++--------------------------------+---------------------------------------+----------------------------------------------------------------------+
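+The table above is the sort of listing you can pull up yourself. Assuming the `list-distributions` subcommand referenced below is invoked via `llama stack` (check `llama stack --help` if your version differs), that would be:
+
+```
+llama stack list-distributions
+```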
-As you can see above, each “spec” details the “providers” that make up that spec. For eg. The `local` spec uses the “meta-reference” provider for inference while the `local-ollama` spec relies on a different provider ( ollama ) for inference.
+As you can see above, each “distribution” details the “providers” it is composed of. For example, `local` uses the “meta-reference” provider for inference, while `local-ollama` relies on a different provider (Ollama) for inference. Similarly, you can use Fireworks.ai or Together.ai for running inference.
-Lets install the fully local implementation of the llama-stack – named `local` above.
+To install a distribution, we run a simple command providing two inputs:
+- The **Distribution ID** of the distribution that we want to install (as obtained from the `list-distributions` command)
+- A **Name** for the specific build and configuration of this distribution.
-To install a distro, we run a simple command providing 2 inputs –
-- **Spec Id** of the distribution that we want to install ( as obtained from the list command )
-- A **Name** by which this installation will be known locally.
+Let's imagine you are working with an 8B-Instruct model. The following command will build a package (in the form of a Conda environment) _and_ configure it. As part of the configuration, you will be asked for some inputs (model_id, max_seq_len, etc.).
 ```
-llama distribution install --spec local --name local_llama_8b
+llama stack build local --build-name llama-8b
 ```
-This will create a new conda environment (name can be passed optionally) and install dependencies (via pip) as required by the distro.
-
-Once it runs successfully , you should see some outputs in the form
+Once it runs successfully, you should see some output of the form:
 ```
-llama distribution install --spec local --name local_llama_8b
-```
+$ llama stack build local --build-name llama-8b
+....
+....
 Successfully installed cfgv-3.4.0 distlib-0.3.8 identify-2.6.0 libcst-1.4.0 llama_toolchain-0.0.2 moreorless-0.4.0 nodeenv-1.9.1 pre-commit-3.8.0 stdlibs-2024.5.15 toml-0.10.2 tomlkit-0.13.0 trailrunner-1.4.0 ufmt-2.7.0 usort-1.0.8 virtualenv-20.26.3
-Distribution `local_llama_8b` (with spec local) has been installed successfully!
-
+Successfully setup conda environment. Configuring build...
-Next step is to configure the distribution that you just installed. We provide a simple CLI tool to enable simple configuration.
-This command will walk you through the configuration process.
-It will ask for some details like model name, paths to models, etc.
+...
+...
-**NOTE**: You will have to download the models if not done already. Follow instructions here on how to download using the llama cli
-```
-llama distribution configure --name local_llama_8b
+YAML configuration has been written to ~/.llama/builds/stack/env-local-llama-8b.yaml
 ```
-Here is an example output of how the cli will guide you to fill the configuration:
-
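+If you are curious about what the build produced, everything lives under `~/.llama`; for example, you can inspect the generated configuration (an illustrative check using the path printed above):
+
+```
+cat ~/.llama/builds/stack/env-local-llama-8b.yaml
+```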
-Configuring API surface: inference
+You can re-configure this distribution by running:
+```
+llama stack configure local --build-name llama-8b
+```
+
+Here is an example run of how the CLI will guide you to fill the configuration:
+```
+$ llama stack configure local --build-name llama-8b
+
+Configuring API: inference (meta-reference)
 Enter value for model (required): Meta-Llama3.1-8B-Instruct
 Enter value for quantization (optional):
 Enter value for torch_seed (optional):
 Enter value for max_seq_len (required): 4096
 Enter value for max_batch_size (default: 1): 1
-Configuring API surface: safety
-Do you want to configure llama_guard_shield? (y/n): n
-Do you want to configure prompt_guard_shield? (y/n): n
-Configuring API surface: agentic_system
+Configuring API: safety (meta-reference)
+Do you want to configure llama_guard_shield? (y/n): y
+Entering sub-configuration for llama_guard_shield:
+Enter value for model (required): Llama-Guard-3-8B
+Enter value for excluded_categories (required): []
+Enter value for disable_input_check (default: False):
+Enter value for disable_output_check (default: False):
+Do you want to configure prompt_guard_shield? (y/n): y
+Entering sub-configuration for prompt_guard_shield:
+Enter value for model (required): Prompt-Guard-86M
+...
+...
+YAML configuration has been written to ~/.llama/builds/stack/env-local-llama-8b.yaml
+```
-YAML configuration has been written to ~/.llama/distributions/local0/config.yaml
-
-
-As you can see, we did basic configuration above and configured inference to run on model Meta-Llama3.1-8B-Instruct ( obtained from the llama model list command ).
-For this initial setup we did not set up safety.
+As you can see, we did basic configuration above and configured:
+- inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`)
+- Llama Guard safety shield with model `Llama-Guard-3-8B`
+- Prompt Guard safety shield with model `Prompt-Guard-86M`
 For how these configurations are stored as yaml, checkout the file printed at the end of the configuration.
-## Step 4: Starting a Distribution and Testing it
+Note that all configurations, as well as models, are stored in `~/.llama`.
-Now let’s start the distribution using the cli.
-```
-llama distribution start --name local_llama_8b --port 5000
-```
-You should see the distribution start and print the APIs that it is supporting:
+## Step 4: Starting a Llama Stack Distribution and Testing it
+
+Now let’s start the Llama Stack server.
+
+You need the YAML configuration file that was written out at the end of the `llama stack build` step.
+
+```
+llama stack start ~/.llama/builds/stack/env-local-llama-8b.yaml --port 5000
+```
+You should see the Stack server start and print the APIs that it is supporting:
+
+```
+$ llama stack start ~/.llama/builds/stack/env-local-llama-8b.yaml --port 5000
-
 > initializing model parallel with size 1
 > initializing ddp with size 1
 > initializing pipeline with size 1
@@ -376,15 +411,23 @@ INFO: Started server process [453333]
 INFO: Waiting for application startup.
 INFO: Application startup complete.
 INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-
-
-Lets test with a client
-
 ```
-cd /path/to/llama-toolchain
-conda activate
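+# Once the server is up, you can smoke-test it from another terminal.
+# The route and payload below are assumptions for the inference API; check the
+# list of APIs the server prints at startup for the exact paths.
+curl http://localhost:5000/inference/chat_completion \
+  -H "Content-Type: application/json" \
+  -d '{
+        "model": "Meta-Llama3.1-8B-Instruct",
+        "messages": [{"role": "user", "content": "Write me a haiku about llamas"}],
+        "stream": false
+      }'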