Updated README phew

Ashwin Bharambe 2024-08-28 17:34:23 -07:00
parent 3063329dad
commit 896f057b76
2 changed files with 125 additions and 82 deletions


@ -2,10 +2,11 @@
The `llama` CLI tool helps you setup and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package. The `llama` CLI tool helps you setup and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
### Subcommands ### Subcommands
1. `download`: `llama` cli tools supports downloading the model from Meta or HuggingFace. 1. `download`: `llama` cli tools supports downloading the model from Meta or HuggingFace.
2. `model`: Lists available models and their properties. 2. `model`: Lists available models and their properties.
3. `distribution`: A distribution is a set of REST APIs, this command allows you to manage (list, install, create, configure, start) distributions. You can read more about this [here](https://github.com/meta-llama/llama-stack/blob/main/docs/cli_reference.md#step-3-installing-and-configuring-distributions). 3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](https://github.com/meta-llama/llama-stack/blob/main/docs/cli_reference.md#step-3-installing-and-configuring-distributions).
4. `api`: Allows you to build and run individual API providers (pieces) from the Llama Stack.
### Sample Usage ### Sample Usage
@ -13,7 +14,7 @@ The `llama` CLI tool helps you setup and use the Llama toolchain & agentic syste
llama --help llama --help
``` ```
<pre style="font-family: monospace;"> <pre style="font-family: monospace;">
usage: llama [-h] {download,model,distribution} ... usage: llama [-h] {download,model,stack,api} ...
Welcome to the Llama CLI Welcome to the Llama CLI
@ -21,7 +22,7 @@ options:
-h, --help show this help message and exit -h, --help show this help message and exit
subcommands: subcommands:
{download,model,distribution} {download,model,stack,api}
</pre> </pre>
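The usage line in this hunk has the shape that Python's `argparse` prints for a parser with subcommands. The following is a minimal sketch of such a parser, assuming only the four subcommand names shown in the updated usage text. The `build_parser` helper and the help strings are illustrative and are not taken from `llama_toolchain`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with the same top-level shape as the `llama` CLI."""
    parser = argparse.ArgumentParser(
        prog="llama",
        description="Welcome to the Llama CLI",
    )
    subparsers = parser.add_subparsers(title="subcommands", dest="command")
    # Subcommand names come from the usage text above.
    # The help strings are placeholders (assumption, not the real CLI text).
    for name in ("download", "model", "stack", "api"):
        subparsers.add_parser(name, help=f"hypothetical `{name}` subcommand")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["stack"])
    print(args.command)
```

The real CLI registers each subcommand through its own class (see the `LlamaCLIParser` hunk later in this commit), so the loop here only stands in for those registrations.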
## Step 1. Get the models ## Step 1. Get the models
@ -101,9 +102,9 @@ The `llama model` command helps you explore the models interface.
### 2.1 Subcommands ### 2.1 Subcommands
1. `download`: Download the model from different sources. (meta, huggingface) 1. `download`: Download the model from different sources. (meta, huggingface)
2. `list`: Lists all the models available for download with hardware requirements to deploy the models. 2. `list`: Lists all the models available for download with hardware requirements to deploy the models.
3. `template`: <TODO: What is a template?> 3. `template`: <TODO: What is a template?>
4. `describe`: Describes all the properties of the model. 4. `describe`: Describes all the properties of the model.
### 2.2 Sample Usage ### 2.2 Sample Usage
@ -236,11 +237,13 @@ These commands can help understand the model interface and how prompts / message
**NOTE**: Outputs in terminal are color printed to show special tokens. **NOTE**: Outputs in terminal are color printed to show special tokens.
## Step 3: Installing and Configuring Distributions ## Step 3: Building, Configuring and Running Llama Stack servers
An agentic app has several components including model inference, tool execution and system safety shields. Running all these components is made simpler (we hope!) with Llama Stack Distributions. An agentic app has several components including model inference, tool execution and system safety shields. Running all these components is made simpler (we hope!) with Llama Stack Distributions.
A Distribution is simply a collection of REST API providers that are part of the Llama stack. As an example, by running a simple command `llama distribution start`, you can bring up a server serving the following endpoints, among others: The Llama Stack is a collection of REST APIs. An API is _implemented_ by Provider. An assembly of Providers together provides the implementation for the Stack -- this package is called a Distribution.
As an example, by running a simple command `llama stack start <YAML>`, you can bring up a server serving the following endpoints, among others:
``` ```
POST /inference/chat_completion POST /inference/chat_completion
POST /inference/completion POST /inference/completion
@ -253,103 +256,135 @@ POST /agentic_system/delete
The agentic app can now simply point to this server to execute all its needed components. The agentic app can now simply point to this server to execute all its needed components.
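As a rough sketch of what "pointing to this server" could look like from client code, the snippet below only builds the full URLs for the two inference endpoints listed in this hunk. The base URL `http://localhost:5000` matches the port used later in this README. The `ENDPOINTS` table and `endpoint_url` helper are hypothetical names, and no request is sent because the request body schema is not shown in this diff.

```python
from urllib.parse import urljoin

# Paths copied from the endpoint list above; the dictionary keys are illustrative.
ENDPOINTS = {
    "chat_completion": "/inference/chat_completion",
    "completion": "/inference/completion",
}


def endpoint_url(base_url: str, name: str) -> str:
    """Join the server base URL with one of the known endpoint paths."""
    return urljoin(base_url, ENDPOINTS[name])


if __name__ == "__main__":
    # Assumes a server started locally on port 5000, as in Step 4.
    print(endpoint_url("http://localhost:5000", "chat_completion"))
```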
A distributions behavior can be configured by defining a specification or “spec”. This specification lays out the different API “Providers” that constitute this distribution. Lets build, configure and start a Llama Stack server specified via a "Distribution ID" to understand more !
Lets install, configure and start a distribution to understand more ! Lets start with listing available distributions:
Lets start with listing available distributions
``` ```
llama distribution list llama stack list-distributions
``` ```
<pre style="font-family: monospace;"> <pre style="font-family: monospace;">
+--------------+---------------------------------------------+----------------------------------------------------------------------+ +--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| Spec ID | ProviderSpecs | Description | | Distribution ID | Providers | Description |
+--------------+---------------------------------------------+----------------------------------------------------------------------+ +--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| local | { | Use code from `llama_toolchain` itself to serve all llama stack APIs | | local | { | Use code from `llama_toolchain` itself to serve all llama stack APIs |
| | "inference": "meta-reference", | | | | "inference": "meta-reference", | |
| | "safety": "meta-reference", | | | | "memory": "meta-reference-faiss", | |
| | "agentic_system": "meta-reference" | | | | "safety": "meta-reference", | |
| | } | | | | "agentic_system": "meta-reference" | |
+--------------+---------------------------------------------+----------------------------------------------------------------------+ | | } | |
| remote | { | Point to remote services for all llama stack APIs | +--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| | "inference": "inference-remote", | | | remote | { | Point to remote services for all llama stack APIs |
| | "safety": "safety-remote", | | | | "inference": "remote", | |
| | "agentic_system": "agentic_system-remote" | | | | "safety": "remote", | |
| | } | | | | "agentic_system": "remote", | |
+--------------+---------------------------------------------+----------------------------------------------------------------------+ | | "memory": "remote" | |
| local-ollama | { | Like local, but use ollama for running LLM inference | | | } | |
| | "inference": "meta-ollama", | | +--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| | "safety": "meta-reference", | | | local-ollama | { | Like local, but use ollama for running LLM inference |
| | "agentic_system": "meta-reference" | | | | "inference": "remote::ollama", | |
| | } | | | | "safety": "meta-reference", | |
+--------------+---------------------------------------------+----------------------------------------------------------------------+ | | "agentic_system": "meta-reference", | |
| | "memory": "meta-reference-faiss" | |
| | } | |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| local-plus-fireworks-inference | { | Use Fireworks.ai for running LLM inference |
| | "inference": "remote::fireworks", | |
| | "safety": "meta-reference", | |
| | "agentic_system": "meta-reference", | |
| | "memory": "meta-reference-faiss" | |
| | } | |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| local-plus-together-inference | { | Use Together.ai for running LLM inference |
| | "inference": "remote::together", | |
| | "safety": "meta-reference", | |
| | "agentic_system": "meta-reference", | |
| | "memory": "meta-reference-faiss" | |
| | } | |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
</pre> </pre>
As you can see above, each “spec” details the “providers” that make up that spec. For eg. The `local` spec uses the “meta-reference” provider for inference while the `local-ollama` spec relies on a different provider ( ollama ) for inference. As you can see above, each “distribution” details the “providers” it is composed of. For example, `local` uses the “meta-reference” provider for inference while local-ollama relies on a different provider (Ollama) for inference. Similarly, you can use Fireworks or Together.AI for running inference as well.
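The distribution-to-provider relationship in the updated table can be pictured as a nested mapping. The sketch below copies the provider assignments from the new `llama stack list-distributions` output. The `DISTRIBUTIONS` dictionary and the `provider_for` helper are illustrative names, not part of `llama_toolchain`.

```python
# Provider assignments copied from the new `llama stack list-distributions` table.
DISTRIBUTIONS: dict[str, dict[str, str]] = {
    "local": {
        "inference": "meta-reference",
        "memory": "meta-reference-faiss",
        "safety": "meta-reference",
        "agentic_system": "meta-reference",
    },
    "remote": {
        "inference": "remote",
        "safety": "remote",
        "agentic_system": "remote",
        "memory": "remote",
    },
    "local-ollama": {
        "inference": "remote::ollama",
        "safety": "meta-reference",
        "agentic_system": "meta-reference",
        "memory": "meta-reference-faiss",
    },
    "local-plus-fireworks-inference": {
        "inference": "remote::fireworks",
        "safety": "meta-reference",
        "agentic_system": "meta-reference",
        "memory": "meta-reference-faiss",
    },
    "local-plus-together-inference": {
        "inference": "remote::together",
        "safety": "meta-reference",
        "agentic_system": "meta-reference",
        "memory": "meta-reference-faiss",
    },
}


def provider_for(distribution_id: str, api: str) -> str:
    """Return the provider that implements `api` in the given distribution."""
    return DISTRIBUTIONS[distribution_id][api]


if __name__ == "__main__":
    # Show which provider serves inference in each distribution.
    for dist_id in DISTRIBUTIONS:
        print(dist_id, "->", provider_for(dist_id, "inference"))
```

Read this way, the non-`remote` distributions differ only in their `inference` provider, which is the point the paragraph above makes.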
Lets install the fully local implementation of the llama-stack named `local` above. To install a distribution, we run a simple command providing 2 inputs:
- **Distribution Id** of the distribution that we want to install ( as obtained from the list-distributions command )
- A **Name** for the specific build and configuration of this distribution.
To install a distro, we run a simple command providing 2 inputs Let's imagine you are working with a 8B-Instruct model. The following command will build a package (in the form of a Conda environment) _and_ configure it. As part of the configuration, you will be asked for some inputs (model_id, max_seq_len, etc.)
- **Spec Id** of the distribution that we want to install ( as obtained from the list command )
- A **Name** by which this installation will be known locally.
``` ```
llama distribution install --spec local --name local_llama_8b llama stack build local --build-name llama-8b
``` ```
This will create a new conda environment (name can be passed optionally) and install dependencies (via pip) as required by the distro. Once it runs successfully , you should see some outputs in the form:
Once it runs successfully , you should see some outputs in the form
``` ```
llama distribution install --spec local --name local_llama_8b $ llama stack build local --build-name llama-8b
``` ....
<pre style="font-family: monospace;"> ....
Successfully installed cfgv-3.4.0 distlib-0.3.8 identify-2.6.0 libcst-1.4.0 llama_toolchain-0.0.2 moreorless-0.4.0 nodeenv-1.9.1 pre-commit-3.8.0 stdlibs-2024.5.15 toml-0.10.2 tomlkit-0.13.0 trailrunner-1.4.0 ufmt-2.7.0 usort-1.0.8 virtualenv-20.26.3 Successfully installed cfgv-3.4.0 distlib-0.3.8 identify-2.6.0 libcst-1.4.0 llama_toolchain-0.0.2 moreorless-0.4.0 nodeenv-1.9.1 pre-commit-3.8.0 stdlibs-2024.5.15 toml-0.10.2 tomlkit-0.13.0 trailrunner-1.4.0 ufmt-2.7.0 usort-1.0.8 virtualenv-20.26.3
Distribution `local_llama_8b` (with spec local) has been installed successfully! Successfully setup conda environment. Configuring build...
</pre>
Next step is to configure the distribution that you just installed. We provide a simple CLI tool to enable simple configuration. ...
This command will walk you through the configuration process. ...
It will ask for some details like model name, paths to models, etc.
**NOTE**: You will have to download the models if not done already. Follow instructions here on how to download using the llama cli YAML configuration has been written to ~/.llama/builds/stack/env-local-llama-8b.yaml
```
llama distribution configure --name local_llama_8b
``` ```
Here is an example output of how the cli will guide you to fill the configuration: You can re-configure this distribution by running:
<pre style="font-family: monospace;"> ```
Configuring API surface: inference llama stack configure local --build-name llama-8b
```
Here is an example run of how the CLI will guide you to fill the configuration
```
$ llama stack configure local --build-name llama-8b
Configuring API: inference (meta-reference)
Enter value for model (required): Meta-Llama3.1-8B-Instruct Enter value for model (required): Meta-Llama3.1-8B-Instruct
Enter value for quantization (optional): Enter value for quantization (optional):
Enter value for torch_seed (optional): Enter value for torch_seed (optional):
Enter value for max_seq_len (required): 4096 Enter value for max_seq_len (required): 4096
Enter value for max_batch_size (default: 1): 1 Enter value for max_batch_size (default: 1): 1
Configuring API surface: safety Configuring API: safety (meta-reference)
Do you want to configure llama_guard_shield? (y/n): n Do you want to configure llama_guard_shield? (y/n): y
Do you want to configure prompt_guard_shield? (y/n): n Entering sub-configuration for llama_guard_shield:
Configuring API surface: agentic_system Enter value for model (required): Llama-Guard-3-8B
Enter value for excluded_categories (required): []
Enter value for disable_input_check (default: False):
Enter value for disable_output_check (default: False):
Do you want to configure prompt_guard_shield? (y/n): y
Entering sub-configuration for prompt_guard_shield:
Enter value for model (required): Prompt-Guard-86M
...
...
YAML configuration has been written to ~/.llama/builds/stack/env-local-llama-8b.yaml
```
YAML configuration has been written to ~/.llama/distributions/local0/config.yaml As you can see, we did basic configuration above and configured:
</pre> - inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`)
- Llama Guard safety shield with model `Llama-Guard-3-8B`
As you can see, we did basic configuration above and configured inference to run on model Meta-Llama3.1-8B-Instruct ( obtained from the llama model list command ). - Prompt Guard safety shield with model `Prompt-Guard-86M`
For this initial setup we did not set up safety.
For how these configurations are stored as yaml, checkout the file printed at the end of the configuration. For how these configurations are stored as yaml, checkout the file printed at the end of the configuration.
## Step 4: Starting a Distribution and Testing it Note that all configurations as well as models are stored in `~/.llama`
Now lets start the distribution using the cli. ## Step 4: Starting a Llama Stack Distribution and Testing it
```
llama distribution start --name local_llama_8b --port 5000 Now lets start Llama Stack server.
```
You should see the distribution start and print the APIs that it is supporting: You need the YAML configuration file which was written out at the end by the `llama stack build` step.
```
llama stack start ~/.llama/builds/stack/env-local-llama-8b.yaml --port 5000
```
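The build output in this diff reports `~/.llama/builds/stack/env-local-llama-8b.yaml` for distribution `local` and build name `llama-8b`. The sketch below assumes the file name follows the pattern `env-<distribution>-<build-name>.yaml`. That pattern is inferred from this single example and may not hold in general. `stack_config_path` and `start_command` are hypothetical helpers.

```python
from pathlib import Path


def stack_config_path(
    distribution_id: str,
    build_name: str,
    base: Path = Path("~/.llama"),
) -> Path:
    """Guess where the YAML configuration for a build was written.

    ASSUMPTION: the `env-<distribution>-<build-name>.yaml` naming is inferred
    from one example in the README. Call `.expanduser()` on the result before
    opening the file.
    """
    return base / "builds" / "stack" / f"env-{distribution_id}-{build_name}.yaml"


def start_command(config: Path, port: int = 5000) -> list[str]:
    """Assemble the `llama stack start` invocation shown above as an argv list."""
    return ["llama", "stack", "start", str(config), "--port", str(port)]


if __name__ == "__main__":
    config = stack_config_path("local", "llama-8b")
    print(" ".join(start_command(config)))
```

The argv list could be handed to `subprocess.run`, but it is safer to use the path that `llama stack build` actually prints than the guessed one.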
You should see the Stack server start and print the APIs that it is supporting,
```
$ llama stack start ~/.llama/builds/stack/env-local-llama-8b.yaml --port 5000
<pre style="font-family: monospace;">
> initializing model parallel with size 1 > initializing model parallel with size 1
> initializing ddp with size 1 > initializing ddp with size 1
> initializing pipeline with size 1 > initializing pipeline with size 1
@ -376,15 +411,23 @@ INFO: Started server process [453333]
INFO: Waiting for application startup. INFO: Waiting for application startup.
INFO: Application startup complete. INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit) INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
</pre>
Lets test with a client
``` ```
cd /path/to/llama-toolchain
conda activate <env-for-distribution> # ( Eg. local_llama_8b in above example )
python -m llama_toolchain.inference.client localhost 5000
> [!NOTE]
> Configuration is in `~/.llama/builds/stack/env-local-llama-8b.yaml`. Feel free to increase `max_seq_len`.
> [!IMPORTANT]
> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
This server is running a Llama model locally.
Lets test with a client.
```
cd /path/to/llama-stack
conda activate <env> # any environment containing the llama-toolchain pip package will work
python -m llama_toolchain.inference.client localhost 5000
``` ```
This will run the chat completion client and query the distributions /inference/chat_completion API. This will run the chat completion client and query the distributions /inference/chat_completion API.


@ -30,8 +30,8 @@ class LlamaCLIParser:
# Add sub-commands # Add sub-commands
Download.create(subparsers) Download.create(subparsers)
ModelParser.create(subparsers) ModelParser.create(subparsers)
ApiParser.create(subparsers)
StackParser.create(subparsers) StackParser.create(subparsers)
ApiParser.create(subparsers)
# Import sub-commands from agentic_system if they exist # Import sub-commands from agentic_system if they exist
try: try: