Add CLI reference docs (#14)

* Add CLI reference doc

* touchups

* add helptext for download
Dalton Flanagan 2024-07-25 16:56:29 -04:00 committed by GitHub
parent b8aa99b034
commit ec433448f2
3 changed files with 174 additions and 38 deletions


@@ -1,11 +1,12 @@
# llama-toolchain
This repo contains the API specifications for various components of the Llama Stack as well as implementations for some of those APIs like model inference.
The Stack consists of toolchain-apis and agentic-apis. This repo contains the toolchain-apis
The Llama Stack consists of toolchain-apis and agentic-apis. This repo contains the toolchain-apis.
## Installation
You can install this repository as a [package](https://pypi.org/project/llama-toolchain/) by just doing `pip install llama-toolchain`
You can install this repository as a [package](https://pypi.org/project/llama-toolchain/) with `pip install llama-toolchain`
If you want to install from source:
@@ -21,44 +22,13 @@ cd llama-toolchain
pip install -e .
```
## Test with cli
## The Llama CLI
We have built a llama cli to make it easy to configure / run parts of the toolchain
The `llama` CLI makes it easy to configure and run the Llama toolchain. Read the [CLI reference](docs/cli_reference.md) for details.
```
llama --help
usage: llama [-h] {download,inference,model,agentic_system} ...
## Appendix: Running FP8
Welcome to the LLama cli
If you want to run FP8, you need the `fbgemm-gpu` package which requires `torch >= 2.4.0` (currently only in nightly, but releasing shortly...)
options:
-h, --help show this help message and exit
subcommands:
{download,inference,model,agentic_system}
```
There are several subcommands to help get you started
## Start inference server that can run the llama models
```bash
llama inference configure
llama inference start
```
## Test client
```bash
python -m llama_toolchain.inference.client localhost 5000
Initializing client for http://localhost:5000
User>hello world, help me out here
Assistant> Hello! I'd be delighted to help you out. What's on your mind? Do you have a question, a problem, or just need someone to chat with? I'm all ears!
```
## Running FP8
You need `fbgemm-gpu` package which requires torch >= 2.4.0 (currently only in nightly, but releasing shortly...).
```bash
ENV=fp8_env

docs/cli_reference.md (new file, 166 lines)

@@ -0,0 +1,166 @@
# Llama CLI Reference
The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
```
$ llama --help
Welcome to the Llama CLI
Usage: llama [-h] {download,inference,model} ...
Options:
-h, --help Show this help message and exit
Subcommands:
{download,inference,model}
```
## Step 1. Get the models
First, you need models locally. You can get the models from [HuggingFace](https://huggingface.co/meta-llama) or [directly from Meta](https://llama.meta.com/llama-downloads/). The download command streamlines the process.
```
$ llama download --help
usage: llama download [-h] [--hf-token HF_TOKEN] [--ignore-patterns IGNORE_PATTERNS] repo_id
Download a model from the Hugging Face Hub
positional arguments:
repo_id Name of the repository on Hugging Face Hub eg. llhf/Meta-Llama-3.1-70B-Instruct
options:
-h, --help show this help message and exit
--hf-token HF_TOKEN Hugging Face API token. Needed for gated models like Llama2. Will also try to read environment variable `HF_TOKEN` as default.
--ignore-patterns IGNORE_PATTERNS
If provided, files matching any of the patterns are not downloaded. Defaults to ignoring safetensors files to avoid downloading duplicate weights.
# Here are some examples on how to use this command:
llama download --repo-id meta-llama/Llama-2-7b-hf --hf-token <HF_TOKEN>
llama download --repo-id meta-llama/Llama-2-7b-hf --output-dir /data/my_custom_dir --hf-token <HF_TOKEN>
HF_TOKEN=<HF_TOKEN> llama download --repo-id meta-llama/Llama-2-7b-hf
The output directory will be used to load models and tokenizers for inference.
```
1. Create and get a Hugging Face access token [here](https://huggingface.co/settings/tokens)
2. Set the `HF_TOKEN` environment variable
```
export HF_TOKEN=YOUR_TOKEN_HERE
llama download meta-llama/Meta-Llama-3.1-70B-Instruct
```
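For reference, the `llama download` command is a thin layer over the Hugging Face Hub download machinery. Below is a rough Python sketch of the equivalent of the step above; the use of `huggingface_hub.snapshot_download` and the exact defaults are assumptions based on the help text, not the CLI's actual implementation.
```python
# Rough, illustrative equivalent of `llama download`; a sketch only, not the
# CLI's actual implementation.
import os

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    token=os.environ.get("HF_TOKEN"),   # same token the CLI reads from the environment
    ignore_patterns=["*.safetensors"],  # mirrors the documented default of skipping duplicate weights
)
print(f"Model files are in {local_path}")
```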
## Step 2. Understand the models
The `llama model` command helps you explore the model's interface.
```
$ llama model --help
usage: llama model [-h] {template} ...
Describe llama model interfaces
options:
-h, --help show this help message and exit
model_subcommands:
{template}
Example: llama model <subcommand> <options>
```
You can run `llama model template` to see all of the templates and their tokens:
```
$ llama model template
system-message-builtin-and-custom-tools
system-message-builtin-tools-only
system-message-custom-tools-only
system-message-default
assistant-message-builtin-tool-call
assistant-message-custom-tool-call
assistant-message-default
tool-message-failure
tool-message-success
user-message-default
```
And fetch an example by passing it to `--template`:
```
llama model template --template tool-message-success
<|start_header_id|>ipython<|end_header_id|>
completed
[stdout]{"results":["something something"]}[/stdout]<|eot_id|>
```
## Step 3. Start the inference server
Once you have a model, the magic begins with inference. The `llama inference` command can help you configure and launch the Llama Stack inference server.
```
$ llama inference --help
usage: llama inference [-h] {start,configure} ...
Run inference on a llama model
options:
-h, --help show this help message and exit
inference_subcommands:
{start,configure}
Example: llama inference start <options>
```
Run `llama inference configure` to set up your configuration at `~/.llama/configs/inference.yaml`. You'll set variables like:
* the directory where you stored the models you downloaded from step 1
* the model parallel size (1 for 8B models, 8 for 70B/405B)
Once you've configured the inference server, run `llama inference start`. The model will load onto the GPU, and you'll be able to send requests once you see that the server is ready.
If you want to use a different model, re-run `llama inference configure` to update the model path and `llama inference start` to relaunch the server.
Run `llama inference --help` for more information.
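Once the server is up, you can smoke-test it with the bundled test client mentioned in the repository README (`python -m llama_toolchain.inference.client localhost 5000`). The sketch below simply invokes that module from Python, assuming the server is listening on localhost port 5000:
```python
# Smoke-test a running inference server by invoking the bundled test client
# module; equivalent to running it directly from the shell.
import subprocess

subprocess.run(
    ["python", "-m", "llama_toolchain.inference.client", "localhost", "5000"],
    check=True,
)
```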
## Step 4. Start the agentic system
The `llama agentic_system` command helps you configure and launch agentic systems. The `llama agentic_system configure` command sets up the configuration file the agentic code expects, and the `llama agentic_system start_app` command streamlines launching.
For example, let's run the included chat app:
```
llama agentic_system configure
llama agentic_system start_app chat
```
For more information run `llama agentic_system --help`.


@@ -17,7 +17,7 @@ class LlamaCLIParser:
    def __init__(self):
        self.parser = argparse.ArgumentParser(
            prog="llama",
            description="Welcome to the LLama cli",
            description="Welcome to the Llama CLI",
            add_help=True,
        )
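For orientation, the subcommands shown in the CLI help output hang off this parser via argparse subparsers. A minimal, hypothetical sketch of that pattern follows; apart from `prog` and `description`, which appear in the diff above, the names and handlers here are illustrative and not the repository's actual code.
```python
# Hypothetical sketch of the argparse subcommand pattern behind the `llama` CLI.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        prog="llama",
        description="Welcome to the Llama CLI",
        add_help=True,
    )
    subparsers = parser.add_subparsers(dest="command", metavar="{download,inference,model}")

    # Each subcommand registers its own arguments on a child parser.
    download = subparsers.add_parser("download", help="Download a model from the Hugging Face Hub")
    download.add_argument("repo_id", help="Repository on the Hugging Face Hub")
    download.add_argument("--hf-token", default=None, help="Hugging Face API token")

    args = parser.parse_args()
    if args.command == "download":
        print(f"Would download {args.repo_id}")  # placeholder handler
    else:
        parser.print_help()


if __name__ == "__main__":
    main()
```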