Add CLI reference docs (#14)

* Add CLI reference doc

* touchups

* add helptext for download
Dalton Flanagan 2024-07-25 16:56:29 -04:00 committed by GitHub
parent b8aa99b034
commit ec433448f2
3 changed files with 174 additions and 38 deletions


@@ -1,11 +1,12 @@
# llama-toolchain
This repo contains the API specifications for various components of the Llama Stack as well as implementations for some of those APIs like model inference.
The Stack consists of toolchain-apis and agentic-apis. This repo contains the toolchain-apis
The Llama Stack consists of toolchain-apis and agentic-apis. This repo contains the toolchain-apis.
## Installation
You can install this repository as a [package](https://pypi.org/project/llama-toolchain/) by just doing `pip install llama-toolchain`
You can install this repository as a [package](https://pypi.org/project/llama-toolchain/) with `pip install llama-toolchain`
If you want to install from source:
@@ -21,44 +22,13 @@ cd llama-toolchain
pip install -e .
```
## Test with cli
## The Llama CLI
We have built a llama cli to make it easy to configure / run parts of the toolchain
The `llama` CLI makes it easy to configure and run the Llama toolchain. Read the [CLI reference](docs/cli_reference.md) for details.
```
llama --help
usage: llama [-h] {download,inference,model,agentic_system} ...
## Appendix: Running FP8
Welcome to the LLama cli
If you want to run FP8, you need the `fbgemm-gpu` package which requires `torch >= 2.4.0` (currently only in nightly, but releasing shortly...)
options:
-h, --help show this help message and exit
subcommands:
{download,inference,model,agentic_system}
```
There are several subcommands to help get you started
## Start inference server that can run the llama models
```bash
llama inference configure
llama inference start
```
## Test client
```bash
python -m llama_toolchain.inference.client localhost 5000
Initializing client for http://localhost:5000
User>hello world, help me out here
Assistant> Hello! I'd be delighted to help you out. What's on your mind? Do you have a question, a problem, or just need someone to chat with? I'm all ears!
```
## Running FP8
You need `fbgemm-gpu` package which requires torch >= 2.4.0 (currently only in nightly, but releasing shortly...).
```bash
ENV=fp8_env

docs/cli_reference.md (new file, 166 lines)

@@ -0,0 +1,166 @@
# Llama CLI Reference
The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
```
$ llama --help
Welcome to the Llama CLI
Usage: llama [-h] {download,inference,model} ...
Options:
-h, --help Show this help message and exit
Subcommands:
{download,inference,model}
```
## Step 1. Get the models
First, you need models locally. You can get the models from [HuggingFace](https://huggingface.co/meta-llama) or [directly from Meta](https://llama.meta.com/llama-downloads/). The download command streamlines the process.
```
$ llama download --help
usage: llama download [-h] [--hf-token HF_TOKEN] [--ignore-patterns IGNORE_PATTERNS] repo_id
Download a model from the Hugging Face Hub
positional arguments:
repo_id Name of the repository on Hugging Face Hub eg. llhf/Meta-Llama-3.1-70B-Instruct
options:
-h, --help show this help message and exit
--hf-token HF_TOKEN Hugging Face API token. Needed for gated models like Llama2. Will also try to read environment variable `HF_TOKEN` as default.
--ignore-patterns IGNORE_PATTERNS
If provided, files matching any of the patterns are not downloaded. Defaults to ignoring safetensors files to avoid downloading duplicate weights.
# Here are some examples on how to use this command:
llama download --repo-id meta-llama/Llama-2-7b-hf --hf-token <HF_TOKEN>
llama download --repo-id meta-llama/Llama-2-7b-hf --output-dir /data/my_custom_dir --hf-token <HF_TOKEN>
HF_TOKEN=<HF_TOKEN> llama download --repo-id meta-llama/Llama-2-7b-hf
The output directory will be used to load models and tokenizers for inference.
```
1. Create and get a Hugging Face access token [here](https://huggingface.co/settings/tokens)
2. Set the `HF_TOKEN` environment variable
```
export HF_TOKEN=YOUR_TOKEN_HERE
llama download meta-llama/Meta-Llama-3.1-70B-Instruct
```
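For reference, the `llama download` command is a thin layer over the Hugging Face Hub download machinery. Below is a rough Python sketch of the equivalent of the step above; the use of `huggingface_hub.snapshot_download` and the exact defaults are assumptions based on the help text, not the CLI's actual implementation.
```python
# Rough, illustrative equivalent of `llama download`; a sketch only, not the
# CLI's actual implementation.
import os

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
    token=os.environ.get("HF_TOKEN"),   # same token the CLI reads from the environment
    ignore_patterns=["*.safetensors"],  # mirrors the documented default of skipping duplicate weights
)
print(f"Model files are in {local_path}")
```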
## Step 2. Understand the models
The `llama model` command helps you explore the model's interface.
```
$ llama model --help
usage: llama model [-h] {template} ...
Describe llama model interfaces
options:
-h, --help show this help message and exit
model_subcommands:
{template}
Example: llama model <subcommand> <options>
```
You can run `llama model template` to see all of the templates and their tokens:
```
$ llama model template
system-message-builtin-and-custom-tools
system-message-builtin-tools-only
system-message-custom-tools-only
system-message-default
assistant-message-builtin-tool-call
assistant-message-custom-tool-call
assistant-message-default
tool-message-failure
tool-message-success
user-message-default
```
And fetch an example by passing it to `--template`:
```
llama model template --template tool-message-success
<|start_header_id|>ipython<|end_header_id|>
completed
[stdout]{"results":["something something"]}[/stdout]<|eot_id|>
```
## Step 3. Start the inference server
Once you have a model, the magic begins with inference. The `llama inference` command can help you configure and launch the Llama Stack inference server.
```
$ llama inference --help
usage: llama inference [-h] {start,configure} ...
Run inference on a llama model
options:
-h, --help show this help message and exit
inference_subcommands:
{start,configure}
Example: llama inference start <options>
```
Run `llama inference configure` to set up your configuration at `~/.llama/configs/inference.yaml`. You'll set variables like:
* the directory where you stored the models you downloaded from step 1
* the model parallel size (1 for 8B models, 8 for 70B/405B)
Once you've configured the inference server, run `llama inference start`. The model will load onto the GPU, and you'll be able to send requests once you see that the server is ready.
If you want to use a different model, re-run `llama inference configure` to update the model path and `llama inference start` to relaunch the server.
Run `llama inference --help` for more information.
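Once the server is up, you can smoke-test it with the bundled test client mentioned in the repository README (`python -m llama_toolchain.inference.client localhost 5000`). The sketch below simply invokes that module from Python, assuming the server is listening on localhost port 5000:
```python
# Smoke-test a running inference server by invoking the bundled test client
# module; equivalent to running it directly from the shell.
import subprocess

subprocess.run(
    ["python", "-m", "llama_toolchain.inference.client", "localhost", "5000"],
    check=True,
)
```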
## Step 4. Start the agentic system
The `llama agentic_system` command helps you configure and launch agentic systems. The `llama agentic_system configure` command sets up the configuration file the agentic code expects, and the `llama agentic_system start_app` command streamlines launching.
For example, let's run the included chat app:
```
llama agentic_system configure
llama agentic_system start_app chat
```
For more information run `llama agentic_system --help`.


@@ -17,7 +17,7 @@ class LlamaCLIParser:
    def __init__(self):
        self.parser = argparse.ArgumentParser(
            prog="llama",
            description="Welcome to the LLama cli",
            description="Welcome to the Llama CLI",
            add_help=True,
        )
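For orientation, the subcommands shown in the CLI help output hang off this parser via argparse subparsers. A minimal, hypothetical sketch of that pattern follows; apart from `prog` and `description`, which appear in the diff above, the names and handlers here are illustrative and not the repository's actual code.
```python
# Hypothetical sketch of the argparse subcommand pattern behind the `llama` CLI.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(
        prog="llama",
        description="Welcome to the Llama CLI",
        add_help=True,
    )
    subparsers = parser.add_subparsers(dest="command", metavar="{download,inference,model}")

    # Each subcommand registers its own arguments on a child parser.
    download = subparsers.add_parser("download", help="Download a model from the Hugging Face Hub")
    download.add_argument("repo_id", help="Repository on the Hugging Face Hub")
    download.add_argument("--hf-token", default=None, help="Hugging Face API token")

    args = parser.parse_args()
    if args.command == "download":
        print(f"Would download {args.repo_id}")  # placeholder handler
    else:
        parser.print_help()


if __name__ == "__main__":
    main()
```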