# Llama CLI Reference

The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
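
If you have not installed the package yet, a minimal setup might look like the following (a sketch assuming a pip-based install into your current Python environment):

```bash
# install the package that provides the `llama` CLI
pip install llama-toolchain

# confirm the CLI is on your path
llama --help
```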

### Subcommands

1. `download`: Supports downloading models from Meta or Hugging Face.
2. `model`: Lists available models and their properties.
3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](/docs/cli_reference.md#step-3-building-configuring-and-running-llama-stack-servers).

### Sample Usage

```
llama --help
```
<pre style="font-family: monospace;">
usage: llama [-h] {download,model,stack} ...

Welcome to the Llama CLI

options:
  -h, --help            show this help message and exit

subcommands:
  {download,model,stack}
</pre>

## Step 1. Get the models

You first need to have models downloaded locally.

To download any model you need the **Model Descriptor**.
This can be obtained by running the command:
```
llama model list
```

You should see a table like this:

| Model Descriptor | HuggingFace Repo | Context Length | Hardware Requirements |
|---|---|---|---|
| Meta-Llama3.1-8B | meta-llama/Meta-Llama-3.1-8B | 128K | 1 GPU, each >= 20GB VRAM |
| Meta-Llama3.1-70B | meta-llama/Meta-Llama-3.1-70B | 128K | 8 GPUs, each >= 20GB VRAM |
| Meta-Llama3.1-405B:bf16-mp8 | | 128K | 8 GPUs, each >= 120GB VRAM |
| Meta-Llama3.1-405B | meta-llama/Meta-Llama-3.1-405B-FP8 | 128K | 8 GPUs, each >= 70GB VRAM |
| Meta-Llama3.1-405B:bf16-mp16 | meta-llama/Meta-Llama-3.1-405B | 128K | 16 GPUs, each >= 70GB VRAM |
| Meta-Llama3.1-8B-Instruct | meta-llama/Meta-Llama-3.1-8B-Instruct | 128K | 1 GPU, each >= 20GB VRAM |
| Meta-Llama3.1-70B-Instruct | meta-llama/Meta-Llama-3.1-70B-Instruct | 128K | 8 GPUs, each >= 20GB VRAM |
| Meta-Llama3.1-405B-Instruct:bf16-mp8 | | 128K | 8 GPUs, each >= 120GB VRAM |
| Meta-Llama3.1-405B-Instruct | meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 | 128K | 8 GPUs, each >= 70GB VRAM |
| Meta-Llama3.1-405B-Instruct:bf16-mp16 | meta-llama/Meta-Llama-3.1-405B-Instruct | 128K | 16 GPUs, each >= 70GB VRAM |
| Llama-Guard-3-8B | meta-llama/Llama-Guard-3-8B | 128K | 1 GPU, each >= 20GB VRAM |
| Llama-Guard-3-8B:int8-mp1 | meta-llama/Llama-Guard-3-8B-INT8 | 128K | 1 GPU, each >= 10GB VRAM |
| Prompt-Guard-86M | meta-llama/Prompt-Guard-86M | 128K | 1 GPU, each >= 1GB VRAM |

To download models, you can use the `llama download` command.

#### Downloading from [Meta](https://llama.meta.com/llama-downloads/)

Here is an example download command to get the 8B/70B Instruct models. You will need a META_URL, which can be obtained from [here](https://llama.meta.com/docs/getting_the_models/meta/).

Download the required checkpoints using the following commands:
```bash
# download the 8B model, this can be run on a single GPU
llama download --source meta --model-id Meta-Llama3.1-8B-Instruct --meta-url META_URL

# you can also get the 70B model, this will require 8 GPUs however
llama download --source meta --model-id Meta-Llama3.1-70B-Instruct --meta-url META_URL

# llama-agents have safety enabled by default. For this, you will need
# safety models -- Llama-Guard and Prompt-Guard
llama download --source meta --model-id Prompt-Guard-86M --meta-url META_URL
llama download --source meta --model-id Llama-Guard-3-8B --meta-url META_URL
```

#### Downloading from [Huggingface](https://huggingface.co/meta-llama)

Essentially, the same commands above work; just replace `--source meta` with `--source huggingface`.

```bash
llama download --source huggingface --model-id Meta-Llama3.1-8B-Instruct --hf-token <HF_TOKEN>

llama download --source huggingface --model-id Meta-Llama3.1-70B-Instruct --hf-token <HF_TOKEN>

llama download --source huggingface --model-id Llama-Guard-3-8B --ignore-patterns *original*
llama download --source huggingface --model-id Prompt-Guard-86M --ignore-patterns *original*
```

**Important:** Set your environment variable `HF_TOKEN` or pass in `--hf-token` to the command to validate your access. You can find your token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

> **Tip:** The default for `llama download` is to run with `--ignore-patterns *.safetensors`, since we use the `.pth` files in the `original` folder. For Llama Guard and Prompt Guard, however, we need safetensors. Hence, please run with `--ignore-patterns *original*` so that safetensors are downloaded and `.pth` files are ignored.
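
For example, one way to set this up in your shell before running the download commands above (the token value is a placeholder):

```bash
# export your Hugging Face token once, instead of passing --hf-token on every command
export HF_TOKEN=<HF_TOKEN>

llama download --source huggingface --model-id Meta-Llama3.1-8B-Instruct
```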

#### Downloading via Ollama

If you're already using Ollama, we also have a supported Llama Stack distribution `local-ollama`, and you can continue to use Ollama for managing model downloads.

```
ollama pull llama3.1:8b-instruct-fp16
ollama pull llama3.1:70b-instruct-fp16
```

> [!NOTE]
> Only the above two models are currently supported by Ollama.

## Step 2: Understand the models
The `llama model` command helps you explore the model's interface.

### 2.1 Subcommands
1. `download`: Download the model from different sources (meta, huggingface).
2. `list`: Lists all the models available for download, with the hardware requirements to deploy them.
3. `template`: Shows the prompt templates (and their special tokens) used for various roles and scenarios.
4. `describe`: Describes all the properties of the model.

### 2.2 Sample Usage

`llama model <subcommand> <options>`

```
llama model --help
```
<pre style="font-family: monospace;">
usage: llama model [-h] {download,list,template,describe} ...

Work with llama models

options:
  -h, --help            show this help message and exit

model_subcommands:
  {download,list,template,describe}
</pre>

You can use the `describe` command to learn more about a model:
```
llama model describe -m Meta-Llama3.1-8B-Instruct
```
### 2.3 Describe

<pre style="font-family: monospace;">
+-----------------------------+---------------------------------------+
| Model                       | Meta-Llama3.1-8B-Instruct             |
+-----------------------------+---------------------------------------+
| HuggingFace ID              | meta-llama/Meta-Llama-3.1-8B-Instruct |
+-----------------------------+---------------------------------------+
| Description                 | Llama 3.1 8b instruct model           |
+-----------------------------+---------------------------------------+
| Context Length              | 128K tokens                           |
+-----------------------------+---------------------------------------+
| Weights format              | bf16                                  |
+-----------------------------+---------------------------------------+
| Model params.json           | {                                     |
|                             |     "dim": 4096,                      |
|                             |     "n_layers": 32,                   |
|                             |     "n_heads": 32,                    |
|                             |     "n_kv_heads": 8,                  |
|                             |     "vocab_size": 128256,             |
|                             |     "ffn_dim_multiplier": 1.3,        |
|                             |     "multiple_of": 1024,              |
|                             |     "norm_eps": 1e-05,                |
|                             |     "rope_theta": 500000.0,           |
|                             |     "use_scaled_rope": true           |
|                             | }                                     |
+-----------------------------+---------------------------------------+
| Recommended sampling params | {                                     |
|                             |     "strategy": "top_p",              |
|                             |     "temperature": 1.0,               |
|                             |     "top_p": 0.9,                     |
|                             |     "top_k": 0                        |
|                             | }                                     |
+-----------------------------+---------------------------------------+
</pre>
### 2.4 Template
You can even run `llama model template` to see all of the templates and their tokens:

```
llama model template
```

<pre style="font-family: monospace;">
+-----------+---------------------------------+
| Role      | Template Name                   |
+-----------+---------------------------------+
| user      | user-default                    |
| assistant | assistant-builtin-tool-call     |
| assistant | assistant-custom-tool-call      |
| assistant | assistant-default               |
| system    | system-builtin-and-custom-tools |
| system    | system-builtin-tools-only       |
| system    | system-custom-tools-only        |
| system    | system-default                  |
| tool      | tool-success                    |
| tool      | tool-failure                    |
+-----------+---------------------------------+
</pre>

And fetch an example by passing it to `--name`:
```
llama model template --name tool-success
```

<pre style="font-family: monospace;">
+----------+----------------------------------------------------------------+
| Name     | tool-success                                                   |
+----------+----------------------------------------------------------------+
| Template | <|start_header_id|>ipython<|end_header_id|>                    |
|          |                                                                |
|          | completed                                                      |
|          | [stdout]{"results":["something                                 |
|          | something"]}[/stdout]<|eot_id|>                                |
|          |                                                                |
+----------+----------------------------------------------------------------+
| Notes    | Note ipython header and [stdout]                               |
+----------+----------------------------------------------------------------+
</pre>

Or:
```
llama model template --name system-builtin-tools-only
```

<pre style="font-family: monospace;">
+----------+--------------------------------------------+
| Name     | system-builtin-tools-only                  |
+----------+--------------------------------------------+
| Template | <|start_header_id|>system<|end_header_id|> |
|          |                                            |
|          | Environment: ipython                       |
|          | Tools: brave_search, wolfram_alpha         |
|          |                                            |
|          | Cutting Knowledge Date: December 2023      |
|          | Today Date: 21 August 2024                 |
|          | <|eot_id|>                                 |
|          |                                            |
+----------+--------------------------------------------+
| Notes    |                                            |
+----------+--------------------------------------------+
</pre>

These commands can help you understand the model interface and how prompts / messages are formatted for various scenarios.

**NOTE**: Outputs in the terminal are color-printed to show special tokens.

## Step 3: Building and Configuring Llama Stack Distributions

- Please see our [Getting Started](getting_started.md) guide for details.

### Step 3.1. Build
In the following steps, imagine we'll be working with the `Meta-Llama3.1-8B-Instruct` model. We will name our build `8b-instruct` to help us remember the config. We will start building our distribution (in the form of a Conda environment or a Docker image). In this step, we will specify:
- `name`: the name for our distribution (e.g. `8b-instruct`)
- `image_type`: our build image type (`conda | docker`), i.e. whether to build the distribution as a Conda environment or a Docker image
- `distribution_spec`: our distribution specs for specifying API providers
  - `description`: a short description of the configurations for the distribution
  - `providers`: specifies the underlying implementation for serving each API endpoint

#### Build a local distribution with conda
The following command and specifications allow you to get started with building.
```
llama stack build <path/to/config>
```
- You will be required to pass in a file path to the build config file (e.g. `./llama_toolchain/configs/distributions/conda/local-conda-example-build.yaml`). We provide example build config files for configuring different types of distributions in the `./llama_toolchain/configs/distributions/` folder.

The file will contain the following:
```
$ cat ./llama_toolchain/configs/distributions/conda/local-conda-example-build.yaml

name: 8b-instruct
distribution_spec:
  distribution_type: local
  description: Use code from `llama_toolchain` itself to serve all llama stack APIs
  docker_image: null
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: conda
```

You may run the `llama stack build` command with `--name` to override the name of your distribution.
```
$ llama stack build ~/.llama/distributions/conda/8b-instruct-build.yaml --name 8b-instruct
...
...
Build spec configuration saved at ~/.llama/distributions/conda/8b-instruct-build.yaml
```

After this step is complete, a file named `8b-instruct-build.yaml` will be generated and saved at `~/.llama/distributions/conda/8b-instruct-build.yaml`.

#### How to build a distribution with different API providers using configs
To specify a different API provider, we can change the `distribution_spec` in our `<name>-build.yaml` config. For example, the following build spec allows you to build a distribution using TGI as the inference API provider.

```
$ cat ./llama_toolchain/configs/distributions/conda/local-tgi-conda-example-build.yaml

name: local-tgi-conda-example
distribution_spec:
  description: Use TGI (local or with Hugging Face Inference Endpoints) for running LLM inference. When using HF Inference Endpoints, you must provide the name of the endpoint.
  docker_image: null
  providers:
    inference: remote::tgi
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: conda
```

The following command allows you to build a distribution with TGI as the inference API provider, with the name `tgi`.
```
llama stack build --config ./llama_toolchain/configs/distributions/conda/local-tgi-conda-example-build.yaml --name tgi
```

We provide some example build configs to help you get started with building with different API providers.
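
One way to browse what is available (a sketch, assuming you are working from a checkout of the repository):

```bash
# list the example build configs shipped with the toolchain
ls ./llama_toolchain/configs/distributions/conda/
ls ./llama_toolchain/configs/distributions/docker/
```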

#### How to build a distribution with a Docker image
To build a Docker image, simply change the `image_type` to `docker` in your `<name>-build.yaml` file, and run `llama stack build --config <name>-build.yaml`.

```
$ cat ./llama_toolchain/configs/distributions/docker/local-docker-example-build.yaml

name: local-docker-example
distribution_spec:
  description: Use code from `llama_toolchain` itself to serve all llama stack APIs
  docker_image: null
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: docker
```

The following command allows you to build a Docker image with the name `docker-local`:
```
llama stack build --config ./llama_toolchain/configs/distributions/docker/local-docker-example-build.yaml --name docker-local

Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
FROM python:3.10-slim
WORKDIR /app
...
...
You can run it with: podman run -p 8000:8000 llamastack-docker-local
Build spec configuration saved at /home/xiyan/.llama/distributions/docker/docker-local-build.yaml
```

### Step 3.2. Configure
After our distribution is built (either as a Docker image or a Conda environment), we will run the following command to configure it:
```
llama stack configure [<path/to/name.build.yaml> | <docker-image-name>]
```
- For `conda` environments: `<path/to/name.build.yaml>` is the generated build spec saved in Step 3.1.
- For `docker` images downloaded from Dockerhub, you can also use `<docker-image-name>` as the argument.
  - Run `docker images` to check the list of available images on your machine, as shown in the sketch below.
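
For the Docker case, a minimal sketch might look like this (assuming the `llamastack-docker-local` image built earlier):

```bash
# list locally available images
docker images

# configure the distribution directly from the image name
llama stack configure llamastack-docker-local
```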

For example, here is what configuring the conda build from Step 3.1 looks like:

```
$ llama stack configure ~/.llama/distributions/conda/8b-instruct-build.yaml

Configuring API: inference (meta-reference)
Enter value for model (existing: Meta-Llama3.1-8B-Instruct) (required):
Enter value for quantization (optional):
Enter value for torch_seed (optional):
Enter value for max_seq_len (existing: 4096) (required):
Enter value for max_batch_size (existing: 1) (required):

Configuring API: memory (meta-reference-faiss)

Configuring API: safety (meta-reference)
Do you want to configure llama_guard_shield? (y/n): y
Entering sub-configuration for llama_guard_shield:
Enter value for model (default: Llama-Guard-3-8B) (required):
Enter value for excluded_categories (default: []) (required):
Enter value for disable_input_check (default: False) (required):
Enter value for disable_output_check (default: False) (required):
Do you want to configure prompt_guard_shield? (y/n): y
Entering sub-configuration for prompt_guard_shield:
Enter value for model (default: Prompt-Guard-86M) (required):

Configuring API: agentic_system (meta-reference)
Enter value for brave_search_api_key (optional):
Enter value for bing_search_api_key (optional):
Enter value for wolfram_api_key (optional):

Configuring API: telemetry (console)

YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml
```

After this step is successful, you should be able to find a run configuration spec in `~/.llama/builds/conda/8b-instruct-run.yaml`. You may edit this file to change the settings.

As you can see, we did basic configuration above and configured:
- inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`)
- the Llama Guard safety shield with model `Llama-Guard-3-8B`
- the Prompt Guard safety shield with model `Prompt-Guard-86M`

To see how these configurations are stored as YAML, check out the file printed at the end of the configuration step.

Note that all configurations as well as models are stored in `~/.llama`.

#### Step 3.2.1 API Keys for Tools

You will be asked for API keys for Agentic System tools during the `llama stack configure` step shown above.

Tools that the model supports and which need API keys:
- Brave for web search (https://api.search.brave.com/register)
- Wolfram for math operations (https://developer.wolframalpha.com/)

> **Tip:** If you do not have API keys, you can still run the app without the model having access to the tools.

### Step 3.3. Run
Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file that was written out at the end of the `llama stack configure` step.

```
llama stack run ~/.llama/builds/conda/8b-instruct-run.yaml
```

You should see the Llama Stack server start and print the APIs that it supports:

```
$ llama stack run ~/.llama/builds/local/conda/8b-instruct.yaml

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 19.28 seconds
NCCL version 2.20.5+cuda12.4
Finished model load YES READY
Serving POST /inference/batch_chat_completion
Serving POST /inference/batch_completion
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /safety/run_shields
Serving POST /agentic_system/memory_bank/attach
Serving POST /agentic_system/create
Serving POST /agentic_system/session/create
Serving POST /agentic_system/turn/create
Serving POST /agentic_system/delete
Serving POST /agentic_system/session/delete
Serving POST /agentic_system/memory_bank/detach
Serving POST /agentic_system/session/get
Serving POST /agentic_system/step/get
Serving POST /agentic_system/turn/get
Listening on :::5000
INFO:     Started server process [453333]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```

> [!NOTE]
> The configuration is in `~/.llama/builds/local/conda/8b-instruct.yaml`. Feel free to increase `max_seq_len`.

> [!IMPORTANT]
> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.

This server is running a Llama model locally.

### Step 3.4 Test with Client
Once the server is set up, we can test it with a client to see example outputs.
```
cd /path/to/llama-stack
conda activate <env>  # any environment containing the llama-toolchain pip package will work

python -m llama_toolchain.inference.client localhost 5000
```

This will run the chat completion client and query the distribution's `/inference/chat_completion` API.

Here is an example output:
```
Initializing client for http://localhost:5000
User>hello world, troll me in two-paragraphs about 42

Assistant> You think you're so smart, don't you? You think you can just waltz in here and ask about 42, like it's some kind of trivial matter. Well, let me tell you, 42 is not just a number, it's a way of life. It's the answer to the ultimate question of life, the universe, and everything, according to Douglas Adams' magnum opus, "The Hitchhiker's Guide to the Galaxy". But do you know what's even more interesting about 42? It's that it's not actually the answer to anything, it's just a number that some guy made up to sound profound.

You know what's even more hilarious? People like you who think they can just Google "42" and suddenly become experts on the subject. Newsflash: you're not a supercomputer, you're just a human being with a fragile ego and a penchant for thinking you're smarter than you actually are. 42 is just a number, a meaningless collection of digits that holds no significance whatsoever. So go ahead, keep thinking you're so clever, but deep down, you're just a pawn in the grand game of life, and 42 is just a silly little number that's been used to make you feel like you're part of something bigger than yourself. Ha!
```
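
If you prefer to hit the HTTP endpoint directly, a request might look roughly like the following. This is only a sketch: the endpoint path comes from the server log above, but the JSON field names (`model`, `messages`, `role`, `content`) are assumptions, so consult the API spec for the exact schema.

```bash
# hypothetical direct call to the chat completion endpoint started above;
# the request body fields are assumptions, not the confirmed schema
curl -X POST http://localhost:5000/inference/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "hello, what can you do?"}]
      }'
```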

Similarly, you can test safety (if you configured the llama-guard and/or prompt-guard shields) by running:

```
python -m llama_toolchain.safety.client localhost 5000
```

You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/sdk_examples) repo.