# Llama CLI Reference
The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
```
$ llama --help
Welcome to the Llama CLI

Usage: llama [-h] {download,inference,model} ...

Options:
  -h, --help  Show this help message and exit

Subcommands:
  {download,inference,model}
```
## Step 1. Get the models
First, you need models locally. You can get the models from [HuggingFace](https://huggingface.co/meta-llama) or [directly from Meta](https://llama.meta.com/llama-downloads/). The download command streamlines the process.
```
$ llama download --help
usage: llama download [-h] [--hf-token HF_TOKEN] [--ignore-patterns IGNORE_PATTERNS] repo_id

Download a model from the Hugging Face Hub

positional arguments:
  repo_id               Name of the repository on Hugging Face Hub eg. llhf/Meta-Llama-3.1-70B-Instruct

options:
  -h, --help            show this help message and exit
  --hf-token HF_TOKEN   Hugging Face API token. Needed for gated models like Llama2. Will also try to read environment variable `HF_TOKEN` as default.
  --ignore-patterns IGNORE_PATTERNS
                        If provided, files matching any of the patterns are not downloaded. Defaults to ignoring safetensors files to avoid downloading duplicate weights.

# Here are some examples on how to use this command:

llama download --repo-id meta-llama/Llama-2-7b-hf --hf-token <HF_TOKEN>
llama download --repo-id meta-llama/Llama-2-7b-hf --output-dir /data/my_custom_dir --hf-token <HF_TOKEN>
HF_TOKEN=<HF_TOKEN> llama download --repo-id meta-llama/Llama-2-7b-hf

The output directory will be used to load models and tokenizers for inference.
```
1. Create and get a Hugging Face access token [here](https://huggingface.co/settings/tokens)
2. Set the `HF_TOKEN` environment variable
```
export HF_TOKEN=YOUR_TOKEN_HERE
llama download meta-llama/Meta-Llama-3.1-70B-Instruct
```
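If you only need a smaller model, the same pattern applies. The example below assumes the 8B instruct checkpoint follows the same repository naming scheme on the Hub:
```
export HF_TOKEN=YOUR_TOKEN_HERE
llama download meta-llama/Meta-Llama-3.1-8B-Instruct
```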
## Step 2. Understand the models
The `llama model` command helps you explore a model's interface.
```
$ llama model --help
usage: llama model [-h] {template} ...

Describe llama model interfaces

options:
  -h, --help  show this help message and exit

model_subcommands:
  {template}

Example: llama model <subcommand> <options>
```
You can run `llama model template` to see all of the templates and their tokens:
```
$ llama model template
system-message-builtin-and-custom-tools
system-message-builtin-tools-only
system-message-custom-tools-only
system-message-default
assistant-message-builtin-tool-call
assistant-message-custom-tool-call
assistant-message-default
tool-message-failure
tool-message-success
user-message-default
```
And fetch an example by passing it to `--name`:
```
llama model template --name tool-message-success
<|start_header_id|>ipython<|end_header_id|>
completed
[stdout]{"results":["something something"]}[/stdout]<|eot_id|>
```
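Any of the other templates listed above can be inspected the same way, for example:
```
llama model template --name system-message-default
```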
## Step 3. Start the inference server
Once you have a model, the magic begins with inference. The `llama inference` command can help you configure and launch the Llama Stack inference server.
```
$ llama inference --help
usage: llama inference [-h] {start,configure} ...

Run inference on a llama model

options:
  -h, --help  show this help message and exit

inference_subcommands:
  {start,configure}

Example: llama inference start <options>
```
Run `llama inference configure` to set up your configuration at `~/.llama/configs/inference.yaml`. You'll set values such as the following (a rough sketch of the resulting file appears after this list):
* the directory where you stored the models you downloaded from step 1
* the model parallel size (1 for 8B models, 8 for 70B/405B)
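For orientation, here is a hypothetical sketch of what the resulting file might contain. The real schema is produced by `llama inference configure`, so the key names below are illustrative assumptions rather than the actual field names:
```
# Illustrative sketch only -- the real keys are whatever `llama inference configure` writes.
checkpoint_dir: /path/to/Meta-Llama-3.1-8B-Instruct   # directory holding the weights from Step 1 (assumed key)
model_parallel_size: 1                                # 1 for 8B models, 8 for 70B/405B (assumed key)
```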
Once you've configured the inference server, run `llama inference start`. The model will load onto the GPU, and you'll be able to send requests once you see that the server is ready.
If you want to use a different model, re-run `llama inference configure` to update the model path and `llama inference start` to launch the server again.
Run `llama inference --help` for more information.
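As a quick smoke test once the server reports that it is ready, you can send it a request over HTTP. The sketch below assumes the server listens on localhost and exposes a chat-completion style route; the port, path, and payload shape here are assumptions, so check the server's startup output and API docs for the real values:
```
# Hypothetical request -- port, route, and payload shape are assumptions, not the documented API.
curl http://localhost:5000/inference/chat_completion \
  -H "Content-Type: application/json" \
  -d '{"model": "Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
```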
## Step 4. Start the agentic system
The `llama agentic_system` command sets up the configuration file the agentic client code expects.
For example, let's run the included chat app:
```
llama agentic_system configure
mesop app/main.py
```
For more information run `llama agentic_system --help`.