# Llama CLI Reference

The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
## Subcommands

1. `download`: Supports downloading models from Meta or Hugging Face.
2. `model`: Lists available models and their properties.
3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this here.
## Sample Usage

```
llama --help
usage: llama [-h] {download,model,stack,api} ...

Welcome to the Llama CLI

options:
  -h, --help  show this help message and exit

subcommands:
  {download,model,stack,api}
```
## Step 1. Get the models
You first need to have models downloaded locally.
To download any model you need the Model Descriptor. This can be obtained by running the command:

```
llama model list
```
You should see a table like this:
```
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Model Descriptor                      | HuggingFace Repo                            | Context Length | Hardware Requirements      |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-8B                      | meta-llama/Meta-Llama-3.1-8B                | 128K           | 1 GPU, each >= 20GB VRAM   |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-70B                     | meta-llama/Meta-Llama-3.1-70B               | 128K           | 8 GPUs, each >= 20GB VRAM  |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B:bf16-mp8           |                                             | 128K           | 8 GPUs, each >= 120GB VRAM |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B                    | meta-llama/Meta-Llama-3.1-405B-FP8          | 128K           | 8 GPUs, each >= 70GB VRAM  |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B:bf16-mp16          | meta-llama/Meta-Llama-3.1-405B              | 128K           | 16 GPUs, each >= 70GB VRAM |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-8B-Instruct             | meta-llama/Meta-Llama-3.1-8B-Instruct       | 128K           | 1 GPU, each >= 20GB VRAM   |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-70B-Instruct            | meta-llama/Meta-Llama-3.1-70B-Instruct      | 128K           | 8 GPUs, each >= 20GB VRAM  |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B-Instruct:bf16-mp8  |                                             | 128K           | 8 GPUs, each >= 120GB VRAM |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B-Instruct           | meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 | 128K           | 8 GPUs, each >= 70GB VRAM  |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Meta-Llama3.1-405B-Instruct:bf16-mp16 | meta-llama/Meta-Llama-3.1-405B-Instruct     | 128K           | 16 GPUs, each >= 70GB VRAM |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Llama-Guard-3-8B                      | meta-llama/Llama-Guard-3-8B                 | 128K           | 1 GPU, each >= 20GB VRAM   |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Llama-Guard-3-8B:int8-mp1             | meta-llama/Llama-Guard-3-8B-INT8            | 128K           | 1 GPU, each >= 10GB VRAM   |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
| Prompt-Guard-86M                      | meta-llama/Prompt-Guard-86M                 | 128K           | 1 GPU, each >= 1GB VRAM    |
+---------------------------------------+---------------------------------------------+----------------+----------------------------+
```
To download models, you can use the `llama download` command.

Here is an example download command to get the 8B/70B Instruct models. You will need the `META_URL`, which can be obtained from here:
```
llama download --source meta --model-id Meta-Llama3.1-8B-Instruct --meta-url <META_URL>
llama download --source meta --model-id Meta-Llama3.1-70B-Instruct --meta-url <META_URL>
```
You can download from Hugging Face using the commands below. Set your environment variable `HF_TOKEN` or pass in `--hf-token` to the command to validate your access. You can find your token here.
```
llama download --source huggingface --model-id Meta-Llama3.1-8B-Instruct --hf-token <HF_TOKEN>
llama download --source huggingface --model-id Meta-Llama3.1-70B-Instruct --hf-token <HF_TOKEN>
```
You can also download the safety models from Hugging Face:
```
llama download --source huggingface --model-id Llama-Guard-3-8B --ignore-patterns *original*
llama download --source huggingface --model-id Prompt-Guard-86M --ignore-patterns *original*
```
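If you prefer to script the Hugging Face downloads rather than use the CLI, a minimal sketch with the `huggingface_hub` library (a separate dependency, not part of `llama-toolchain`) could look like the following. The repo ID comes from the table above; everything else is illustrative.

```python
import os
from huggingface_hub import snapshot_download

# Illustrative only -- `llama download` is the supported path. This mirrors
# the --ignore-patterns behaviour shown above for the safety models.
path = snapshot_download(
    repo_id="meta-llama/Llama-Guard-3-8B",
    ignore_patterns=["*original*"],    # skip the original-format checkpoint files
    token=os.environ.get("HF_TOKEN"),  # or pass your token explicitly
)
print(f"Downloaded to {path}")
```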
## Step 2: Understand the models

The `llama model` command helps you explore the model's interface.
### 2.1 Subcommands

1. `download`: Download the model from different sources (meta, huggingface).
2. `list`: Lists all the models available for download, along with the hardware requirements to deploy them.
3. `template`: <TODO: What is a template?>
4. `describe`: Describes all the properties of the model.
### 2.2 Sample Usage

```
llama model <subcommand> <options>
```

```
llama model --help
usage: llama model [-h] {download,list,template,describe} ...

Work with llama models

options:
  -h, --help  show this help message and exit

model_subcommands:
  {download,list,template,describe}
```
You can use the `describe` command to know more about a model:
```
llama model describe -m Meta-Llama3.1-8B-Instruct
```
### 2.3 Describe

```
+-----------------------------+---------------------------------------+
| Model                       | Meta-Llama3.1-8B-Instruct             |
+-----------------------------+---------------------------------------+
| HuggingFace ID              | meta-llama/Meta-Llama-3.1-8B-Instruct |
+-----------------------------+---------------------------------------+
| Description                 | Llama 3.1 8b instruct model           |
+-----------------------------+---------------------------------------+
| Context Length              | 128K tokens                           |
+-----------------------------+---------------------------------------+
| Weights format              | bf16                                  |
+-----------------------------+---------------------------------------+
| Model params.json           | {                                     |
|                             |     "dim": 4096,                      |
|                             |     "n_layers": 32,                   |
|                             |     "n_heads": 32,                    |
|                             |     "n_kv_heads": 8,                  |
|                             |     "vocab_size": 128256,             |
|                             |     "ffn_dim_multiplier": 1.3,        |
|                             |     "multiple_of": 1024,              |
|                             |     "norm_eps": 1e-05,                |
|                             |     "rope_theta": 500000.0,           |
|                             |     "use_scaled_rope": true           |
|                             | }                                     |
+-----------------------------+---------------------------------------+
| Recommended sampling params | {                                     |
|                             |     "strategy": "top_p",              |
|                             |     "temperature": 1.0,               |
|                             |     "top_p": 0.9,                     |
|                             |     "top_k": 0                        |
|                             | }                                     |
+-----------------------------+---------------------------------------+
```
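The "Recommended sampling params" block describes nucleus (top-p) sampling with `temperature = 1.0` and `top_p = 0.9`. As a rough, generic illustration of what those numbers mean in practice (this is not code from `llama-toolchain`), here is how they would be applied to a vector of logits:

```python
import numpy as np

def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=None):
    """Generic nucleus-sampling sketch using the recommended params above."""
    rng = rng or np.random.default_rng()
    # Temperature-scaled softmax over the vocabulary.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    order = np.argsort(-probs)
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cutoff]
    # Renormalize and sample within the nucleus.
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```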
### 2.4 Template

You can even run `llama model template` to see all of the templates and their tokens:
```
llama model template
```

```
+-----------+---------------------------------+
| Role      | Template Name                   |
+-----------+---------------------------------+
| user      | user-default                    |
| assistant | assistant-builtin-tool-call     |
| assistant | assistant-custom-tool-call      |
| assistant | assistant-default               |
| system    | system-builtin-and-custom-tools |
| system    | system-builtin-tools-only       |
| system    | system-custom-tools-only        |
| system    | system-default                  |
| tool      | tool-success                    |
| tool      | tool-failure                    |
+-----------+---------------------------------+
```
And fetch an example by passing it to `--name`:
```
llama model template --name tool-success
```

```
+----------+----------------------------------------------------------------+
| Name     | tool-success                                                   |
+----------+----------------------------------------------------------------+
| Template | <|start_header_id|>ipython<|end_header_id|>                    |
|          |                                                                |
|          | completed                                                      |
|          | [stdout]{"results":["something                                 |
|          | something"]}[/stdout]<|eot_id|>                                |
|          |                                                                |
+----------+----------------------------------------------------------------+
| Notes    | Note ipython header and [stdout]                               |
+----------+----------------------------------------------------------------+
```
Or:
```
llama model template --name system-builtin-tools-only
```

```
+----------+--------------------------------------------+
| Name     | system-builtin-tools-only                  |
+----------+--------------------------------------------+
| Template | <|start_header_id|>system<|end_header_id|> |
|          |                                            |
|          | Environment: ipython                       |
|          | Tools: brave_search, wolfram_alpha         |
|          |                                            |
|          | Cutting Knowledge Date: December 2023      |
|          | Today Date: 21 August 2024                 |
|          | <|eot_id|>                                 |
|          |                                            |
+----------+--------------------------------------------+
| Notes    |                                            |
+----------+--------------------------------------------+
```
These commands can help you understand the model interface and how prompts / messages are formatted for various scenarios.

NOTE: Outputs in the terminal are color-printed to show special tokens.
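To make the structure of these templates concrete, here is a rough sketch that assembles the `system-builtin-tools-only` prompt shown above as a plain Python string. The real encoding is handled by the toolchain's prompt-format code (e.g. `ChatFormat` in `llama_models`); this is only an illustration of the special tokens involved:

```python
from datetime import date

def builtin_tools_system_prompt(tools=("brave_search", "wolfram_alpha"), today=None):
    # Mirrors the `system-builtin-tools-only` template printed above.
    today = today or date.today().strftime("%d %B %Y")
    return (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        "Environment: ipython\n"
        f"Tools: {', '.join(tools)}\n\n"
        "Cutting Knowledge Date: December 2023\n"
        f"Today Date: {today}\n"
        "<|eot_id|>"
    )

print(builtin_tools_system_prompt())
```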
## Step 3: Building, Configuring and Running Llama Stack servers
An agentic app has several components including model inference, tool execution and system safety shields. Running all these components is made simpler (we hope!) with Llama Stack Distributions.
The Llama Stack is a collection of REST APIs. An API is implemented by a Provider. An assembly of Providers together provides the implementation for the Stack -- this package is called a Distribution.
As an example, by running a simple command `llama stack run`, you can bring up a server serving the following endpoints, among others:
```
POST /inference/chat_completion
POST /inference/completion
POST /safety/run_shields
POST /agentic_system/create
POST /agentic_system/session/create
POST /agentic_system/turn/create
POST /agentic_system/delete
```
The agentic app can now simply point to this server to execute all its needed components.
Let's build, configure and start a Llama Stack server specified via a "Distribution ID" to understand more!
Let’s start with listing available distributions:
```
llama stack list-distributions
```

```
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| Distribution ID                | Providers                             | Description                                                          |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| local                          | {                                     | Use code from `llama_toolchain` itself to serve all llama stack APIs |
|                                |   "inference": "meta-reference",      |                                                                      |
|                                |   "memory": "meta-reference-faiss",   |                                                                      |
|                                |   "safety": "meta-reference",         |                                                                      |
|                                |   "agentic_system": "meta-reference"  |                                                                      |
|                                | }                                     |                                                                      |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| remote                         | {                                     | Point to remote services for all llama stack APIs                    |
|                                |   "inference": "remote",              |                                                                      |
|                                |   "safety": "remote",                 |                                                                      |
|                                |   "agentic_system": "remote",         |                                                                      |
|                                |   "memory": "remote"                  |                                                                      |
|                                | }                                     |                                                                      |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| local-ollama                   | {                                     | Like local, but use ollama for running LLM inference                 |
|                                |   "inference": "remote::ollama",      |                                                                      |
|                                |   "safety": "meta-reference",         |                                                                      |
|                                |   "agentic_system": "meta-reference", |                                                                      |
|                                |   "memory": "meta-reference-faiss"    |                                                                      |
|                                | }                                     |                                                                      |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| local-plus-fireworks-inference | {                                     | Use Fireworks.ai for running LLM inference                           |
|                                |   "inference": "remote::fireworks",   |                                                                      |
|                                |   "safety": "meta-reference",         |                                                                      |
|                                |   "agentic_system": "meta-reference", |                                                                      |
|                                |   "memory": "meta-reference-faiss"    |                                                                      |
|                                | }                                     |                                                                      |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
| local-plus-together-inference  | {                                     | Use Together.ai for running LLM inference                            |
|                                |   "inference": "remote::together",    |                                                                      |
|                                |   "safety": "meta-reference",         |                                                                      |
|                                |   "agentic_system": "meta-reference", |                                                                      |
|                                |   "memory": "meta-reference-faiss"    |                                                                      |
|                                | }                                     |                                                                      |
+--------------------------------+---------------------------------------+----------------------------------------------------------------------+
```
As you can see above, each "distribution" details the "providers" it is composed of. For example, `local` uses the "meta-reference" provider for inference, while `local-ollama` relies on a different provider (Ollama) for inference. Similarly, you can use Fireworks or Together.AI for running inference as well.
To install a distribution, we run a simple command providing 2 inputs:

- The Distribution ID of the distribution that we want to install (as obtained from the `list-distributions` command)
- A Name for the specific build and configuration of this distribution
Let's imagine you are working with an 8B-Instruct model. The following command will build a package (in the form of a Conda environment) and configure it. As part of the configuration, you will be asked for some inputs (model_id, max_seq_len, etc.). Since we are working with an 8B model, we will name our build `8b-instruct` to help us remember the config.
```
llama stack build local --name 8b-instruct
```
Once it runs successfully, you should see some output of the form:
```
$ llama stack build local --name 8b-instruct
....
....
Successfully installed cfgv-3.4.0 distlib-0.3.8 identify-2.6.0 libcst-1.4.0 llama_toolchain-0.0.2 moreorless-0.4.0 nodeenv-1.9.1 pre-commit-3.8.0 stdlibs-2024.5.15 toml-0.10.2 tomlkit-0.13.0 trailrunner-1.4.0 ufmt-2.7.0 usort-1.0.8 virtualenv-20.26.3

Successfully setup conda environment. Configuring build...

...
...

YAML configuration has been written to ~/.llama/builds/local/conda/8b-instruct.yaml

You can re-configure this distribution by running:
llama stack configure local --name 8b-instruct
```
Here is an example run of how the CLI will guide you to fill in the configuration:
```
$ llama stack configure local --name 8b-instruct

Configuring API: inference (meta-reference)
Enter value for model (required): Meta-Llama3.1-8B-Instruct
Enter value for quantization (optional):
Enter value for torch_seed (optional):
Enter value for max_seq_len (required): 4096
Enter value for max_batch_size (default: 1): 1

Configuring API: safety (meta-reference)
Do you want to configure llama_guard_shield? (y/n): y
Entering sub-configuration for llama_guard_shield:
Enter value for model (required): Llama-Guard-3-8B
Enter value for excluded_categories (required): []
Enter value for disable_input_check (default: False):
Enter value for disable_output_check (default: False):

Do you want to configure prompt_guard_shield? (y/n): y
Entering sub-configuration for prompt_guard_shield:
Enter value for model (required): Prompt-Guard-86M
...
...
YAML configuration has been written to ~/.llama/builds/local/conda/8b-instruct.yaml
```
As you can see, we did basic configuration above and configured:

- inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`)
- the Llama Guard safety shield with model `Llama-Guard-3-8B`
- the Prompt Guard safety shield with model `Prompt-Guard-86M`
To see how these configurations are stored as YAML, check out the file printed at the end of the configuration.

Note that all configurations as well as models are stored in `~/.llama`.
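For example, a quick way to see which build configurations exist on your machine is to list the YAML files under `~/.llama/builds` (a small convenience sketch, not a `llama` subcommand):

```python
from pathlib import Path

# Build configurations written by `llama stack build` / `llama stack configure`
# end up under ~/.llama/builds/<distribution>/<build type>/<name>.yaml,
# e.g. ~/.llama/builds/local/conda/8b-instruct.yaml as shown above.
builds_dir = Path.home() / ".llama" / "builds"
for cfg in sorted(builds_dir.rglob("*.yaml")):
    print(cfg.relative_to(builds_dir))
```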
## Step 4: Starting a Llama Stack Distribution and Testing it

Now let's start the Llama Stack server.
You need the YAML configuration file which was written out at the end of the `llama stack build` step.

```
llama stack run local --name 8b-instruct --port 5000
```
You should see the Stack server start and print the APIs that it is serving:
```
$ llama stack run local --name 8b-instruct --port 5000

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 19.28 seconds
NCCL version 2.20.5+cuda12.4
Finished model load YES READY
Serving POST /inference/batch_chat_completion
Serving POST /inference/batch_completion
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /safety/run_shields
Serving POST /agentic_system/memory_bank/attach
Serving POST /agentic_system/create
Serving POST /agentic_system/session/create
Serving POST /agentic_system/turn/create
Serving POST /agentic_system/delete
Serving POST /agentic_system/session/delete
Serving POST /agentic_system/memory_bank/detach
Serving POST /agentic_system/session/get
Serving POST /agentic_system/step/get
Serving POST /agentic_system/turn/get
Listening on :::5000
INFO: Started server process [453333]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
> Note: Configuration is in `~/.llama/builds/local/conda/8b-instruct.yaml`. Feel free to increase `max_seq_len`.
> Important: The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
This server is running a Llama model locally.
Let's test it with a client.
```
cd /path/to/llama-stack
conda activate <env>  # any environment containing the llama-toolchain pip package will work

python -m llama_toolchain.inference.client localhost 5000
```
This will run the chat completion client and query the distribution's `/inference/chat_completion` API.
Here is an example output:
```
Initializing client for http://localhost:5000
User>hello world, troll me in two-paragraphs about 42

Assistant> You think you're so smart, don't you? You think you can just waltz in here and ask about 42, like it's some kind of trivial matter. Well, let me tell you, 42 is not just a number, it's a way of life. It's the answer to the ultimate question of life, the universe, and everything, according to Douglas Adams' magnum opus, "The Hitchhiker's Guide to the Galaxy". But do you know what's even more interesting about 42? It's that it's not actually the answer to anything, it's just a number that some guy made up to sound profound.

You know what's even more hilarious? People like you who think they can just Google "42" and suddenly become experts on the subject. Newsflash: you're not a supercomputer, you're just a human being with a fragile ego and a penchant for thinking you're smarter than you actually are. 42 is just a number, a meaningless collection of digits that holds no significance whatsoever. So go ahead, keep thinking you're so clever, but deep down, you're just a pawn in the grand game of life, and 42 is just a silly little number that's been used to make you feel like you're part of something bigger than yourself. Ha!
```
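If you'd rather hit the endpoint directly instead of going through the bundled client, a minimal `requests`-based sketch is shown below. Treat the payload fields (`model`, `messages`, `stream`) as assumptions to verify against the inference API spec; they are not taken from this document.

```python
import requests

# Hypothetical request body -- consult the /inference/chat_completion schema
# in the Llama Stack API spec for the authoritative field names.
payload = {
    "model": "Meta-Llama3.1-8B-Instruct",
    "messages": [
        {"role": "user", "content": "hello world, troll me in two-paragraphs about 42"},
    ],
    "stream": False,
}

resp = requests.post("http://localhost:5000/inference/chat_completion", json=payload)
resp.raise_for_status()
print(resp.json())
```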
Similarly, you can test safety (if you configured the llama-guard and/or prompt-guard shields) by running:

```
python -m llama_toolchain.safety.client localhost 5000
```