Update CLI_reference

2025-12-03 09:53:45 +00:00 · 2024-08-06 22:18:02 -07:00 · 2024-08-06 22:18:02 -07:00 · cc697c59e5
commit cc697c59e5
parent e1a7aa4773
1 changed files with 60 additions and 45 deletions
--- a/docs/cli_reference.md
+++ b/docs/cli_reference.md
@@ -18,16 +18,16 @@ subcommands:

 ## Step 1. Get the models

-You first need to have models downloaded locally. 
+You first need to have models downloaded locally.

-To download any model you need the **Model Descriptor**. 
-This can be obtained by running the command 
+To download any model you need the **Model Descriptor**.
+This can be obtained by running the command

 `llama model list`

-You should see a table like this – 
+You should see a table like this –
 ```
-> llama model list 
+> llama model list

 +---------------------------------------+---------------------------------------------+----------------+----------------------------+
 | Model Descriptor                      | HuggingFace Repo                            | Context Length | Hardware Requirements      |
@@ -60,18 +60,17 @@ You should see a table like this –
 +---------------------------------------+---------------------------------------------+----------------+----------------------------+
 ```

-To download models, you can use the llama download command. 
+To download models, you can use the llama download command.

-Here is an example downnload command to get the 8B/70B Instruct model 
-you will need a meta url which can be obtained from -- 
+Here is an example download command to get the 8B/70B Instruct model. You will need META_URL which can be obtained from --
 https://llama.meta.com/docs/getting_the_models/meta/
 ```
 llama download --source meta --model-id  Meta-Llama3.1-8B-Instruct --meta-url "<META_URL>"
 llama download --source meta --model-id Meta-Llama3.1-70B-Instruct --meta-url "<META_URL>"
 ```

-You can download from HuggingFace using these commands 
-Set your environment variable HF_TOKEN or pass in --hf-token to the command to validate your access. 
+You can download from HuggingFace using these commands
+Set your environment variable HF_TOKEN or pass in --hf-token to the command to validate your access.
 You can find your token at https://huggingface.co/settings/tokens
 ```
 llama download --source huggingface --model-id  Meta-Llama3.1-8B-Instruct --hf-token <HF_TOKEN>
@@ -106,7 +105,7 @@ model_subcommands:
 Example: llama model <subcommand> <options>
 ```

-You can use the describe command to know more about a model 
+You can use the describe command to know more about a model

 ```
 $ llama model describe -m Meta-Llama3.1-8B-Instruct
@@ -178,16 +177,31 @@ completed

 These commands can help understand the model interface and how prompts / messages are formatted for various scenarios.

-#NOTE: Outputs in terminal are color printed to show speacial tokens. 
+#NOTE: Outputs in terminal are color printed to show speacial tokens.


-## Step 3: Installing and Configuring Distributions 
+## Step 3: Installing and Configuring Distributions

-A distribution is a collection of APIs that are part of the Llama Stack. Currently we support APIs for inference, safety and agentic_system ( more to be added soon ). A distribution’s behavior can be configured by defining a specification or “spec”. The specification lays out the different API “Providers” that constitute this distribution. Each “Provider” is an implementation of an API and you can group different providers to form a distribution. 
+An agentic app has several components including model inference, tool execution and system safety shields. Running all these components is made simpler (we hope!) with Llama Stack Distributions.

-Lets install, configure and start a distribution to understand more ! 
+A Distribution is simply a collection of REST API providers that are part of the Llama stack. As an example, by running a simple command `llama distribution start`, you can bring up a server serving the following endpoints, among others:
+```
+POST /inference/chat_completion
+POST /inference/completion
+POST /safety/run_shields
+POST /agentic_system/create
+POST /agentic_system/session/create
+POST /agentic_system/turn/create
+POST /agentic_system/delete
+```

-Let’s start with listing available distributions 
+The agentic app can now simply point to this server to execute all its needed components.
+
+A distribution’s behavior can be configured by defining a specification or “spec”. This specification lays out the different API “Providers” that constitute this distribution.
+
+Lets install, configure and start a distribution to understand more !
+
+Let’s start with listing available distributions
 ```
 $ llama distribution list

@@ -215,53 +229,48 @@ $ llama distribution list

 ```

-As you can see above, each “spec” details the “providers” that make up that spec. For eg. The inline uses the “meta-reference” provider for inference while the ollama-inline relies on a different provider ( ollama ) for inference. 
+As you can see above, each “spec” details the “providers” that make up that spec. For eg. The inline uses the “meta-reference” provider for inference while the ollama-inline relies on a different provider ( ollama ) for inference.

-Lets install the fully local impl of the llama-stack – aka inline. 
+Lets install the fully local implementation of the llama-stack – named `inline` above.

-To install a distro, we run a simple command providing 2 inputs – 
- Spec Id of the distribution that we want to install ( as obtained from the list command ) 
- A custom name for the specific instance of the distribution that we are going to install.
+To install a distro, we run a simple command providing 2 inputs –
+- **Spec Id** of the distribution that we want to install ( as obtained from the list command )
+- A **Name** by which this installation will be known locally.

 ```
 llama distribution install --spec inline --name inline_llama_8b
 ```

-This will create a new conda environment (name can be passed optionally) and install dependencies (via pip) as required by the distro. 
+This will create a new conda environment (name can be passed optionally) and install dependencies (via pip) as required by the distro.

-Once it runs successfully , you should see some outputs in the form 
+Once it runs successfully , you should see some outputs in the form

 ```
 $ llama distribution install --spec inline --name inline_llama_8b
 ....
-.... 
+....
 Successfully installed cfgv-3.4.0 distlib-0.3.8 identify-2.6.0 libcst-1.4.0 llama_toolchain-0.0.2 moreorless-0.4.0 nodeenv-1.9.1 pre-commit-3.8.0 stdlibs-2024.5.15 toml-0.10.2 tomlkit-0.13.0 trailrunner-1.4.0 ufmt-2.7.0 usort-1.0.8 virtualenv-20.26.3

 Distribution `inline_llama_8b` (with spec inline) has been installed successfully!
-
-Update your conda environment and configure this distribution by running:
-
-conda deactivate && conda activate inline_llama_8b
-llama distribution configure --name inline_llama_8b
 ```

-Next step is to configure the distribution that you just installed. We provide a simple CLI tool to enable simple configuration. 
-This command will walk you through the configuration process. 
-It will ask for some details like model name, paths to models, etc. 
+Next step is to configure the distribution that you just installed. We provide a simple CLI tool to enable simple configuration.
+This command will walk you through the configuration process.
+It will ask for some details like model name, paths to models, etc.

 NOTE: You will have to download the models if not done already. Follow instructions here on how to download using the llama cli
 ```
 llama distribution configure --name inline_llama_8b
 ```

-Here is an example screenshot of how the cli will guide you to fill the configuration 
-``` 
+Here is an example screenshot of how the cli will guide you to fill the configuration
+```
 $ llama distribution configure --name inline_llama_8b

 Configuring API surface: inference
 Enter value for model (required): Meta-Llama3.1-8B-Instruct
-Enter value for quantization (optional): 
-Enter value for torch_seed (optional): 
+Enter value for quantization (optional):
+Enter value for torch_seed (optional):
 Enter value for max_seq_len (required): 4096
 Enter value for max_batch_size (default: 1): 1
 Configuring API surface: safety
@@ -272,8 +281,8 @@ Configuring API surface: agentic_system
 YAML configuration has been written to ~/.llama/distributions/inline0/config.yaml
 ```

-As you can see, we did basic configuration above and configured inference to run on model Meta-Llama3.1-8B-Instruct ( obtained from the llama model list command ). 
-For this initial setup we did not set up safety. 
+As you can see, we did basic configuration above and configured inference to run on model Meta-Llama3.1-8B-Instruct ( obtained from the llama model list command ).
+For this initial setup we did not set up safety.

 For how these configurations are stored as yaml, checkout the file printed at the end of the configuration.

@@ -281,12 +290,12 @@ For how these configurations are stored as yaml, checkout the file printed at th

 Now let’s start the distribution using the cli.
 ```
-llama distribution start --name inline_llama_8b
+llama distribution start --name inline_llama_8b --port 5000
 ```
 You should see the distribution start and print the APIs that it is supporting,

 ```
-$ llama distribution start --name inline_llama_8b
+$ llama distribution start --name inline_llama_8b --port 5000

 > initializing model parallel with size 1
 > initializing ddp with size 1
@@ -316,20 +325,20 @@ INFO:     Application startup complete.
 INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
 ```

-Lets test with a client 
+Lets test with a client

 ```
-cd /path/to/llama-toolchain 
+cd /path/to/llama-toolchain
 conda activate <env-for-distro> # ( Eg. local_inline in above example )

 python -m  llama_toolchain.inference.client localhost 5000
 ```

-This will run the chat completion client and query the distribution’s /inference/chat_completion API. 
+This will run the chat completion client and query the distribution’s /inference/chat_completion API.

-Here is an example output – 
+Here is an example output –
 ```
-python -m  llama_toolchain.inference.client localhost 5000 
+python -m  llama_toolchain.inference.client localhost 5000

 Initializing client for http://localhost:5000
 User>hello world, troll me in two-paragraphs about 42
@@ -338,3 +347,9 @@ Assistant> You think you're so smart, don't you? You think you can just waltz in

 You know what's even more hilarious? People like you who think they can just Google "42" and suddenly become experts on the subject. Newsflash: you're not a supercomputer, you're just a human being with a fragile ego and a penchant for thinking you're smarter than you actually are. 42 is just a number, a meaningless collection of digits that holds no significance whatsoever. So go ahead, keep thinking you're so clever, but deep down, you're just a pawn in the grand game of life, and 42 is just a silly little number that's been used to make you feel like you're part of something bigger than yourself. Ha!
 ```
+
+Similarly you can test safety (if you configured llama-guard and/or prompt-guard shields) by:
+
+```
+python -m llama_toolchain.safety.client localhost 5000
+```
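
Note on the endpoints added in Step 3: the change lists seven `POST` routes and starts the distribution with `--port 5000`. The sketch below only shows how those routes map to full URLs on such a server. It is an illustration under stated assumptions, not part of `llama_toolchain`: the names `ENDPOINTS` and `endpoint_url` are hypothetical, the `http` scheme and `localhost` host are taken from the example client output, and no request is sent because the diff does not show the request bodies.

```python
from urllib.parse import urlunsplit

# Endpoint paths copied from the list this change adds to Step 3.
ENDPOINTS = (
    "/inference/chat_completion",
    "/inference/completion",
    "/safety/run_shields",
    "/agentic_system/create",
    "/agentic_system/session/create",
    "/agentic_system/turn/create",
    "/agentic_system/delete",
)


def endpoint_url(path, host="localhost", port=5000):
    """Return the full URL for one of the listed endpoints.

    Hypothetical helper: host and port default to the values used in the
    `llama distribution start --port 5000` example.
    """
    if path not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {path}")
    # urlunsplit takes (scheme, netloc, path, query, fragment).
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


if __name__ == "__main__":
    # Print every route as the server would expose it.
    for path in ENDPOINTS:
        print("POST", endpoint_url(path))
```

The bundled clients (`python -m llama_toolchain.inference.client localhost 5000` and the safety client added at the end of this change) remain the documented way to exercise the server. The helper above is only meant to make the host, port and path relationship explicit.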