diff --git a/docs/getting_started.md b/docs/getting_started.md
index 72b4057e3..56b2e42f4 100644
--- a/docs/getting_started.md
+++ b/docs/getting_started.md
@@ -3,285 +3,3 @@
 The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.

 This guide allows you to quickly get started with building and running a Llama Stack server in < 5 minutes!
-
-### TL;DR
-Let's imagine you are working with an 8B-Instruct model. We will name our build `8b-instruct` to help us remember the config.
-
-**llama stack build**
-```
-llama stack build
-
-Enter value for name (required): 8b-instruct
-Enter value for distribution (default: local) (required):
-Enter value for api_providers (optional):
-Enter value for image_type (default: conda) (required):
-
-...
-Build spec configuration saved at ~/.llama/distributions/local/docker/8b-instruct-build.yaml
-```
-
-**llama stack configure**
-```
-$ llama stack configure ~/.llama/distributions/local/docker/8b-instruct-build.yaml
-Configuring API: inference (meta-reference)
-Enter value for model (default: Meta-Llama3.1-8B-Instruct) (required):
-Enter value for quantization (optional):
-Enter value for torch_seed (optional):
-Enter value for max_seq_len (required): 4096
-Enter value for max_batch_size (default: 1) (required):
-
-Configuring API: memory (meta-reference-faiss)
-
-Configuring API: safety (meta-reference)
-Do you want to configure llama_guard_shield? (y/n): n
-Do you want to configure prompt_guard_shield? (y/n): n
-
-Configuring API: agentic_system (meta-reference)
-Enter value for brave_search_api_key (optional):
-Enter value for bing_search_api_key (optional):
-Enter value for wolfram_api_key (optional):
-
-Configuring API: telemetry (console)
-
-YAML configuration has been written to ~/.llama/builds/local/docker/8b-instruct-build.yaml
-```
-
-**llama stack run**
-```
-llama stack run ~/.llama/builds/local/docker/8b-instruct-build.yaml
-...
-Serving POST /inference/chat_completion
-Serving POST /inference/completion
-Serving POST /inference/embeddings
-Serving POST /memory_banks/create
-Serving DELETE /memory_bank/documents/delete
-Serving DELETE /memory_banks/drop
-Serving GET /memory_bank/documents/get
-Serving GET /memory_banks/get
-Serving POST /memory_bank/insert
-Serving GET /memory_banks/list
-Serving POST /memory_bank/query
-Serving POST /memory_bank/update
-Serving POST /safety/run_shields
-Serving POST /agentic_system/create
-Serving POST /agentic_system/session/create
-Serving POST /agentic_system/turn/create
-Serving POST /agentic_system/delete
-Serving POST /agentic_system/session/delete
-Serving POST /agentic_system/session/get
-Serving POST /agentic_system/step/get
-Serving POST /agentic_system/turn/get
-Serving GET /telemetry/get_trace
-Serving POST /telemetry/log_event
-Listening on :::5000
-INFO: Started server process [3403915]
-INFO: Waiting for application startup.
-INFO: Application startup complete.
-INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-```
-
-### Step 0. Prerequisites
-You first need to have models downloaded locally. You can see this page for instructions on downloading models.
-
-You should see a table like this:
-
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Model Descriptor                      | HuggingFace Repo                            | Context Length | Hardware Requirements      |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-8B                      | meta-llama/Meta-Llama-3.1-8B                | 128K           | 1 GPU, each >= 20GB VRAM   |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-70B                     | meta-llama/Meta-Llama-3.1-70B               | 128K           | 8 GPUs, each >= 20GB VRAM  |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-405B:bf16-mp8           |                                             | 128K           | 8 GPUs, each >= 120GB VRAM |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-405B                    | meta-llama/Meta-Llama-3.1-405B-FP8          | 128K           | 8 GPUs, each >= 70GB VRAM  |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-405B:bf16-mp16          | meta-llama/Meta-Llama-3.1-405B              | 128K           | 16 GPUs, each >= 70GB VRAM |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-8B-Instruct             | meta-llama/Meta-Llama-3.1-8B-Instruct       | 128K           | 1 GPU, each >= 20GB VRAM   |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-70B-Instruct            | meta-llama/Meta-Llama-3.1-70B-Instruct      | 128K           | 8 GPUs, each >= 20GB VRAM  |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-405B-Instruct:bf16-mp8  |                                             | 128K           | 8 GPUs, each >= 120GB VRAM |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-405B-Instruct           | meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 | 128K           | 8 GPUs, each >= 70GB VRAM  |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Meta-Llama3.1-405B-Instruct:bf16-mp16 | meta-llama/Meta-Llama-3.1-405B-Instruct     | 128K           | 16 GPUs, each >= 70GB VRAM |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Llama-Guard-3-8B                      | meta-llama/Llama-Guard-3-8B                 | 128K           | 1 GPU, each >= 20GB VRAM   |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Llama-Guard-3-8B:int8-mp1             | meta-llama/Llama-Guard-3-8B-INT8            | 128K           | 1 GPU, each >= 10GB VRAM   |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-| Prompt-Guard-86M                      | meta-llama/Prompt-Guard-86M                 | 128K           | 1 GPU, each >= 1GB VRAM    |
-+---------------------------------------+---------------------------------------------+----------------+----------------------------+
-
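-Before downloading anything, you may want to check the "Hardware Requirements" column above against your own machine. The snippet below is a minimal sketch, not part of the `llama` CLI; it only assumes a PyTorch installation with CUDA support:
-
-```
-# Minimal sketch: report per-GPU VRAM so it can be compared against the
-# "Hardware Requirements" column above. Assumes PyTorch with CUDA support.
-import torch
-
-if not torch.cuda.is_available():
-    print("No CUDA devices detected; the local distribution requires CUDA GPUs.")
-else:
-    for i in range(torch.cuda.device_count()):
-        props = torch.cuda.get_device_properties(i)
-        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB VRAM")
-```
-
-For example, Meta-Llama3.1-8B-Instruct needs a single GPU reporting at least 20GB of VRAM.
-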
-
-To download models, you can use the `llama download` command.
-
-Here is an example download command to get the 8B/70B Instruct models. You will need the META_URL, which can be obtained from [here](https://llama.meta.com/docs/getting_the_models/meta/)
-```
-llama download --source meta --model-id Meta-Llama3.1-8B-Instruct --meta-url
-```
-```
-llama download --source meta --model-id Meta-Llama3.1-70B-Instruct --meta-url
-```
-
-### Step 1. Build
-
-##### Build conda
-Let's imagine you are working with an 8B-Instruct model. The following command will build a package (in the form of a Conda environment). Since we are working with an 8B model, we will name our build `8b-instruct` to help us remember the config.
-
-```
-llama stack build
-```
-
-```
-$ llama stack build
-
-Enter value for name (required): 8b-instruct
-Enter value for distribution (default: local) (required):
-Enter value for api_providers (optional):
-Enter value for image_type (default: conda) (required):
-
-....
-....
-Successfully installed cfgv-3.4.0 distlib-0.3.8 identify-2.6.0 libcst-1.4.0 llama_toolchain-0.0.2 moreorless-0.4.0 nodeenv-1.9.1 pre-commit-3.8.0 stdlibs-2024.5.15 toml-0.10.2 tomlkit-0.13.0 trailrunner-1.4.0 ufmt-2.7.0 usort-1.0.8 virtualenv-20.26.3
-...
-...
-Build spec configuration saved at /home/xiyan/.llama/distributions/local/conda/8b-instruct-build.yaml
-```
-
-##### Build docker
-The following command will build a package (in the form of a Docker container). Since we are working with an 8B model, we will name our build `8b-instruct` to help us remember the config. We will specify the `image_type` as `docker` to build a Docker container.
-
-```
-$ llama stack build
-
-Enter value for name (required): 8b-instruct
-Enter value for distribution (default: local) (required):
-Enter value for api_providers (optional):
-Enter value for image_type (default: conda) (required): docker
-
-...
-...
-COMMIT llamastack-d
---> a319efac9f0a
-Successfully tagged localhost/llamastack-d:latest
-a319efac9f0a488d18662b90efdb863df6c1a2c9cffaea6e247e4abd90b1bfc2
-+ set +x
-Succesfully setup Podman image. Configuring build...You can run it with: podman run -p 8000:8000 llamastack-d
-Build spec configuration saved at /home/xiyan/.llama/distributions/local/docker/d-build.yaml
-```
-
-##### Re-build from config
-You can re-build the package based on the build config:
-```
-$ cat ~/.llama/distributions/local/conda/8b-instruct-build.yaml
-name: 8b-instruct
-distribution: local
-api_providers: null
-image_type: conda
-
-$ llama stack build --config ~/.llama/distributions/local/conda/8b-instruct-build.yaml
-
-Successfully setup conda environment. Configuring build...
-
-...
-...
-Build spec configuration saved at ~/.llama/distributions/local/conda/8b-instruct-build.yaml
-```
-
-### Step 2. Configure
-
-Next, you will need to configure the distribution to specify the run settings for the server. As part of the configuration, you will be asked for some inputs (model_id, max_seq_len, etc.).
-You should configure this distribution by running:
-```
-llama stack configure ~/.llama/builds/local/conda/8b-instruct-build.yaml
-```
-
-Here is an example run of how the CLI will guide you through filling in the configuration:
-
-```
-$ llama stack configure ~/.llama/builds/local/conda/8b-instruct-build.yaml
-
-Configuring API: inference (meta-reference)
-Enter value for model (required): Meta-Llama3.1-8B-Instruct
-Enter value for quantization (optional):
-Enter value for torch_seed (optional):
-Enter value for max_seq_len (required): 4096
-Enter value for max_batch_size (default: 1): 1
-Configuring API: safety (meta-reference)
-Do you want to configure llama_guard_shield? (y/n): y
-Entering sub-configuration for llama_guard_shield:
-Enter value for model (required): Llama-Guard-3-8B
-Enter value for excluded_categories (required): []
-Enter value for disable_input_check (default: False):
-Enter value for disable_output_check (default: False):
-Do you want to configure prompt_guard_shield? (y/n): y
-Entering sub-configuration for prompt_guard_shield:
-Enter value for model (required): Prompt-Guard-86M
-...
-...
-YAML configuration has been written to ~/.llama/builds/local/conda/8b-instruct.yaml
-```
-
-As you can see, we did a basic configuration above and configured:
-- inference to run on model `Meta-Llama3.1-8B-Instruct` (obtained from `llama model list`)
-- Llama Guard safety shield with model `Llama-Guard-3-8B`
-- Prompt Guard safety shield with model `Prompt-Guard-86M`
-
-To see how these configurations are stored as YAML, check out the file printed at the end of the configuration.
-
-Note that all configurations as well as models are stored in `~/.llama`.
-
-### Step 3. Run
-
-Now let's start the Llama Stack Distribution server.
-
-You need the YAML configuration file that was written out at the end of the `llama stack configure` step.
-
-```
-llama stack run ~/.llama/builds/local/conda/8b-instruct.yaml
-```
-You should see the Stack server start and print the APIs that it supports:
-
-```
-$ llama stack run ~/.llama/builds/local/conda/8b-instruct.yaml
-
-> initializing model parallel with size 1
-> initializing ddp with size 1
-> initializing pipeline with size 1
-Loaded in 19.28 seconds
-NCCL version 2.20.5+cuda12.4
-Finished model load YES READY
-Serving POST /inference/batch_chat_completion
-Serving POST /inference/batch_completion
-Serving POST /inference/chat_completion
-Serving POST /inference/completion
-Serving POST /safety/run_shields
-Serving POST /agentic_system/memory_bank/attach
-Serving POST /agentic_system/create
-Serving POST /agentic_system/session/create
-Serving POST /agentic_system/turn/create
-Serving POST /agentic_system/delete
-Serving POST /agentic_system/session/delete
-Serving POST /agentic_system/memory_bank/detach
-Serving POST /agentic_system/session/get
-Serving POST /agentic_system/step/get
-Serving POST /agentic_system/turn/get
-Listening on :::5000
-INFO: Started server process [453333]
-INFO: Waiting for application startup.
-INFO: Application startup complete.
-INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-```
-
-> [!NOTE]
-> Configuration is in `~/.llama/builds/local/conda/8b-instruct.yaml`. Feel free to increase `max_seq_len`.
-
-> [!IMPORTANT]
-> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
-
-This server is running a Llama model locally.
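-
-Once the server is up, you can exercise one of the endpoints listed above over plain HTTP. The Python sketch below is illustrative only: the port and endpoint path come from the server log, but the JSON field names (`model`, `messages`, `stream`) are assumptions and may differ across `llama-toolchain` versions.
-
-```
-# Illustrative sketch, not an official client. The endpoint path and port are
-# taken from the server log above; the JSON field names are assumptions that
-# may not match your installed llama-toolchain version.
-import requests
-
-response = requests.post(
-    "http://localhost:5000/inference/chat_completion",
-    json={
-        "model": "Meta-Llama3.1-8B-Instruct",
-        "messages": [{"role": "user", "content": "Hello! Can you write me a haiku?"}],
-        "stream": False,
-    },
-    timeout=60,
-)
-response.raise_for_status()
-print(response.json())
-```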
diff --git a/llama_toolchain/cli/stack/build.py b/llama_toolchain/cli/stack/build.py
index 0fbfe95a4..7c0627e1d 100644
--- a/llama_toolchain/cli/stack/build.py
+++ b/llama_toolchain/cli/stack/build.py
@@ -30,6 +30,12 @@ class StackBuild(Subcommand):
             help="Path to a config file to use for the build",
         )

+        self.parser.add_argument(
+            "--name",
+            type=str,
+            help="Override the name of the llama stack build",
+        )
+
     def _run_stack_build_command_from_build_config(
         self, build_config: BuildConfig
     ) -> None:
@@ -68,8 +74,9 @@ class StackBuild(Subcommand):
             except Exception as e:
                 self.parser.error(f"Could not parse config file {args.config}: {e}")
                 return
+            build_config.name = args.name if args.name else build_config.name
            self._run_stack_build_command_from_build_config(build_config)
            return

-        build_config = prompt_for_config(BuildConfig, None)
+        build_config = prompt_for_config(BuildConfig, build_config_default)
         self._run_stack_build_command_from_build_config(build_config)
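The core of the build.py change is letting a `--name` flag take precedence over the name loaded from the build config file. The standalone sketch below illustrates that override pattern only; `ExampleBuildConfig` and `apply_name_override` are hypothetical stand-ins for this example, not the actual `llama_toolchain` classes.

```
# Simplified, standalone illustration of the --name override pattern in the
# diff above. ExampleBuildConfig is a hypothetical stand-in, not the real
# BuildConfig; the field names mirror the build YAML shown in the guide.
import argparse
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExampleBuildConfig:
    name: str
    distribution: str = "local"
    image_type: str = "conda"


def apply_name_override(config: ExampleBuildConfig, cli_name: Optional[str]) -> ExampleBuildConfig:
    # Mirrors `build_config.name = args.name if args.name else build_config.name`
    if cli_name:
        config.name = cli_name
    return config


parser = argparse.ArgumentParser()
parser.add_argument("--name", type=str, default=None, help="Override the build name")
args = parser.parse_args(["--name", "8b-instruct-v2"])  # example value

config = apply_name_override(ExampleBuildConfig(name="8b-instruct"), args.name)
print(config.name)  # -> 8b-instruct-v2
```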