# Getting Started
The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
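If you have not installed it yet, one option is to install the package with `pip` and confirm the CLI is on your path (the exact install method for your environment may differ):

```bash
# Install the toolchain package, then verify the `llama` CLI is available
pip install llama-toolchain
llama --help
```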
This guide allows you to quickly get started with building and running a Llama Stack server in < 5 minutes!

In the following steps, we'll be working with an 8B-Instruct model. Since we are working with an 8B model, we will name our build `8b-instruct` to help us remember the config.
## Quick Cheatsheet
- A quick three-command sequence to build and start a Llama Stack server using our Meta Reference implementation for all API endpoints.
**`llama stack build`**

```
llama stack build --config ./llama_toolchain/configs/distributions/conda/local-conda-example-build.yaml --name my-local-llama-stack
...
...
Build spec configuration saved at ~/.llama/distributions/conda/my-local-llama-stack-build.yaml
```
**`llama stack configure`**

```
llama stack configure ~/.llama/distributions/conda/my-local-llama-stack-build.yaml
Configuring API: inference (meta-reference)
Enter value for model (default: Meta-Llama3.1-8B-Instruct) (required):
Enter value for quantization (optional):
Enter value for torch_seed (optional):
Enter value for max_seq_len (required): 4096
Enter value for max_batch_size (default: 1) (required):
Configuring API: memory (meta-reference-faiss)
Configuring API: safety (meta-reference)
Do you want to configure llama_guard_shield? (y/n): n
Do you want to configure prompt_guard_shield? (y/n): n
Configuring API: agentic_system (meta-reference)
Enter value for brave_search_api_key (optional):
Enter value for bing_search_api_key (optional):
Enter value for wolfram_api_key (optional):
Configuring API: telemetry (console)
YAML configuration has been written to ~/.llama/builds/conda/my-local-llama-stack-run.yaml
```
**`llama stack run`**

```
llama stack run ~/.llama/builds/conda/my-local-llama-stack-run.yaml
...
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
...
Finished model load YES READY
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /inference/embeddings
Serving POST /memory_banks/create
Serving DELETE /memory_bank/documents/delete
Serving DELETE /memory_banks/drop
Serving GET /memory_bank/documents/get
Serving GET /memory_banks/get
Serving POST /memory_bank/insert
Serving GET /memory_banks/list
Serving POST /memory_bank/query
Serving POST /memory_bank/update
Serving POST /safety/run_shields
Serving POST /agentic_system/create
Serving POST /agentic_system/session/create
Serving POST /agentic_system/turn/create
Serving POST /agentic_system/delete
Serving POST /agentic_system/session/delete
Serving POST /agentic_system/session/get
Serving POST /agentic_system/step/get
Serving POST /agentic_system/turn/get
Serving GET /telemetry/get_trace
Serving POST /telemetry/log_event
Listening on :::5000
INFO: Started server process [587053]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
## Step 1. Build
We will start by building our distribution (in the form of a Conda environment or Docker image). In this step, we will specify:

- `name`: the name for our distribution (e.g. `8b-instruct`)
- `image_type`: our build image type (`conda` | `docker`)
- `distribution_spec`: our distribution specs for specifying API providers
  - `distribution_type`: a unique name to identify our distribution. The available distributions can be found in the `llama_toolchain/configs/distributions/distribution_registry` folder in the form of YAML files. You can run `llama stack list-distributions` to see the available distributions.
  - `description`: a short description of the configurations for the distribution
  - `providers`: specifies the underlying implementation for serving each API endpoint
- `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of a Docker image or a Conda environment.
### Build a local distribution with conda
The following command and specifications allow you to get started with building.

```bash
llama stack build
```

You will be prompted to enter config specifications.
```
$ llama stack build
Enter value for name (required): 8b-instruct
Entering sub-configuration for distribution_spec:
Enter value for distribution_type (default: local) (required):
Enter value for description (default: Use code from `llama_toolchain` itself to serve all llama stack APIs) (required):
Enter value for docker_image (optional):
Enter value for providers (default: {'inference': 'meta-reference', 'memory': 'meta-reference-faiss', 'safety': 'meta-reference', 'agentic_system': 'meta-reference', 'telemetry': 'console'}) (required):
Enter value for image_type (default: conda) (required):
Conda environment 'llamastack-8b-instruct' exists. Checking Python version...
Build spec configuration saved at ~/.llama/distributions/conda/8b-instruct-build.yaml
```
After this step is complete, a file named `8b-instruct-build.yaml` will be generated and saved at `~/.llama/distributions/conda/8b-instruct-build.yaml`.

The file will have the following contents:
```
$ cat ~/.llama/distributions/conda/8b-instruct-build.yaml

name: 8b-instruct
distribution_spec:
  distribution_type: local
  description: Use code from `llama_toolchain` itself to serve all llama stack APIs
  docker_image: null
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: conda
```
You may edit the `8b-instruct-build.yaml` file and re-run the `llama stack build` command to re-build and update the distribution.

```bash
llama stack build --config ~/.llama/distributions/conda/8b-instruct-build.yaml
```
### How to build a distribution with different API providers using configs
To specify a different API provider, we can change the `distribution_spec` in our `<name>-build.yaml` config. For example, the following build spec allows you to build a distribution using TGI as the inference API provider.
```
$ cat ./llama_toolchain/configs/distributions/conda/local-tgi-conda-example-build.yaml

name: local-tgi-conda-example
distribution_spec:
  distribution_type: local-plus-tgi-inference
  description: Use TGI (local or with Hugging Face Inference Endpoints for running LLM inference. When using HF Inference Endpoints, you must provide the name of the endpoint).
  docker_image: null
  providers:
    inference: remote::tgi
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: conda
```
The following command allows you to build a distribution with TGI as the inference API provider, with the name `tgi`.

```bash
llama stack build --config ./llama_toolchain/configs/distributions/conda/local-tgi-conda-example-build.yaml --name tgi
```
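Once this build finishes, you can configure it just like any other build (see Step 2 below). The path in this sketch is inferred from the `<name>-build.yaml` naming pattern shown earlier and may differ if your build output reports another location:

```bash
# Configure the TGI-backed build (path inferred from the "<name>-build.yaml" pattern above)
llama stack configure ~/.llama/distributions/conda/tgi-build.yaml
```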
We provide some example build configs to help you get started building with different API providers.
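For instance, you can browse the bundled examples in a source checkout; the exact filenames vary by release:

```bash
# List the example build configs shipped with the toolchain source
ls ./llama_toolchain/configs/distributions/conda/
ls ./llama_toolchain/configs/distributions/docker/
```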
### How to build a distribution with a Docker image
To build a docker image, simply change the `image_type` to `docker` in our `<name>-build.yaml` file, and run `llama stack build --config <name>-build.yaml`.
```
$ cat ./llama_toolchain/configs/distributions/docker/local-docker-example-build.yaml

name: local-docker-example
distribution_spec:
  distribution_type: local
  description: Use code from `llama_toolchain` itself to serve all llama stack APIs
  docker_image: null
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: docker
```
The following command allows you to build a Docker image with the name `docker-local`.

```
llama stack build --config ./llama_toolchain/configs/distributions/docker/local-docker-example-build.yaml --name docker-local

Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
FROM python:3.10-slim
WORKDIR /app
...
...
You can run it with: podman run -p 8000:8000 llamastack-docker-local
Build spec configuration saved at /home/xiyan/.llama/distributions/docker/docker-local-build.yaml
```
## Step 2. Configure
After our distribution is built (either in the form of a Docker image or a Conda environment), we will run the following command to configure it:

```bash
llama stack configure [<path/to/name.build.yaml> | <docker-image-name>]
```
- For `conda` environments: `<path/to/name.build.yaml>` would be the generated build spec saved from Step 1.
- For `docker` images downloaded from Dockerhub, you could also use `<docker-image-name>` as the argument, as shown in the example after this list.
  - Run `docker images` to check the list of available images on your machine.
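For example, assuming the `llamastack-docker-local` image built earlier in this guide is available locally, the flow could look like this (the image name is taken from the earlier build output and will differ for your own builds):

```bash
# Check which images are available locally, then configure one by name
docker images
llama stack configure llamastack-docker-local
```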
```
$ llama stack configure ~/.llama/distributions/conda/8b-instruct-build.yaml

Configuring API: inference (meta-reference)
Enter value for model (existing: Meta-Llama3.1-8B-Instruct) (required):
Enter value for quantization (optional):
Enter value for torch_seed (optional):
Enter value for max_seq_len (existing: 4096) (required):
Enter value for max_batch_size (existing: 1) (required):
Configuring API: memory (meta-reference-faiss)
Configuring API: safety (meta-reference)
Do you want to configure llama_guard_shield? (y/n): y
Entering sub-configuration for llama_guard_shield:
Enter value for model (default: Llama-Guard-3-8B) (required):
Enter value for excluded_categories (default: []) (required):
Enter value for disable_input_check (default: False) (required):
Enter value for disable_output_check (default: False) (required):
Do you want to configure prompt_guard_shield? (y/n): y
Entering sub-configuration for prompt_guard_shield:
Enter value for model (default: Prompt-Guard-86M) (required):
Configuring API: agentic_system (meta-reference)
Enter value for brave_search_api_key (optional):
Enter value for bing_search_api_key (optional):
Enter value for wolfram_api_key (optional):
Configuring API: telemetry (console)
YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml
```
After this step is successful, you should be able to find a run configuration spec at `~/.llama/builds/conda/8b-instruct-run.yaml`. You may edit this file to change the settings.
## Step 3. Run
Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end of the `llama stack configure` step.

```bash
llama stack run ~/.llama/builds/conda/8b-instruct-run.yaml
```
You should see the Llama Stack server start and print the APIs that it is serving:
```
$ llama stack run ~/.llama/builds/conda/8b-instruct-run.yaml

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 19.28 seconds
NCCL version 2.20.5+cuda12.4
Finished model load YES READY
Serving POST /inference/batch_chat_completion
Serving POST /inference/batch_completion
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /safety/run_shields
Serving POST /agentic_system/memory_bank/attach
Serving POST /agentic_system/create
Serving POST /agentic_system/session/create
Serving POST /agentic_system/turn/create
Serving POST /agentic_system/delete
Serving POST /agentic_system/session/delete
Serving POST /agentic_system/memory_bank/detach
Serving POST /agentic_system/session/get
Serving POST /agentic_system/step/get
Serving POST /agentic_system/turn/get
Listening on :::5000
INFO: Started server process [453333]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
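With the server listening on port 5000, you can send a quick request from another terminal to confirm it is responding. This is only a sketch: the payload shape below is an assumption for illustration and may not match the `chat_completion` request schema of your installed version.

```bash
# Hypothetical smoke test; adjust the payload to your version's request schema
curl -s http://localhost:5000/inference/chat_completion \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Meta-Llama3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": false
      }'
```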
**Note:** Configuration is in `~/.llama/builds/conda/8b-instruct-run.yaml`. Feel free to increase `max_seq_len`.
**Important:** The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines. This server is running a Llama model locally.