# Getting Started
The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-toolchain` package.
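If you have not installed it yet, one option is to install the package with `pip` and confirm the CLI is on your path (the exact install method for your environment may differ):

```bash
# Install the toolchain package, then verify the `llama` CLI is available
pip install llama-toolchain
llama --help
```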
This guide allows you to quickly get started with building and running a Llama Stack server in < 5 minutes!

In the following steps, we'll be working with an 8B-Instruct model. Since we are working with an 8B model, we will name our build `8b-instruct` to help us remember the config.
## Quick Cheatsheet
- A quick three-command sequence to build and start a Llama Stack server using our Meta Reference implementation for all API endpoints.
**`llama stack build`**

```
llama stack build --config ./llama_toolchain/configs/distributions/conda/local-conda-example-build.yaml --name my-local-llama-stack
...
...
Build spec configuration saved at ~/.llama/distributions/conda/my-local-llama-stack-build.yaml
```
**`llama stack configure`**

```
llama stack configure ~/.llama/distributions/conda/my-local-llama-stack-build.yaml
Configuring API: inference (meta-reference)
Enter value for model (default: Meta-Llama3.1-8B-Instruct) (required):
Enter value for quantization (optional):
Enter value for torch_seed (optional):
Enter value for max_seq_len (required): 4096
Enter value for max_batch_size (default: 1) (required):
Configuring API: memory (meta-reference-faiss)
Configuring API: safety (meta-reference)
Do you want to configure llama_guard_shield? (y/n): n
Do you want to configure prompt_guard_shield? (y/n): n
Configuring API: agentic_system (meta-reference)
Enter value for brave_search_api_key (optional):
Enter value for bing_search_api_key (optional):
Enter value for wolfram_api_key (optional):
Configuring API: telemetry (console)
YAML configuration has been written to ~/.llama/builds/conda/my-local-llama-stack-run.yaml
```
**`llama stack run`**

```
llama stack run ~/.llama/builds/conda/my-local-llama-stack-run.yaml
...
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
...
Finished model load YES READY
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /inference/embeddings
Serving POST /memory_banks/create
Serving DELETE /memory_bank/documents/delete
Serving DELETE /memory_banks/drop
Serving GET /memory_bank/documents/get
Serving GET /memory_banks/get
Serving POST /memory_bank/insert
Serving GET /memory_banks/list
Serving POST /memory_bank/query
Serving POST /memory_bank/update
Serving POST /safety/run_shields
Serving POST /agentic_system/create
Serving POST /agentic_system/session/create
Serving POST /agentic_system/turn/create
Serving POST /agentic_system/delete
Serving POST /agentic_system/session/delete
Serving POST /agentic_system/session/get
Serving POST /agentic_system/step/get
Serving POST /agentic_system/turn/get
Serving GET /telemetry/get_trace
Serving POST /telemetry/log_event
Listening on :::5000
INFO: Started server process [587053]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
## Step 1. Build
We will start by building our distribution (in the form of a Conda environment or Docker image). In this step, we will specify:

- `name`: the name for our distribution (e.g. `8b-instruct`)
- `image_type`: our build image type (`conda` | `docker`)
- `distribution_spec`: our distribution specs for specifying API providers
  - `distribution_type`: a unique name to identify our distribution. The available distributions can be found in the `llama_toolchain/configs/distributions/distribution_registry` folder in the form of YAML files. You can run `llama stack list-distributions` to see the available distributions.
  - `description`: a short description of the configurations for the distribution
  - `providers`: specifies the underlying implementation for serving each API endpoint
- `image_type`: `conda` | `docker` to specify whether to build the distribution in the form of a Docker image or a Conda environment.
### Build a local distribution with conda
The following command and specifications allow you to get started with building.

```bash
llama stack build
```

You will be prompted to enter config specifications.
```
$ llama stack build
Enter value for name (required): 8b-instruct
Entering sub-configuration for distribution_spec:
Enter value for distribution_type (default: local) (required):
Enter value for description (default: Use code from `llama_toolchain` itself to serve all llama stack APIs) (required):
Enter value for docker_image (optional):
Enter value for providers (default: {'inference': 'meta-reference', 'memory': 'meta-reference-faiss', 'safety': 'meta-reference', 'agentic_system': 'meta-reference', 'telemetry': 'console'}) (required):
Enter value for image_type (default: conda) (required):
Conda environment 'llamastack-8b-instruct' exists. Checking Python version...
Build spec configuration saved at ~/.llama/distributions/conda/8b-instruct-build.yaml
```
After this step is complete, a file named `8b-instruct-build.yaml` will be generated and saved at `~/.llama/distributions/conda/8b-instruct-build.yaml`.

The file will have the following contents:
```
$ cat ~/.llama/distributions/conda/8b-instruct-build.yaml

name: 8b-instruct
distribution_spec:
  distribution_type: local
  description: Use code from `llama_toolchain` itself to serve all llama stack APIs
  docker_image: null
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: conda
```
You may edit the `8b-instruct-build.yaml` file and re-run the `llama stack build` command to re-build and update the distribution.

```bash
llama stack build --config ~/.llama/distributions/conda/8b-instruct-build.yaml
```
### How to build a distribution with different API providers using configs
To specify a different API provider, we can change the `distribution_spec` in our `<name>-build.yaml` config. For example, the following build spec allows you to build a distribution using TGI as the inference API provider.
```
$ cat ./llama_toolchain/configs/distributions/conda/local-tgi-conda-example-build.yaml

name: local-tgi-conda-example
distribution_spec:
  distribution_type: local-plus-tgi-inference
  description: Use TGI (local or with Hugging Face Inference Endpoints for running LLM inference. When using HF Inference Endpoints, you must provide the name of the endpoint).
  docker_image: null
  providers:
    inference: remote::tgi
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: conda
```
The following command allows you to build a distribution with TGI as the inference API provider, with the name `tgi`.

```bash
llama stack build --config ./llama_toolchain/configs/distributions/conda/local-tgi-conda-example-build.yaml --name tgi
```
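Once this build finishes, you can configure it just like any other build (see Step 2 below). The path in this sketch is inferred from the `<name>-build.yaml` naming pattern shown earlier and may differ if your build output reports another location:

```bash
# Configure the TGI-backed build (path inferred from the "<name>-build.yaml" pattern above)
llama stack configure ~/.llama/distributions/conda/tgi-build.yaml
```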
We provide some example build configs to help you get started building with different API providers.
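For instance, you can browse the bundled examples in a source checkout; the exact filenames vary by release:

```bash
# List the example build configs shipped with the toolchain source
ls ./llama_toolchain/configs/distributions/conda/
ls ./llama_toolchain/configs/distributions/docker/
```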
### How to build a distribution with a Docker image
To build a docker image, simply change the `image_type` to `docker` in our `<name>-build.yaml` file, and run `llama stack build --config <name>-build.yaml`.
```
$ cat ./llama_toolchain/configs/distributions/docker/local-docker-example-build.yaml

name: local-docker-example
distribution_spec:
  distribution_type: local
  description: Use code from `llama_toolchain` itself to serve all llama stack APIs
  docker_image: null
  providers:
    inference: meta-reference
    memory: meta-reference-faiss
    safety: meta-reference
    agentic_system: meta-reference
    telemetry: console
image_type: docker
```
The following command allows you to build a Docker image with the name `docker-local`.

```
llama stack build --config ./llama_toolchain/configs/distributions/docker/local-docker-example-build.yaml --name docker-local

Dockerfile created successfully in /tmp/tmp.I0ifS2c46A/Dockerfile
FROM python:3.10-slim
WORKDIR /app
...
...
You can run it with: podman run -p 8000:8000 llamastack-docker-local
Build spec configuration saved at /home/xiyan/.llama/distributions/docker/docker-local-build.yaml
```
## Step 2. Configure
After our distribution is built (either in the form of a Docker image or a Conda environment), we will run the following command to configure it:

```bash
llama stack configure [<path/to/name.build.yaml> | <docker-image-name>]
```
- For `conda` environments: `<path/to/name.build.yaml>` would be the generated build spec saved from Step 1.
- For `docker` images downloaded from Dockerhub, you could also use `<docker-image-name>` as the argument, as shown in the example after this list.
  - Run `docker images` to check the list of available images on your machine.
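For example, assuming the `llamastack-docker-local` image built earlier in this guide is available locally, the flow could look like this (the image name is taken from the earlier build output and will differ for your own builds):

```bash
# Check which images are available locally, then configure one by name
docker images
llama stack configure llamastack-docker-local
```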
```
$ llama stack configure ~/.llama/distributions/conda/8b-instruct-build.yaml

Configuring API: inference (meta-reference)
Enter value for model (existing: Meta-Llama3.1-8B-Instruct) (required):
Enter value for quantization (optional):
Enter value for torch_seed (optional):
Enter value for max_seq_len (existing: 4096) (required):
Enter value for max_batch_size (existing: 1) (required):
Configuring API: memory (meta-reference-faiss)
Configuring API: safety (meta-reference)
Do you want to configure llama_guard_shield? (y/n): y
Entering sub-configuration for llama_guard_shield:
Enter value for model (default: Llama-Guard-3-8B) (required):
Enter value for excluded_categories (default: []) (required):
Enter value for disable_input_check (default: False) (required):
Enter value for disable_output_check (default: False) (required):
Do you want to configure prompt_guard_shield? (y/n): y
Entering sub-configuration for prompt_guard_shield:
Enter value for model (default: Prompt-Guard-86M) (required):
Configuring API: agentic_system (meta-reference)
Enter value for brave_search_api_key (optional):
Enter value for bing_search_api_key (optional):
Enter value for wolfram_api_key (optional):
Configuring API: telemetry (console)
YAML configuration has been written to ~/.llama/builds/conda/8b-instruct-run.yaml
```
After this step is successful, you should be able to find a run configuration spec at `~/.llama/builds/conda/8b-instruct-run.yaml`. You may edit this file to change the settings.
## Step 3. Run
Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end of the `llama stack configure` step.

```bash
llama stack run ~/.llama/builds/conda/8b-instruct-run.yaml
```
You should see the Llama Stack server start and print the APIs that it is serving:
```
$ llama stack run ~/.llama/builds/conda/8b-instruct-run.yaml

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Loaded in 19.28 seconds
NCCL version 2.20.5+cuda12.4
Finished model load YES READY
Serving POST /inference/batch_chat_completion
Serving POST /inference/batch_completion
Serving POST /inference/chat_completion
Serving POST /inference/completion
Serving POST /safety/run_shields
Serving POST /agentic_system/memory_bank/attach
Serving POST /agentic_system/create
Serving POST /agentic_system/session/create
Serving POST /agentic_system/turn/create
Serving POST /agentic_system/delete
Serving POST /agentic_system/session/delete
Serving POST /agentic_system/memory_bank/detach
Serving POST /agentic_system/session/get
Serving POST /agentic_system/step/get
Serving POST /agentic_system/turn/get
Listening on :::5000
INFO: Started server process [453333]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
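With the server listening on port 5000, you can send a quick request from another terminal to confirm it is responding. This is only a sketch: the payload shape below is an assumption for illustration and may not match the `chat_completion` request schema of your installed version.

```bash
# Hypothetical smoke test; adjust the payload to your version's request schema
curl -s http://localhost:5000/inference/chat_completion \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "Meta-Llama3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": false
      }'
```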
**Note:** Configuration is in `~/.llama/builds/conda/8b-instruct-run.yaml`. Feel free to increase `max_seq_len`.
**Important:** The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines. This server is running a Llama model locally.