llama-stack-mirror/llama_stack/distributions/meta-reference-gpu/doc_template.md

---
orphan: true
---

# Meta Reference GPU Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations:

{{ providers_table }}

Note that you need access to NVIDIA GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.

{% if run_config_env_vars %}

### Environment Variables

The following environment variables can be configured:

{% for var, (default_value, description) in run_config_env_vars.items() %}
- `{{ var }}`: {{ description }} (default: `{{ default_value }}`)
{% endfor %}
{% endif %}

## Prerequisite: Downloading Models

Please check that you have llama model checkpoints downloaded in `~/.llama` before proceeding. See the installation guide for instructions on downloading the models with the Hugging Face CLI.
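As a rough sketch, a download with the Hugging Face CLI might look like the following; the model ID is just an example, and the `--local-dir` target is an assumption about where you want the checkpoints to live, not a requirement:

```bash
# Log in once; meta-llama repositories are gated and require an access token.
huggingface-cli login

# Download an example model. --local-dir is optional; without it the files go
# into the regular Hugging Face cache.
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
  --local-dir ~/.llama/checkpoints/Llama-3.2-3B-Instruct

# List models already present in the local Hugging Face cache.
huggingface-cli scan-cache
```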


## Running the Distribution

You can do this via venv or via Docker, which has a pre-built image.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  --gpus all \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -e INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  llamastack/distribution-{{ name }} \
  --port $LLAMA_STACK_PORT
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
docker run \
  -it \
  --pull always \
  --gpus all \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -e INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  -e SAFETY_MODEL=meta-llama/Llama-Guard-3-1B \
  llamastack/distribution-{{ name }} \
  --port $LLAMA_STACK_PORT
```
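Once the container reports that the server is listening, a quick sanity check is to query the health endpoint; this assumes the default `/v1/health` route is exposed on the port you published:

```bash
# The server should return a successful response once startup has finished.
curl http://localhost:$LLAMA_STACK_PORT/v1/health
```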

### Via venv

Make sure you have run `uv pip install llama-stack` and have the Llama Stack CLI available.

```bash
llama stack build --distro {{ name }} --image-type venv
INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
llama stack run distributions/{{ name }}/run.yaml \
  --port 8321
```

If you are using Llama Stack Safety / Shield APIs, use:

```bash
INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
SAFETY_MODEL=meta-llama/Llama-Guard-3-1B \
llama stack run distributions/{{ name }}/run-with-safety.yaml \
  --port 8321
```