Integrate distro docs into the restructured docs

This commit is contained in:
Ashwin Bharambe 2024-11-20 23:20:05 -08:00
parent 2411a44833
commit cd6ccb664c
17 changed files with 306 additions and 115 deletions

View file

@@ -222,6 +222,40 @@
     "sentence-transformers --no-deps",
     "torch --index-url https://download.pytorch.org/whl/cpu"
   ],
+  "meta-reference-quantized-gpu": [
+    "accelerate",
+    "aiosqlite",
+    "blobfile",
+    "chardet",
+    "chromadb-client",
+    "fairscale",
+    "faiss-cpu",
+    "fastapi",
+    "fbgemm-gpu",
+    "fire",
+    "httpx",
+    "lm-format-enforcer",
+    "matplotlib",
+    "nltk",
+    "numpy",
+    "pandas",
+    "pillow",
+    "psycopg2-binary",
+    "pypdf",
+    "redis",
+    "scikit-learn",
+    "scipy",
+    "sentencepiece",
+    "torch",
+    "torchao==0.5.0",
+    "torchvision",
+    "tqdm",
+    "transformers",
+    "uvicorn",
+    "zmq",
+    "sentence-transformers --no-deps",
+    "torch --index-url https://download.pytorch.org/whl/cpu"
+  ],
   "ollama": [
     "aiohttp",
     "aiosqlite",

View file

@@ -1,4 +1,5 @@
 # Bedrock Distribution
 ```{toctree}
 :maxdepth: 2
 :hidden:

View file

@@ -7,55 +7,86 @@
 self
 ```
 
-The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations.
-
-| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
-|----------------- |------------------------ |---------------- |-------------------------------------------------- |---------------- |---------------- |
-| **Provider(s)** | meta-reference-quantized | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |
+The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations:
+
+| API | Provider(s) |
+|-----|-------------|
+| agents | `inline::meta-reference` |
+| inference | `inline::meta-reference-quantized` |
+| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
+| safety | `inline::llama-guard` |
+| telemetry | `inline::meta-reference` |
 
 The only difference vs. the `meta-reference-gpu` distribution is that it has support for more efficient inference -- with fp8, int4 quantization, etc.
 
-### Step 0. Prerequisite - Downloading Models
-
-Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) here to download the models.
+Note that you need access to nvidia GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.
+
+### Environment Variables
+
+The following environment variables can be configured:
+
+- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
+- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
+- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
+
+## Prerequisite: Downloading Models
+
+Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) here to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
 
 ```
 $ ls ~/.llama/checkpoints
-Llama3.2-3B-Instruct:int4-qlora-eo8
+Llama3.1-8B           Llama3.2-11B-Vision-Instruct  Llama3.2-1B-Instruct  Llama3.2-90B-Vision-Instruct  Llama-Guard-3-8B
+Llama3.1-8B-Instruct  Llama3.2-1B                   Llama3.2-3B-Instruct  Llama-Guard-3-1B              Prompt-Guard-86M
 ```
 
-### Step 1. Start the Distribution
-
-#### (Option 1) Start with Docker
-
-```
-$ cd distributions/meta-reference-quantized-gpu && docker compose up
-```
-
-> [!NOTE]
-> This assumes you have access to GPU to start a local server with access to your GPU.
-
-> [!NOTE]
-> `~/.llama` should be the path containing downloaded weights of Llama models.
-
-This will download and start running a pre-built docker container. Alternatively, you may use the following commands:
-
-```
-docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-quantized-gpu --yaml_config /root/my-run.yaml
-```
-
-#### (Option 2) Start with Conda
-
-1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)
-
-2. Build the `meta-reference-quantized-gpu` distribution
-
-```
-$ llama stack build --template meta-reference-quantized-gpu --image-type conda
-```
-
-3. Start running distribution
-
-```
-$ cd distributions/meta-reference-quantized-gpu
-$ llama stack run ./run.yaml
-```
+## Running the Distribution
+
+You can do this via Conda (build code) or Docker which has a pre-built image.
+
+### Via Docker
+
+This method allows you to get started quickly without having to build the distribution code.
+
+```bash
+LLAMA_STACK_PORT=5001
+docker run \
+  -it \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  llamastack/distribution-meta-reference-quantized-gpu \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
+```
+
+If you are using Llama Stack Safety / Shield APIs, use:
+
+```bash
+docker run \
+  -it \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  llamastack/distribution-meta-reference-quantized-gpu \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
+  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
+```
+
+### Via Conda
+
+Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
+
+```bash
+llama stack build --template meta-reference-quantized-gpu --image-type conda
+llama stack run distributions/meta-reference-quantized-gpu/run.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
+```
+
+If you are using Llama Stack Safety / Shield APIs, use:
+
+```bash
+llama stack run distributions/meta-reference-quantized-gpu/run-with-safety.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
+  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
+```
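
Not part of this diff, but a quick way to sanity-check a server started as documented above on `$LLAMA_STACK_PORT`. The sketch uses the `llama-stack-client` Python package; the exact client call shown is an assumption based on its documented usage, not something introduced by this commit.

```python
# Hypothetical smoke test: assumes `pip install llama-stack-client` and a
# distribution server already listening on localhost:5001 (LLAMA_STACK_PORT).
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")

response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.completion_message.content)
```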

View file

@@ -1,5 +1,4 @@
 # Remote vLLM Distribution
 ```{toctree}
 :maxdepth: 2
 :hidden:

View file

@@ -4,7 +4,7 @@
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
 
-from typing import Optional
+from typing import Any, Dict, Optional
 
 from llama_models.datatypes import * # noqa: F403
 from llama_models.sku_list import resolve_model
@@ -56,6 +56,7 @@ class MetaReferenceInferenceConfig(BaseModel):
         cls,
         model: str = "Llama3.2-3B-Instruct",
         checkpoint_dir: str = "${env.CHECKPOINT_DIR:null}",
+        **kwargs,
     ) -> Dict[str, Any]:
         return {
             "model": model,
@@ -66,3 +67,16 @@ class MetaReferenceInferenceConfig(BaseModel):
 
 class MetaReferenceQuantizedInferenceConfig(MetaReferenceInferenceConfig):
     quantization: QuantizationConfig
+
+    @classmethod
+    def sample_run_config(
+        cls,
+        model: str = "Llama3.2-3B-Instruct",
+        checkpoint_dir: str = "${env.CHECKPOINT_DIR:null}",
+        **kwargs,
+    ) -> Dict[str, Any]:
+        config = super().sample_run_config(model, checkpoint_dir, **kwargs)
+        config["quantization"] = {
+            "type": "fp8",
+        }
+        return config
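
For orientation (not part of the diff): given the override above, the sample run config for the quantized provider should line up with the `config:` block of the generated run.yaml added later in this commit. A sketch of the resulting dict, with the base-class keys inferred from that run.yaml:

```python
# Sketch of the expected output of
# MetaReferenceQuantizedInferenceConfig.sample_run_config(
#     model="${env.INFERENCE_MODEL}",
#     checkpoint_dir="${env.INFERENCE_CHECKPOINT_DIR:null}",
# )
# "max_seq_len" comes from the base class and is inferred from run.yaml (assumption).
expected_quantized_run_config = {
    "model": "${env.INFERENCE_MODEL}",
    "max_seq_len": 4096,
    "checkpoint_dir": "${env.INFERENCE_CHECKPOINT_DIR:null}",
    "quantization": {"type": "fp8"},
}
```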

View file

@@ -50,7 +50,7 @@ def process_template(template_dir: Path, progress) -> None:
         template.save_distribution(
             yaml_output_dir=REPO_ROOT / "llama_stack" / "templates" / template.name,
             doc_output_dir=REPO_ROOT
-            / "docs/source/getting_started/distributions"
+            / "docs/source/distributions"
             / f"{template.distro_type}_distro",
         )
     else:

View file

@@ -1,5 +1,12 @@
 # Bedrock Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
 
 The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations:
 
 {{ providers_table }}

View file

@@ -1,5 +1,12 @@
 # Fireworks Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
 
 The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations.
 
 {{ providers_table }}

View file

@@ -1,5 +1,12 @@
 # Meta Reference Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
 
 The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations:
 
 {{ providers_table }}

View file

@@ -1,13 +1,19 @@
+version: '2'
 name: meta-reference-quantized-gpu
 distribution_spec:
-  docker_image: pytorch/pytorch:2.5.0-cuda12.4-cudnn9-runtime
-  description: Use code from `llama_stack` itself to serve all llama stack APIs
+  description: Use Meta Reference with fp8, int4 quantization for running LLM inference
+  docker_image: null
   providers:
-    inference: meta-reference-quantized
+    inference:
+    - inline::meta-reference-quantized
     memory:
     - inline::faiss
     - remote::chromadb
     - remote::pgvector
-    safety: inline::llama-guard
-    agents: inline::meta-reference
-    telemetry: inline::meta-reference
+    safety:
+    - inline::llama-guard
+    agents:
+    - inline::meta-reference
+    telemetry:
+    - inline::meta-reference
+image_type: conda

View file

@@ -1,54 +1,87 @@
 # Meta Reference Quantized Distribution
 
-The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations.
-
-| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
-|----------------- |------------------------ |---------------- |-------------------------------------------------- |---------------- |---------------- |
-| **Provider(s)** | meta-reference-quantized | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
+
+The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations:
+
+{{ providers_table }}
 
 The only difference vs. the `meta-reference-gpu` distribution is that it has support for more efficient inference -- with fp8, int4 quantization, etc.
 
-### Step 0. Prerequisite - Downloading Models
-
-Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) here to download the models.
+Note that you need access to nvidia GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.
+
+{% if run_config_env_vars %}
+### Environment Variables
+
+The following environment variables can be configured:
+
+{% for var, (default_value, description) in run_config_env_vars.items() %}
+- `{{ var }}`: {{ description }} (default: `{{ default_value }}`)
+{% endfor %}
+{% endif %}
+
+## Prerequisite: Downloading Models
+
+Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) here to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
 
 ```
 $ ls ~/.llama/checkpoints
-Llama3.2-3B-Instruct:int4-qlora-eo8
+Llama3.1-8B           Llama3.2-11B-Vision-Instruct  Llama3.2-1B-Instruct  Llama3.2-90B-Vision-Instruct  Llama-Guard-3-8B
+Llama3.1-8B-Instruct  Llama3.2-1B                   Llama3.2-3B-Instruct  Llama-Guard-3-1B              Prompt-Guard-86M
 ```
 
-### Step 1. Start the Distribution
-
-#### (Option 1) Start with Docker
-
-```
-$ cd distributions/meta-reference-quantized-gpu && docker compose up
-```
-
-> [!NOTE]
-> This assumes you have access to GPU to start a local server with access to your GPU.
-
-> [!NOTE]
-> `~/.llama` should be the path containing downloaded weights of Llama models.
-
-This will download and start running a pre-built docker container. Alternatively, you may use the following commands:
-
-```
-docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-quantized-gpu --yaml_config /root/my-run.yaml
-```
-
-#### (Option 2) Start with Conda
-
-1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)
-
-2. Build the `meta-reference-quantized-gpu` distribution
-
-```
-$ llama stack build --template meta-reference-quantized-gpu --image-type conda
-```
-
-3. Start running distribution
-
-```
-$ cd distributions/meta-reference-quantized-gpu
-$ llama stack run ./run.yaml
-```
+## Running the Distribution
+
+You can do this via Conda (build code) or Docker which has a pre-built image.
+
+### Via Docker
+
+This method allows you to get started quickly without having to build the distribution code.
+
+```bash
+LLAMA_STACK_PORT=5001
+docker run \
+  -it \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  llamastack/distribution-{{ name }} \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
+```
+
+If you are using Llama Stack Safety / Shield APIs, use:
+
+```bash
+docker run \
+  -it \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  llamastack/distribution-{{ name }} \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
+  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
+```
+
+### Via Conda
+
+Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
+
+```bash
+llama stack build --template {{ name }} --image-type conda
+llama stack run distributions/{{ name }}/run.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
+```
+
+If you are using Llama Stack Safety / Shield APIs, use:
+
+```bash
+llama stack run distributions/{{ name }}/run-with-safety.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
+  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
+```
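
The `run_config_env_vars` block added above is a Jinja fragment. As an illustration only (jinja2 as the engine is an assumption; the real rendering is done by `DistributionTemplate.save_distribution`, which this commit points at the new `docs/source/distributions` output directory), it expands into the Environment Variables list seen in the generated doc:

```python
from jinja2 import Template

# Renders just the env-vars fragment of doc_template.md with sample data taken
# from this commit's run_config_env_vars; trim/lstrip keep the output tidy.
fragment = """\
{% if run_config_env_vars %}
### Environment Variables

The following environment variables can be configured:

{% for var, (default_value, description) in run_config_env_vars.items() %}
- `{{ var }}`: {{ description }} (default: `{{ default_value }}`)
{% endfor %}
{% endif %}
"""

print(
    Template(fragment, trim_blocks=True, lstrip_blocks=True).render(
        run_config_env_vars={
            "LLAMASTACK_PORT": ("5001", "Port for the Llama Stack distribution server"),
            "INFERENCE_MODEL": (
                "meta-llama/Llama-3.2-3B-Instruct",
                "Inference model loaded into the Meta Reference server",
            ),
        }
    )
)
```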

View file

@@ -6,16 +6,16 @@
 
 from pathlib import Path
 
-from llama_stack.distribution.datatypes import ModelInput, Provider, ShieldInput
+from llama_stack.distribution.datatypes import ModelInput, Provider
 from llama_stack.providers.inline.inference.meta_reference import (
-    MetaReferenceInferenceConfig,
+    MetaReferenceQuantizedInferenceConfig,
 )
 from llama_stack.templates.template import DistributionTemplate, RunConfigSettings
 
 
 def get_distribution_template() -> DistributionTemplate:
     providers = {
-        "inference": ["inline::meta-reference"],
+        "inference": ["inline::meta-reference-quantized"],
         "memory": ["inline::faiss", "remote::chromadb", "remote::pgvector"],
         "safety": ["inline::llama-guard"],
         "agents": ["inline::meta-reference"],
@@ -24,8 +24,8 @@ def get_distribution_template() -> DistributionTemplate:
 
     inference_provider = Provider(
         provider_id="meta-reference-inference",
-        provider_type="inline::meta-reference",
-        config=MetaReferenceInferenceConfig.sample_run_config(
+        provider_type="inline::meta-reference-quantized",
+        config=MetaReferenceQuantizedInferenceConfig.sample_run_config(
            model="${env.INFERENCE_MODEL}",
            checkpoint_dir="${env.INFERENCE_CHECKPOINT_DIR:null}",
        ),
@@ -35,18 +35,13 @@ def get_distribution_template() -> DistributionTemplate:
         model_id="${env.INFERENCE_MODEL}",
         provider_id="meta-reference-inference",
     )
-    safety_model = ModelInput(
-        model_id="${env.SAFETY_MODEL}",
-        provider_id="meta-reference-safety",
-    )
 
     return DistributionTemplate(
-        name="meta-reference-gpu",
+        name="meta-reference-quantized-gpu",
         distro_type="self_hosted",
-        description="Use Meta Reference for running LLM inference",
+        description="Use Meta Reference with fp8, int4 quantization for running LLM inference",
         template_path=Path(__file__).parent / "doc_template.md",
         providers=providers,
-        default_models=[inference_model, safety_model],
+        default_models=[inference_model],
         run_configs={
             "run.yaml": RunConfigSettings(
                 provider_overrides={
@@ -54,26 +49,6 @@ def get_distribution_template() -> DistributionTemplate:
                 },
                 default_models=[inference_model],
             ),
-            "run-with-safety.yaml": RunConfigSettings(
-                provider_overrides={
-                    "inference": [
-                        inference_provider,
-                        Provider(
-                            provider_id="meta-reference-safety",
-                            provider_type="inline::meta-reference",
-                            config=MetaReferenceInferenceConfig.sample_run_config(
-                                model="${env.SAFETY_MODEL}",
-                                checkpoint_dir="${env.SAFETY_CHECKPOINT_DIR:null}",
-                            ),
-                        ),
-                    ],
-                },
-                default_models=[
-                    inference_model,
-                    safety_model,
-                ],
-                default_shields=[ShieldInput(shield_id="${env.SAFETY_MODEL}")],
-            ),
         },
         run_config_env_vars={
             "LLAMASTACK_PORT": (
@@ -88,13 +63,5 @@ def get_distribution_template() -> DistributionTemplate:
                 "null",
                 "Directory containing the Meta Reference model checkpoint",
             ),
-            "SAFETY_MODEL": (
-                "meta-llama/Llama-Guard-3-1B",
-                "Name of the safety (Llama-Guard) model to use",
-            ),
-            "SAFETY_CHECKPOINT_DIR": (
-                "null",
-                "Directory containing the Llama-Guard model checkpoint",
-            ),
         },
     )

View file

@@ -0,0 +1,58 @@
+version: '2'
+image_name: meta-reference-quantized-gpu
+docker_image: null
+conda_env: meta-reference-quantized-gpu
+apis:
+- agents
+- inference
+- memory
+- safety
+- telemetry
+providers:
+  inference:
+  - provider_id: meta-reference-inference
+    provider_type: inline::meta-reference-quantized
+    config:
+      model: ${env.INFERENCE_MODEL}
+      max_seq_len: 4096
+      checkpoint_dir: ${env.INFERENCE_CHECKPOINT_DIR:null}
+      quantization:
+        type: fp8
+  memory:
+  - provider_id: faiss
+    provider_type: inline::faiss
+    config:
+      kvstore:
+        type: sqlite
+        namespace: null
+        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/meta-reference-quantized-gpu}/faiss_store.db
+  safety:
+  - provider_id: llama-guard
+    provider_type: inline::llama-guard
+    config: {}
+  agents:
+  - provider_id: meta-reference
+    provider_type: inline::meta-reference
+    config:
+      persistence_store:
+        type: sqlite
+        namespace: null
+        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/meta-reference-quantized-gpu}/agents_store.db
+  telemetry:
+  - provider_id: meta-reference
+    provider_type: inline::meta-reference
+    config: {}
+metadata_store:
+  namespace: null
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/meta-reference-quantized-gpu}/registry.db
+models:
+- metadata: {}
+  model_id: ${env.INFERENCE_MODEL}
+  provider_id: meta-reference-inference
+  provider_model_id: null
+shields: []
+memory_banks: []
+datasets: []
+scoring_fns: []
+eval_tasks: []
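
The run.yaml above leans on `${env.VAR:default}` placeholders throughout. As an illustration only (the real resolution happens inside llama_stack's config loading, not in this snippet, and the handling of a literal `null` default is an assumption), the substitution rule can be sketched like this:

```python
import os
import re

# Matches ${env.NAME} and ${env.NAME:default}, as used in the run.yaml above.
_ENV_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z0-9_]+)(?::([^}]*))?\}")


def resolve_env_placeholders(value: str) -> str:
    """Replace ${env.VAR:default} placeholders with environment values."""

    def _substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        resolved = os.environ.get(name, default)
        if resolved is None:
            raise ValueError(f"{name} is not set and no default was given")
        # Treating a literal "null" default as "empty" is an assumption.
        return "" if resolved == "null" else resolved

    return _ENV_PLACEHOLDER.sub(_substitute, value)


# Falls back to the default path when SQLITE_STORE_DIR is unset:
print(resolve_env_placeholders(
    "${env.SQLITE_STORE_DIR:~/.llama/distributions/meta-reference-quantized-gpu}/faiss_store.db"
))
```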

View file

@@ -1,5 +1,12 @@
 # Ollama Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
 
 The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations.
 
 {{ providers_table }}

View file

@@ -1,4 +1,10 @@
 # Remote vLLM Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
 
 The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations:

View file

@@ -1,5 +1,12 @@
 # TGI Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
 
 The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations.
 
 {{ providers_table }}

View file

@@ -1,4 +1,11 @@
-# Fireworks Distribution
+# Together Distribution
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+
+self
+```
 
 The `llamastack/distribution-{{ name }}` distribution consists of the following provider configurations.