Merge branch 'main' into dell_tgi

Xi Yan 2024-10-17 13:48:00 -07:00
commit 560b3b5461
18 changed files with 101 additions and 102 deletions


@@ -90,10 +90,10 @@ The `llama` CLI makes it easy to work with the Llama Stack set of tools. Please
 * [CLI reference](docs/cli_reference.md)
   * Guide using `llama` CLI to work with Llama models (download, study prompts), and building/starting a Llama Stack distribution.
 * [Getting Started](docs/getting_started.md)
-  * Guide to build and run a Llama Stack server.
+  * Guide to start a Llama Stack server.
+  * [Jupyter notebook](./docs/getting_started.ipynb) to walk-through how to use simple text and vision inference llama_stack_client APIs
 * [Contributing](CONTRIBUTING.md)

 ## Llama Stack Client SDK

 | **Language** | **Client SDK** | **Package** |
@@ -104,3 +104,5 @@ The `llama` CLI makes it easy to work with the Llama Stack set of tools. Please
 | Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) |

 Check out our client SDKs for connecting to Llama Stack server in your preferred language, you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) programming languages to quickly build your applications.
+
+You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.


@@ -1,45 +1,9 @@
-# llama-stack
+# Getting Started with Llama Stack
-[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-stack)](https://pypi.org/project/llama-stack/)
-[![Discord](https://img.shields.io/discord/1257833999603335178)](https://discord.gg/llama-stack)
-This repository contains the specifications and implementations of the APIs which are part of the Llama Stack.
-The Llama Stack defines and standardizes the building blocks needed to bring generative AI applications to market. These blocks span the entire development lifecycle: from model training and fine-tuning, through product evaluation, to invoking AI agents in production. Beyond definition, we're developing open-source versions and partnering with cloud providers, ensuring developers can assemble AI solutions using consistent, interlocking pieces across platforms. The ultimate goal is to accelerate innovation in the AI space.
-The Stack APIs are rapidly improving, but still very much work in progress and we invite feedback as well as direct contributions.
-## APIs
-The Llama Stack consists of the following set of APIs:
-- Inference
-- Safety
-- Memory
-- Agentic System
-- Evaluation
-- Post Training
-- Synthetic Data Generation
-- Reward Scoring
-Each of the APIs themselves is a collection of REST endpoints.
-## API Providers
-A Provider is what makes the API real -- they provide the actual implementation backing the API.
-As an example, for Inference, we could have the implementation be backed by open source libraries like `[ torch | vLLM | TensorRT ]` as possible options.
-A provider can also be just a pointer to a remote REST service -- for example, cloud providers or dedicated inference providers could serve these APIs.
-## Llama Stack Distribution
-A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well always using the same uniform set of APIs for developing Generative AI applications.
+This guide will walk you through the steps to get started with the end-to-end flow for Llama Stack. It mainly focuses on building a Llama Stack distribution and starting up a Llama Stack server. Please see our [documentation](../README.md) for what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps built with Llama Stack.

 ## Installation
+The `llama` CLI tool helps you setup and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package.
 You can install this repository as a [package](https://pypi.org/project/llama-stack/) with `pip install llama-stack`
@@ -57,26 +21,40 @@ cd llama-stack
 $CONDA_PREFIX/bin/pip install -e .
 ```

-# Getting Started
+For what you can do with the Llama CLI, please refer to the [CLI Reference](./cli_reference.md).
-The `llama` CLI tool helps you setup and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package.
+## Quick Starting Llama Stack Server
-This guides allows you to quickly get started with building and running a Llama Stack server in < 5 minutes!
+#### Starting up server via docker
-You may also checkout this [notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) for trying out out demo scripts.
+We provide two pre-built Docker images of the Llama Stack distribution, which can be found at the following links.
+- [llamastack-local-gpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-gpu/general)
+  - This is a packaged version with our local meta-reference implementations, where you will be running inference locally with downloaded Llama model checkpoints.
+- [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general)
+  - This is a lite version with remote inference, where you can hook up to your favourite remote inference framework (e.g. ollama, fireworks, together, tgi) to run inference without a GPU.
-## Quick Cheatsheet
+> [!NOTE]
+> For GPU inference, you need to set the environment variable pointing to the local directory containing your model checkpoints, and enable GPU inference when starting the docker container.
-#### Via docker
 ```
-docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack-local-gpu
+export LLAMA_CHECKPOINT_DIR=~/.llama
 ```
 > [!NOTE]
 > `~/.llama` should be the path containing downloaded weights of Llama models.
-#### Via conda
+To download and start running a pre-built docker container, you may use the following commands:
+```
+docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack/llamastack-local-gpu
+```
+> [!TIP]
+> You may use `docker compose up` to start a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). You can check out [these scripts](../llama_stack/distribution/docker/README.md) to help you get started.
+#### Build->Configure->Run Llama Stack server via conda
+You may also build a Llama Stack distribution from scratch, configure it, and start running the distribution. This is useful for developing on Llama Stack.
 **`llama stack build`**
 - You'll be prompted to enter build information interactively.
 ```
@@ -182,6 +160,7 @@ INFO: Application startup complete.
 INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
 ```

+## Building a Distribution
 ## Step 1. Build
 In the following steps, imagine we'll be working with a `Meta-Llama3.1-8B-Instruct` model. We will name our build `8b-instruct` to help us remember the config. We will start build our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify:
@@ -445,4 +424,7 @@ Similarly you can test safety (if you configured llama-guard and/or prompt-guard
 python -m llama_stack.apis.safety.client localhost 5000
 ```

+Check out our client SDKs for connecting to Llama Stack server in your preferred language, you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) programming languages to quickly build your applications.
 You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
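To sanity-check a server started with either the docker or conda flow above, a minimal Python sketch is shown below. It assumes the `llama-stack-client` package is installed and that its `LlamaStackClient.inference.chat_completion` call and the model identifier match what your distribution actually serves; treat the names and arguments as illustrative rather than authoritative.

```
# Minimal sketch: query a locally running Llama Stack server on port 5000.
# Assumes `pip install llama-stack-client`; the model identifier below is a
# placeholder -- use whichever model your distribution is configured to serve.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

response = client.inference.chat_completion(
    model="Llama3.1-8B-Instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Write a two-sentence poem about llamas."}],
)
print(response.completion_message.content)
```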


@@ -152,26 +152,28 @@ def run_download_cmd(args: argparse.Namespace, parser: argparse.ArgumentParser):
         parser.error("Please provide a model id")
         return

+    # Check if model_id is a comma-separated list
+    model_ids = [model_id.strip() for model_id in args.model_id.split(",")]

     prompt_guard = prompt_guard_model_sku()
-    if args.model_id == prompt_guard.model_id:
+    for model_id in model_ids:
+        if model_id == prompt_guard.model_id:
             model = prompt_guard
             info = prompt_guard_download_info()
         else:
-            model = resolve_model(args.model_id)
+            model = resolve_model(model_id)
             if model is None:
-                parser.error(f"Model {args.model_id} not found")
-                return
+                parser.error(f"Model {model_id} not found")
+                continue
             info = llama_meta_net_info(model)

         if args.source == "huggingface":
             _hf_download(model, args.hf_token, args.ignore_patterns, parser)
         else:
-            meta_url = args.meta_url
-            if not meta_url:
-                meta_url = input(
-                    "Please provide the signed URL you received via email after visiting https://www.llama.com/llama-downloads/ (e.g., https://llama3-1.llamameta.net/*?Policy...): "
-                )
-            assert meta_url is not None and "llamameta.net" in meta_url
+            meta_url = args.meta_url or input(
+                f"Please provide the signed URL for model {model_id} you received via email after visiting https://www.llama.com/llama-downloads/ (e.g., https://llama3-1.llamameta.net/*?Policy...): "
+            )
+            assert "llamameta.net" in meta_url
             _meta_download(model, meta_url, info)
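The change above makes the download command accept a comma-separated list of model IDs and skip unknown entries instead of aborting. The standalone sketch below mirrors that control flow outside the CLI; the resolver and the model IDs are stubs used purely for illustration.

```
# Illustrative stand-alone version of the comma-separated model-id handling
# introduced above. The resolver is a stub; the real CLI resolves ids against
# its own model registry.
from typing import Optional

KNOWN_MODELS = {"Llama3.1-8B-Instruct", "Llama-Guard-3-1B"}  # hypothetical ids


def resolve_model_stub(model_id: str) -> Optional[str]:
    return model_id if model_id in KNOWN_MODELS else None


def download_many(model_id_arg: str) -> None:
    # Split "a,b, c" into ["a", "b", "c"], tolerating stray whitespace.
    model_ids = [m.strip() for m in model_id_arg.split(",")]
    for model_id in model_ids:
        model = resolve_model_stub(model_id)
        if model is None:
            print(f"Model {model_id} not found, skipping")  # `continue`, not `return`
            continue
        print(f"downloading {model} ...")


download_many("Llama3.1-8B-Instruct, Llama-Guard-3-1B, not-a-model")
```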


@@ -1,4 +1,4 @@
-name: local
+name: local-gpu
 distribution_spec:
   description: Use code from `llama_stack` itself to serve all llama stack APIs
   providers:
@@ -7,4 +7,4 @@ distribution_spec:
     safety: meta-reference
     agents: meta-reference
     telemetry: meta-reference
-image_type: conda
+image_type: docker


@@ -1,11 +1,11 @@
-name: local-gpu
+name: local-tgi-chroma
 distribution_spec:
-  description: local meta reference
+  description: remote tgi inference + chromadb memory
   docker_image: null
   providers:
-    inference: meta-reference
+    inference: remote::tgi
     safety: meta-reference
     agents: meta-reference
-    memory: meta-reference
+    memory: remote::chromadb
     telemetry: meta-reference
 image_type: docker
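Because this template points inference at a remote TGI endpoint and memory at a remote Chroma instance, both services must be reachable before the distribution is started. A hedged pre-flight check is sketched below in Python; the TGI `/health` route, the `chromadb` package's `HttpClient(...).heartbeat()` call, and the host/port values are all assumptions to adjust for your setup.

```
# Pre-flight check sketch for the remote providers named in the template above.
# Assumes `pip install requests chromadb`, that TGI exposes GET /health, and
# that the hosts/ports below are placeholders for your own endpoints.
import requests
import chromadb

TGI_URL = "http://127.0.0.1:5009"              # placeholder TGI endpoint
CHROMA_HOST, CHROMA_PORT = "localhost", 8000   # placeholder Chroma endpoint

# TGI liveness: a 200 from /health means the inference server is up.
resp = requests.get(f"{TGI_URL}/health", timeout=5)
resp.raise_for_status()
print("TGI is reachable")

# Chroma liveness: heartbeat() raises if the server cannot be reached.
chroma = chromadb.HttpClient(host=CHROMA_HOST, port=CHROMA_PORT)
print("Chroma heartbeat:", chroma.heartbeat())
```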


@@ -1,16 +1,16 @@
 version: '2'
-built_at: '2024-10-08T17:42:33.690666'
-image_name: local-gpu
-docker_image: local-gpu
-conda_env: null
+built_at: '2024-10-08T17:40:45.325529'
+image_name: local
+docker_image: null
+conda_env: local
 apis:
-- memory
-- inference
-- agents
 - shields
-- safety
+- agents
 - models
+- memory
 - memory_banks
+- inference
+- safety
 providers:
   inference:
   - provider_id: meta-reference
@@ -25,8 +25,13 @@ providers:
   - provider_id: meta-reference
     provider_type: meta-reference
     config:
-      llama_guard_shield: null
-      prompt_guard_shield: null
+      llama_guard_shield:
+        model: Llama-Guard-3-1B
+        excluded_categories: []
+        disable_input_check: false
+        disable_output_check: false
+      prompt_guard_shield:
+        model: Prompt-Guard-86M
   memory:
   - provider_id: meta-reference
     provider_type: meta-reference


@@ -1,29 +1,33 @@
 version: '2'
-built_at: '2024-10-08T17:42:07.505267'
-image_name: local-cpu
-docker_image: local-cpu
-conda_env: null
+built_at: '2024-10-08T17:40:45.325529'
+image_name: local
+docker_image: null
+conda_env: local
 apis:
-- shields
 - agents
-- inference
 - models
 - memory
-- safety
+- shields
 - memory_banks
+- inference
+- safety
 providers:
   inference:
-  - provider_id: remote::ollama
-    provider_type: remote::ollama
+  - provider_id: tgi0
+    provider_type: remote::tgi
     config:
-      host: localhost
-      port: 6000
+      url: http://127.0.0.1:5009
   safety:
   - provider_id: meta-reference
     provider_type: meta-reference
     config:
-      llama_guard_shield: null
-      prompt_guard_shield: null
+      llama_guard_shield:
+        model: Llama-Guard-3-1B
+        excluded_categories: []
+        disable_input_check: false
+        disable_output_check: false
+      prompt_guard_shield:
+        model: Prompt-Guard-86M
   memory:
   - provider_id: meta-reference
     provider_type: meta-reference
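With the shield entries now spelled out (Llama-Guard-3-1B for input/output checks, Prompt-Guard-86M for prompt screening), it can be handy to confirm what a generated run config actually contains before launching the server. A small sketch using PyYAML is shown below; the file path is a placeholder, and the dictionary keys simply mirror the structure visible in the diff above.

```
# Sketch: inspect the safety-shield section of a generated run.yaml.
# Assumes `pip install pyyaml`; the path below is a placeholder -- point it
# at your own run config.
import yaml

with open("run.yaml") as f:
    run_config = yaml.safe_load(f)

safety_cfg = run_config["providers"]["safety"][0]["config"]
print("llama_guard_shield:", safety_cfg.get("llama_guard_shield"))
print("prompt_guard_shield:", safety_cfg.get("prompt_guard_shield"))
```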


@@ -5,6 +5,7 @@
 # the root directory of this source tree.
 import itertools
+import os

 import pytest
 import pytest_asyncio
@@ -50,14 +51,17 @@ def get_expected_stop_reason(model: str):
     return StopReason.end_of_message if "Llama3.1" in model else StopReason.end_of_turn

+if "MODEL_IDS" not in os.environ:
+    MODEL_IDS = [Llama_8B, Llama_3B]
+else:
+    MODEL_IDS = os.environ["MODEL_IDS"].split(",")

 # This is going to create multiple Stack impls without tearing down the previous one
 # Fix that!
 @pytest_asyncio.fixture(
     scope="session",
-    params=[
-        {"model": Llama_8B},
-        {"model": Llama_3B},
-    ],
+    params=[{"model": m} for m in MODEL_IDS],
     ids=lambda d: d["model"],
 )
 async def inference_settings(request):
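The fixture now derives its parametrization from a `MODEL_IDS` environment variable, falling back to the two built-in models when the variable is unset. A sketch of driving that from Python is below; the model-id strings and the test path are placeholders, not the exact values used in the repository.

```
# Sketch: run the inference tests against a custom set of models by setting
# MODEL_IDS before pytest collects the fixture. The model ids and test path
# below are placeholders -- substitute the ones your checkout actually uses.
import os
import pytest

os.environ["MODEL_IDS"] = "Llama3.1-8B-Instruct,Llama3.2-3B-Instruct"
raise SystemExit(pytest.main(["-v", "llama_stack/providers/tests/inference/"]))
```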