Merge branch 'main' into dell_tgi

Xi Yan 2024-10-17 13:48:00 -07:00
commit 560b3b5461
18 changed files with 101 additions and 102 deletions


@@ -90,10 +90,10 @@ The `llama` CLI makes it easy to work with the Llama Stack set of tools. Please
* [CLI reference](docs/cli_reference.md)
* Guide on using the `llama` CLI to work with Llama models (download, study prompts) and to build/start a Llama Stack distribution.
* [Getting Started](docs/getting_started.md)
* Guide to build and run a Llama Stack server.
* Guide to start a Llama Stack server.
* [Jupyter notebook](./docs/getting_started.ipynb) walking through how to use the llama_stack_client APIs for simple text and vision inference
* [Contributing](CONTRIBUTING.md)
## Llama Stack Client SDK
| **Language** | **Client SDK** | **Package** |
@@ -104,3 +104,5 @@ The `llama` CLI makes it easy to work with the Llama Stack set of tools. Please
| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) |
Check out our client SDKs for connecting to a Llama Stack server in your preferred language: you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
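As a quick, hedged illustration of what the python SDK looks like in use, here is a minimal sketch; the client class and method names (`LlamaStackClient`, `inference.chat_completion`) and the model id are assumptions based on the SDK repos above, so check the python SDK's README for the current API.
```python
# Minimal sketch (assumed API surface -- see llama-stack-client-python for the real one).
# pip install llama-stack-client
from llama_stack_client import LlamaStackClient

# Point the client at a running Llama Stack server (the default port used in these docs is 5000).
client = LlamaStackClient(base_url="http://localhost:5000")

# Ask the Inference API for a chat completion (model id is illustrative).
response = client.inference.chat_completion(
    model="Llama3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about Llama Stack."}],
)
print(response)
```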


@@ -1,45 +1,9 @@
# llama-stack
[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-stack)](https://pypi.org/project/llama-stack/)
[![Discord](https://img.shields.io/discord/1257833999603335178)](https://discord.gg/llama-stack)
This repository contains the specifications and implementations of the APIs which are part of the Llama Stack.
The Llama Stack defines and standardizes the building blocks needed to bring generative AI applications to market. These blocks span the entire development lifecycle: from model training and fine-tuning, through product evaluation, to invoking AI agents in production. Beyond definition, we're developing open-source versions and partnering with cloud providers, ensuring developers can assemble AI solutions using consistent, interlocking pieces across platforms. The ultimate goal is to accelerate innovation in the AI space.
The Stack APIs are rapidly improving, but they are still very much a work in progress, and we invite feedback as well as direct contributions.
## APIs
The Llama Stack consists of the following set of APIs:
- Inference
- Safety
- Memory
- Agentic System
- Evaluation
- Post Training
- Synthetic Data Generation
- Reward Scoring
Each API is itself a collection of REST endpoints.
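For a sense of what "a collection of REST endpoints" means in practice, here is a hedged sketch of calling a server directly over HTTP; the route and payload shape below are assumptions for illustration, not the documented spec.
```python
# Hedged sketch: the route and payload shape are assumed for illustration only.
import requests

resp = requests.post(
    "http://localhost:5000/inference/chat_completion",  # assumed Inference API route
    json={
        "model": "Llama3.1-8B-Instruct",  # illustrative model id
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
    },
)
print(resp.status_code, resp.json())
```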
## API Providers
A Provider is what makes the API real -- it supplies the actual implementation backing the API.
As an example, for Inference, the implementation could be backed by open-source libraries like `[ torch | vLLM | TensorRT ]`.
A provider can also be just a pointer to a remote REST service -- for example, cloud providers or dedicated inference providers could serve these APIs.
## Llama Stack Distribution
A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix and match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally but choose a cloud provider for a large model. Regardless, the higher-level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well, always using the same uniform set of APIs for developing Generative AI applications.
# Getting Started with Llama Stack
This guide will walk you through the steps to get started with an end-to-end flow for Llama Stack. It mainly focuses on building a Llama Stack distribution and starting up a Llama Stack server. Please see our [documentation](../README.md) for what you can do with Llama Stack, and [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) for example apps built with Llama Stack.
## Installation
The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package.
You can install this repository as a [package](https://pypi.org/project/llama-stack/) with `pip install llama-stack`
@@ -57,26 +21,40 @@ cd llama-stack
$CONDA_PREFIX/bin/pip install -e .
```
# Getting Started
For what you can do with the Llama CLI, please refer to [CLI Reference](./cli_reference.md).
The `llama` CLI tool helps you set up and use the Llama toolchain & agentic systems. It should be available on your path after installing the `llama-stack` package.
## Quick Starting Llama Stack Server
This guide allows you to quickly get started with building and running a Llama Stack server in < 5 minutes!
#### Starting up server via docker
You may also check out this [notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) to try out our demo scripts.
We provide two pre-built Docker images of the Llama Stack distribution, which can be found at the following links.
- [llamastack-local-gpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-gpu/general)
- This is a packaged version with our local meta-reference implementations, where you will be running inference locally with downloaded Llama model checkpoints.
- [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general)
- This is a lite version with remote inference, where you can hook up to your favorite remote inference framework (e.g. ollama, fireworks, together, tgi) to run inference without a GPU.
## Quick Cheatsheet
#### Via docker
> [!NOTE]
> For GPU inference, you need to set the environment variable specifying the local directory containing your model checkpoints, and enable GPU access when starting the docker container.
```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack-local-gpu
export LLAMA_CHECKPOINT_DIR=~/.llama
```
> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.
#### Via conda
To download and start running a pre-built docker container, you may use the following commands:
```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack/llamastack-local-gpu
```
> [!TIP]
> Pro Tip: You can use `docker compose up` to start a distribution with remote providers (e.g. TGI) using [llamastack-local-cpu](https://hub.docker.com/repository/docker/llamastack/llamastack-local-cpu/general). You can check out [these scripts](../llama_stack/distribution/docker/README.md) to help you get started.
#### Build->Configure->Run Llama Stack server via conda
You may also build a LlamaStack distribution from scratch, configure it, and run it. This is useful for developing on LlamaStack.
**`llama stack build`**
- You'll be prompted to enter build information interactively.
```
@@ -182,6 +160,7 @@ INFO:     Application startup complete.
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
```
## Building a Distribution
## Step 1. Build
In the following steps, imagine we'll be working with a `Meta-Llama3.1-8B-Instruct` model. We will name our build `8b-instruct` to help us remember the config. We will now build our distribution (in the form of a Conda environment or Docker image). In this step, we will specify:
@@ -445,4 +424,7 @@ Similarly you can test safety (if you configured llama-guard and/or prompt-guard
python -m llama_stack.apis.safety.client localhost 5000
```
Check out our client SDKs for connecting to a Llama Stack server in your preferred language: you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) to quickly build your applications.
You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.


@@ -152,27 +152,29 @@ def run_download_cmd(args: argparse.Namespace, parser: argparse.ArgumentParser):
parser.error("Please provide a model id")
return
prompt_guard = prompt_guard_model_sku()
if args.model_id == prompt_guard.model_id:
model = prompt_guard
info = prompt_guard_download_info()
else:
model = resolve_model(args.model_id)
if model is None:
parser.error(f"Model {args.model_id} not found")
return
info = llama_meta_net_info(model)
# Check if model_id is a comma-separated list
model_ids = [model_id.strip() for model_id in args.model_id.split(",")]
if args.source == "huggingface":
_hf_download(model, args.hf_token, args.ignore_patterns, parser)
else:
meta_url = args.meta_url
if not meta_url:
meta_url = input(
"Please provide the signed URL you received via email after visiting https://www.llama.com/llama-downloads/ (e.g., https://llama3-1.llamameta.net/*?Policy...): "
prompt_guard = prompt_guard_model_sku()
for model_id in model_ids:
if model_id == prompt_guard.model_id:
model = prompt_guard
info = prompt_guard_download_info()
else:
model = resolve_model(model_id)
if model is None:
parser.error(f"Model {model_id} not found")
continue
info = llama_meta_net_info(model)
if args.source == "huggingface":
_hf_download(model, args.hf_token, args.ignore_patterns, parser)
else:
meta_url = args.meta_url or input(
f"Please provide the signed URL for model {model_id} you received via email after visiting https://www.llama.com/llama-downloads/ (e.g., https://llama3-1.llamameta.net/*?Policy...): "
)
assert meta_url is not None and "llamameta.net" in meta_url
_meta_download(model, meta_url, info)
assert "llamameta.net" in meta_url
_meta_download(model, meta_url, info)
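# Note on the behavior above (hedged, illustrative ids only): a comma-separated value such as
# "Llama3.1-8B-Instruct,Prompt-Guard-86M" is split and each entry is resolved and downloaded in
# turn; for meta downloads the signed URL is taken from args.meta_url or prompted per model, and
# an id that cannot be resolved is reported via parser.error.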
class ModelEntry(BaseModel):


@@ -1,4 +1,4 @@
name: local
name: local-gpu
distribution_spec:
description: Use code from `llama_stack` itself to serve all llama stack APIs
providers:
@@ -7,4 +7,4 @@ distribution_spec:
safety: meta-reference
agents: meta-reference
telemetry: meta-reference
image_type: conda
image_type: docker


@@ -1,11 +1,11 @@
name: local-gpu
name: local-tgi-chroma
distribution_spec:
description: local meta reference
description: remote tgi inference + chromadb memory
docker_image: null
providers:
inference: meta-reference
inference: remote::tgi
safety: meta-reference
agents: meta-reference
memory: meta-reference
memory: remote::chromadb
telemetry: meta-reference
image_type: docker


@@ -1,16 +1,16 @@
version: '2'
built_at: '2024-10-08T17:42:33.690666'
image_name: local-gpu
docker_image: local-gpu
conda_env: null
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- memory
- inference
- agents
- shields
- safety
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
inference:
- provider_id: meta-reference
@@ -25,8 +25,13 @@ providers:
- provider_id: meta-reference
provider_type: meta-reference
config:
llama_guard_shield: null
prompt_guard_shield: null
llama_guard_shield:
model: Llama-Guard-3-1B
excluded_categories: []
disable_input_check: false
disable_output_check: false
prompt_guard_shield:
model: Prompt-Guard-86M
memory:
- provider_id: meta-reference
provider_type: meta-reference


@@ -1,29 +1,33 @@
version: '2'
built_at: '2024-10-08T17:42:07.505267'
image_name: local-cpu
docker_image: local-cpu
conda_env: null
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- inference
- models
- memory
- safety
- shields
- memory_banks
- inference
- safety
providers:
inference:
- provider_id: remote::ollama
provider_type: remote::ollama
- provider_id: tgi0
provider_type: remote::tgi
config:
host: localhost
port: 6000
url: http://127.0.0.1:5009
safety:
- provider_id: meta-reference
provider_type: meta-reference
config:
llama_guard_shield: null
prompt_guard_shield: null
llama_guard_shield:
model: Llama-Guard-3-1B
excluded_categories: []
disable_input_check: false
disable_output_check: false
prompt_guard_shield:
model: Prompt-Guard-86M
memory:
- provider_id: meta-reference
provider_type: meta-reference


@@ -5,6 +5,7 @@
# the root directory of this source tree.
import itertools
import os
import pytest
import pytest_asyncio
@@ -50,14 +51,17 @@ def get_expected_stop_reason(model: str):
return StopReason.end_of_message if "Llama3.1" in model else StopReason.end_of_turn
if "MODEL_IDS" not in os.environ:
MODEL_IDS = [Llama_8B, Llama_3B]
else:
MODEL_IDS = os.environ["MODEL_IDS"].split(",")
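# Usage note (hedged, illustrative invocation): export MODEL_IDS as a comma-separated list
# before running pytest to override the models under test, e.g.
#   MODEL_IDS="Llama3.1-8B-Instruct,Llama3.2-3B-Instruct" pytest <path to this test file>
# With MODEL_IDS unset, the fixture below falls back to [Llama_8B, Llama_3B].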
# This is going to create multiple Stack impls without tearing down the previous one
# Fix that!
@pytest_asyncio.fixture(
scope="session",
params=[
{"model": Llama_8B},
{"model": Llama_3B},
],
params=[{"model": m} for m in MODEL_IDS],
ids=lambda d: d["model"],
)
async def inference_settings(request):