add more distro templates (#279)
* verify dockers
* together distro verified
* readme
* fireworks distro
* fireworks compose up
* fireworks verified
This commit is contained in:
parent cf27d19dd5
commit 4d2bd2d39e

18 changed files with 265 additions and 42 deletions
@@ -9,3 +9,5 @@ A Distribution is where APIs and Providers are assembled together to provide a c
 | Meta Reference | llamastack/distribution-meta-reference-gpu | [Guide](./meta-reference-gpu/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
 | Ollama | llamastack/distribution-ollama | [Guide](./ollama/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
 | TGI | llamastack/distribution-tgi | [Guide](./tgi/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
+| Together | llamastack/distribution-together | [Guide](./together/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
+| Fireworks | llamastack/distribution-fireworks | [Guide](./fireworks/) | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |

distributions/fireworks/README.md (new file, 55 lines)

# Fireworks Distribution

The `llamastack/distribution-fireworks` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::fireworks | meta-reference | meta-reference | meta-reference | meta-reference |


### Start the Distribution (Single Node CPU)

> [!NOTE]
> This assumes you have a hosted endpoint at Fireworks with an API key.

```
$ cd llama-stack/distribution/fireworks
$ ls
compose.yaml run.yaml
$ docker compose up
```

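Not part of the original README, but a quick way to confirm the stack came up after `docker compose up`: standard Docker Compose status and log commands. The `llamastack` service name comes from the `compose.yaml` added in this commit.

```bash
# Check that the llamastack container is running
docker compose ps

# Follow the server logs; the stack publishes port 5000 per compose.yaml
docker compose logs -f llamastack
```
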
Make sure the inference provider in your `run.yaml` file is pointing to the correct Fireworks server URL endpoint, e.g.:
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```

### (Alternative) llama stack run (Single Node GPU)

```
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-fireworks --yaml_config /root/my-run.yaml
```

Make sure the inference provider in your `run.yaml` file is pointing to the correct Fireworks server URL endpoint, e.g.:
```
inference:
  - provider_id: fireworks
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
      api_key: <optional api key>
```

**Via Conda**

```bash
llama stack build --config ./build.yaml
# -- modify run.yaml to a valid Fireworks server endpoint
llama stack run ./run.yaml
```

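Not from the original README: once the server is running (via Docker or conda), a minimal connectivity check is to hit the published port. This only verifies that something is listening on port 5000; it does not exercise any specific Llama Stack API route.

```bash
# Any HTTP response (even a 404) confirms the server process is up;
# "connection refused" means the stack did not start.
curl -sS -o /dev/null -w "HTTP %{http_code}\n" http://localhost:5000 || echo "server not reachable"
```
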
@@ -7,4 +7,4 @@ distribution_spec:
     safety: meta-reference
     agents: meta-reference
     telemetry: meta-reference
-image_type: conda
+image_type: docker

distributions/fireworks/compose.yaml (new file, 18 lines)

services:
  llamastack:
    image: llamastack/distribution-fireworks
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      # Link to fireworks run.yaml file
      - ./run.yaml:/root/llamastack-run-fireworks.yaml
    ports:
      - "5000:5000"
    # Start the llama stack server against the mounted run.yaml
    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-fireworks.yaml"
    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s

distributions/fireworks/run.yaml (new file, 46 lines)

version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
  inference:
  - provider_id: fireworks0
    provider_type: remote::fireworks
    config:
      url: https://api.fireworks.ai/inference
  safety:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      llama_guard_shield:
        model: Llama-Guard-3-1B
        excluded_categories: []
        disable_input_check: false
        disable_output_check: false
      prompt_guard_shield:
        model: Prompt-Guard-86M
  memory:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}
  agents:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}

@@ -11,13 +11,8 @@ The `llamastack/distribution-meta-reference-gpu` distribution consists of the fo
 ### Start the Distribution (Single Node GPU)
 
 > [!NOTE]
-> This assumes you have access to GPU to start a TGI server with access to your GPU.
+> This assumes you have access to GPU to start a local server with access to your GPU.
 
-> [!NOTE]
-> For GPU inference, you need to set these environment variables for specifying local directory containing your model checkpoints, and enable GPU inference to start running docker container.
-```
-export LLAMA_CHECKPOINT_DIR=~/.llama
-```
 
 > [!NOTE]
 > `~/.llama` should be the path containing downloaded weights of Llama models.

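Not part of the diff: before launching the GPU container it can help to confirm that model weights are actually present under `~/.llama`. The `checkpoints/` subdirectory shown below is the usual layout produced by `llama download`, but treat the exact path as an assumption.

```bash
# List downloaded model weights; adjust the path if your layout differs
ls -lh ~/.llama/checkpoints/ 2>/dev/null || echo "no checkpoints found under ~/.llama"
```
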
@@ -26,8 +21,8 @@ export LLAMA_CHECKPOINT_DIR=~/.llama
 To download and start running a pre-built docker container, you may use the following commands:
 
 ```
-docker run -it -p 5000:5000 -v ~/.llama:/root/.llama --gpus=all llamastack/llamastack-local-gpu
+docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-gpu --yaml_config /root/my-run.yaml
 ```
 
 ### Alternative (Build and start distribution locally via conda)
-- You may checkout the [Getting Started](../../docs/getting_started.md) for more details on starting up a meta-reference distribution.
+- You may checkout the [Getting Started](../../docs/getting_started.md) for more details on building locally via conda and starting up a meta-reference distribution.

@@ -1,4 +1,4 @@
-name: distribution-meta-reference-gpu
+name: meta-reference-gpu
 distribution_spec:
   description: Use code from `llama_stack` itself to serve all llama stack APIs
   providers:

@@ -71,10 +71,10 @@ ollama run <model_id>
 
 **Via Docker**
 ```
-docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./ollama-run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack-local-cpu --yaml_config /root/llamastack-run-ollama.yaml
+docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --gpus=all distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
 ```
 
-Make sure in you `ollama-run.yaml` file, you inference provider is pointing to the correct Ollama endpoint. E.g.
+Make sure the inference provider in your `run.yaml` file is pointing to the correct Ollama endpoint. E.g.
 ```
 inference:
   - provider_id: ollama0

@@ -1,4 +1,4 @@
-name: distribution-ollama
+name: ollama
 distribution_spec:
   description: Use ollama for running LLM inference
   providers:

@@ -10,4 +10,4 @@ distribution_spec:
     safety: meta-reference
     agents: meta-reference
     telemetry: meta-reference
-image_type: conda
+image_type: docker

@@ -33,7 +33,7 @@ services:
     volumes:
       - ~/.llama:/root/.llama
       # Link to ollama run.yaml file
-      - ./ollama-run.yaml:/root/llamastack-run-ollama.yaml
+      - ./run.yaml:/root/llamastack-run-ollama.yaml
     ports:
       - "5000:5000"
     # Hack: wait for ollama server to start before starting docker

@@ -1,4 +1,4 @@
-name: distribution-tgi
+name: tgi
 distribution_spec:
   description: Use TGI for running LLM inference
   providers:

@@ -10,4 +10,4 @@ distribution_spec:
     safety: meta-reference
     agents: meta-reference
     telemetry: meta-reference
-image_type: conda
+image_type: docker

@@ -6,28 +6,7 @@ services:
       - $HOME/.cache/huggingface:/data
     ports:
       - "5009:5009"
-    devices:
-      - nvidia.com/gpu=all
-    environment:
-      - CUDA_VISIBLE_DEVICES=0
-      - HF_HOME=/data
-      - HF_DATASETS_CACHE=/data
-      - HF_MODULES_CACHE=/data
-      - HF_HUB_CACHE=/data
     command: ["--dtype", "bfloat16", "--usage-stats", "on", "--sharded", "false", "--model-id", "meta-llama/Llama-3.1-8B-Instruct", "--port", "5009", "--cuda-memory-fraction", "0.3"]
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              # that's the closest analogue to --gpus; provide
-              # an integer amount of devices or 'all'
-              count: 1
-              # Devices are reserved using a list of capabilities, making
-              # capabilities the only required field. A device MUST
-              # satisfy all the requested capabilities for a successful
-              # reservation.
-              capabilities: [gpu]
     runtime: nvidia
     healthcheck:
       test: ["CMD", "curl", "-f", "http://text-generation-inference:5009/health"]

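An editorial aside, not part of the diff: with the explicit device reservations replaced by `runtime: nvidia`, a quick way to confirm the TGI container still sees the GPU is to exec into it. The `text-generation-inference` service name is assumed from the healthcheck URL above.

```bash
# Verify GPU visibility inside the running TGI container (assumed service name)
docker compose exec text-generation-inference nvidia-smi
```
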
distributions/together/README.md (new file, 68 lines)

# Together Distribution

### Connect to a Llama Stack Together Endpoint
- You may connect to a hosted endpoint `https://llama-stack.together.ai`, serving a Llama Stack distribution (a quick reachability check is sketched below).
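
Not part of the original README: a minimal reachability check for the hosted endpoint. This only confirms the host answers over HTTPS; it does not exercise any specific Llama Stack API route.

```bash
# Print the HTTP status returned by the hosted endpoint (any response means it is reachable)
curl -sS -o /dev/null -w "HTTP %{http_code}\n" https://llama-stack.together.ai
```
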
The `llamastack/distribution-together` distribution consists of the following provider configurations.

| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | remote::together | meta-reference | remote::weaviate | meta-reference | meta-reference |

### Start the Distribution (Single Node CPU)

> [!NOTE]
> This assumes you have a hosted endpoint at Together with an API key.

```
$ cd llama-stack/distribution/together
$ ls
compose.yaml run.yaml
$ docker compose up
```

Make sure the inference provider in your `run.yaml` file is pointing to the correct Together server URL endpoint, e.g.:
```
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```

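Not in the original README: because `compose.yaml` mounts `run.yaml` into the container, edits to it take effect the next time the container starts. A standard Docker Compose way to apply them (the `llamastack` service name comes from the compose.yaml added in this commit):

```bash
# Recreate the service so the edited run.yaml is picked up
docker compose up -d --force-recreate llamastack
```
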
### (Alternative) llama stack run (Single Node GPU)

```
docker run --network host -it -p 5000:5000 -v ./run.yaml:/root/my-run.yaml --gpus=all llamastack/distribution-together --yaml_config /root/my-run.yaml
```

Make sure the inference provider in your `run.yaml` file is pointing to the correct Together server URL endpoint, e.g.:
```
inference:
  - provider_id: together
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
      api_key: <optional api key>
```

The Together distribution ships with Weaviate as its memory provider. You also need to configure the remote Weaviate API key and cluster URL in `run.yaml` to use the memory API:
```
memory:
  - provider_id: meta0
    provider_type: remote::weaviate
    config:
      weaviate_api_key: <ENTER_WEAVIATE_API_KEY>
      weaviate_cluster_url: <ENTER_WEAVIATE_CLUSTER_URL>
```

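Not part of the original README: before wiring the credentials into `run.yaml`, you can check that the Weaviate cluster accepts them. Weaviate exposes a readiness route at `/v1/.well-known/ready`; treat the exact path and auth header as assumptions to verify against your Weaviate version.

```bash
# Expect HTTP 200 when the cluster is reachable and the API key is accepted
curl -sS -o /dev/null -w "HTTP %{http_code}\n" \
  -H "Authorization: Bearer <ENTER_WEAVIATE_API_KEY>" \
  "https://<ENTER_WEAVIATE_CLUSTER_URL>/v1/.well-known/ready"
```
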
**Via Conda**

```bash
llama stack build --config ./build.yaml
# -- modify run.yaml to a valid Together server endpoint
llama stack run ./run.yaml
```

@@ -3,8 +3,8 @@ distribution_spec:
   description: Use Together.ai for running LLM inference
   providers:
     inference: remote::together
-    memory: meta-reference
+    memory: remote::weaviate
     safety: remote::together
     agents: meta-reference
     telemetry: meta-reference
-image_type: conda
+image_type: docker

distributions/together/compose.yaml (new file, 18 lines)

services:
  llamastack:
    image: llamastack/distribution-together
    network_mode: "host"
    volumes:
      - ~/.llama:/root/.llama
      # Link to together run.yaml file
      - ./run.yaml:/root/llamastack-run-together.yaml
    ports:
      - "5000:5000"
    # Start the llama stack server against the mounted run.yaml
    entrypoint: bash -c "python -m llama_stack.distribution.server.server --yaml_config /root/llamastack-run-together.yaml"
    deploy:
      restart_policy:
        condition: on-failure
        delay: 3s
        max_attempts: 5
        window: 60s

distributions/together/run.yaml (new file, 42 lines)

version: '2'
built_at: '2024-10-08T17:40:45.325529'
image_name: local
docker_image: null
conda_env: local
apis:
- shields
- agents
- models
- memory
- memory_banks
- inference
- safety
providers:
  inference:
  - provider_id: together0
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
  safety:
  - provider_id: together0
    provider_type: remote::together
    config:
      url: https://api.together.xyz/v1
  memory:
  - provider_id: meta0
    provider_type: remote::weaviate
    config:
      weaviate_api_key: <ENTER_WEAVIATE_API_KEY>
      weaviate_cluster_url: <ENTER_WEAVIATE_CLUSTER_URL>
  agents:
  - provider_id: meta0
    provider_type: meta-reference
    config:
      persistence_store:
        namespace: null
        type: sqlite
        db_path: ~/.llama/runtime/kvstore.db
  telemetry:
  - provider_id: meta0
    provider_type: meta-reference
    config: {}

@@ -15,7 +15,7 @@ special_pip_deps="$6"
 set -euo pipefail
 
 build_name="$1"
-image_name="llamastack-$build_name"
+image_name="distribution-$build_name"
 docker_base=$2
 build_file_path=$3
 host_build_dir=$4

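An illustration (not part of the diff) of what the rename means for the produced image tag, which now lines up with the `llamastack/distribution-*` images referenced in the compose files above:

```bash
# With the new naming scheme, building e.g. the fireworks distro yields:
build_name="fireworks"
image_name="distribution-$build_name"
echo "$image_name"   # -> distribution-fireworks
```
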
@@ -55,7 +55,7 @@ def available_providers() -> List[ProviderSpec]:
         api=Api.inference,
         adapter=AdapterSpec(
             adapter_type="ollama",
-            pip_packages=["ollama"],
+            pip_packages=["ollama", "aiohttp"],
             config_class="llama_stack.providers.adapters.inference.ollama.OllamaImplConfig",
             module="llama_stack.providers.adapters.inference.ollama",
         ),