Update Fireworks + Together documentation

2025-12-17 16:42:44 +00:00 · 2024-11-18 12:52:23 -08:00 · a562668dcd
commit a562668dcd
parent 1ecaf2cb3c
27 changed files with 879 additions and 445 deletions
--- a/llama_stack/templates/ollama/doc_template.md
+++ b/llama_stack/templates/ollama/doc_template.md
@@ -6,103 +6,106 @@ The `llamastack/distribution-{{ name }}` distribution consists of the following

 You should use this distribution if you have a regular desktop machine without very powerful GPUs. Of course, if you have powerful GPUs, you can still continue using this distribution since Ollama supports GPU acceleration.

-{%- if docker_compose_env_vars %}
+{%- if run_config_env_vars %}
 ### Environment Variables

 The following environment variables can be configured:

-{% for var, (default_value, description) in docker_compose_env_vars.items() %}
+{% for var, (default_value, description) in run_config_env_vars.items() %}
 - `{{ var }}`: {{ description }} (default: `{{ default_value }}`)
 {% endfor %}
 {% endif %}

-{%- if default_models %}
-### Models

-The following models are configured by default:
-{% for model in default_models %}
- `{{ model.model_id }}`
-{% endfor %}
-{% endif %}
+## Setting up Ollama server

-## Using Docker Compose
+Please check the [Ollama Documentation](https://github.com/ollama/ollama) on how to install and run Ollama. After installing Ollama, you need to run `ollama serve` to start the server.

-You can use `docker compose` to start a Ollama server and connect with Llama Stack server in a single command.
+In order to load models, you can run:

 ```bash
-$ cd distributions/{{ name }}; docker compose up
+export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
+
+# ollama names this model differently, and we must use the ollama name when loading the model
+export OLLAMA_INFERENCE_MODEL="llama3.2:3b-instruct-fp16"
+ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
 ```

-You will see outputs similar to following ---
+If you are using Llama Stack Safety / Shield APIs, you will also need to pull and run the safety model.
+
 ```bash
-[ollama]               | [GIN] 2024/10/18 - 21:19:41 | 200 |     226.841µs |             ::1 | GET      "/api/ps"
-[ollama]               | [GIN] 2024/10/18 - 21:19:42 | 200 |      60.908µs |             ::1 | GET      "/api/ps"
-INFO:     Started server process [1]
-INFO:     Waiting for application startup.
-INFO:     Application startup complete.
-INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
-[llamastack] | Resolved 12 providers
-[llamastack] |  inner-inference => ollama0
-[llamastack] |  models => __routing_table__
-[llamastack] |  inference => __autorouted__
+export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
+
+# ollama names this model differently, and we must use the ollama name when loading the model
+export OLLAMA_SAFETY_MODEL="llama-guard3:1b"
+ollama run $OLLAMA_SAFETY_MODEL --keepalive 60m
 ```

-To kill the server
+## Running Llama Stack
+
+Now you are ready to run Llama Stack with Ollama as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
+
+### Via Docker
+
+This method allows you to get started quickly without having to build the distribution code.
+
 ```bash
-docker compose down
+LLAMA_STACK_PORT=5001
+docker run \
+  -it \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  -v ~/.llama:/root/.llama \
+  -v ./run.yaml:/root/my-run.yaml \
+  --gpus=all \
+  llamastack/distribution-{{ name }} \
+  /root/my-run.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL \
+  --env OLLAMA_URL=http://host.docker.internal:11434
 ```

-## Starting Ollama and Llama Stack separately
+If you are using Llama Stack Safety / Shield APIs, use:

-If you wish to separately spin up a Ollama server, and connect with Llama Stack, you should use the following commands.
-
-#### Start Ollama server
- Please check the [Ollama Documentation](https://github.com/ollama/ollama) for more details.
-
-**Via Docker**
 ```bash
-docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
+docker run \
+  -it \
+  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
+  -v ~/.llama:/root/.llama \
+  -v ./run-with-safety.yaml:/root/my-run.yaml \
+  --gpus=all \
+  llamastack/distribution-{{ name }} \
+  /root/my-run.yaml \
+  --port $LLAMA_STACK_PORT \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL \
+  --env SAFETY_MODEL=$SAFETY_MODEL \
+  --env OLLAMA_URL=http://host.docker.internal:11434
 ```

-**Via CLI**
-```bash
-ollama run <model_id>
-```
+### Via Conda

-#### Start Llama Stack server pointing to Ollama server
-
-**Via Conda**
+Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.

 ```bash
 llama stack build --template ollama --image-type conda
-llama stack run run.yaml
+llama stack run ./run.yaml \
+  --port 5001 \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL \
+  --env OLLAMA_URL=http://127.0.0.1:11434
 ```

-**Via Docker**
-```
-docker run --network host -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml --gpus=all llamastack/distribution-ollama --yaml_config /root/llamastack-run-ollama.yaml
-```
-
-Make sure in your `run.yaml` file, your inference provider is pointing to the correct Ollama endpoint. E.g.
-```yaml
-inference:
-  - provider_id: ollama0
-    provider_type: remote::ollama
-    config:
-      url: http://127.0.0.1:14343
-```
-
-### (Optional) Update Model Serving Configuration
-
-#### Downloading model via Ollama
-
-You can use ollama for managing model downloads.
+If you are using Llama Stack Safety / Shield APIs, use:

 ```bash
-ollama pull llama3.1:8b-instruct-fp16
-ollama pull llama3.1:70b-instruct-fp16
+llama stack run ./run-with-safety.yaml \
+  --port 5001 \
+  --env INFERENCE_MODEL=$INFERENCE_MODEL \
+  --env SAFETY_MODEL=$SAFETY_MODEL \
+  --env OLLAMA_URL=http://127.0.0.1:11434
 ```

+
+### (Optional) Update Model Serving Configuration
+
 > [!NOTE]
 > Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers.remote/inference/ollama/ollama.py) for the supported Ollama models.

--- a/llama_stack/templates/ollama/ollama.py
+++ b/llama_stack/templates/ollama/ollama.py
@@ -68,17 +68,17 @@ def get_distribution_template() -> DistributionTemplate:
                "5001",
                "Port for the Llama Stack distribution server",
            ),
+            "OLLAMA_URL": (
+                "http://127.0.0.1:11434",
+                "URL of the Ollama server",
+            ),
            "INFERENCE_MODEL": (
                "meta-llama/Llama-3.2-3B-Instruct",
-                "Inference model loaded into the TGI server",
-            ),
-            "OLLAMA_URL": (
-                "http://host.docker.internal:11434",
-                "URL of the Ollama server",
+                "Inference model loaded into the Ollama server",
            ),
            "SAFETY_MODEL": (
                "meta-llama/Llama-Guard-3-1B",
-                "Name of the safety (Llama-Guard) model to use",
+                "Safety model loaded into the Ollama server",
            ),
        },
    )
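
Review note: the two hunks above are coupled. The `ollama.py` mapping of variable name to `(default_value, description)` is what the `{% for var, (default_value, description) in run_config_env_vars.items() %}` loop in `doc_template.md` iterates over. The sketch below reproduces that rendering in plain Python, without Jinja, to show what the generated "Environment Variables" list should look like after this commit. It is an illustration under assumptions: `render_env_vars` is a hypothetical helper that does not exist in `llama_stack`, and the port entry is left out because its key is not visible in the hunk.

```python
# Sketch only: entries are copied from the ollama.py hunk in this commit.
# render_env_vars is a hypothetical helper, not part of llama_stack.
RUN_CONFIG_ENV_VARS = {
    "OLLAMA_URL": (
        "http://127.0.0.1:11434",
        "URL of the Ollama server",
    ),
    "INFERENCE_MODEL": (
        "meta-llama/Llama-3.2-3B-Instruct",
        "Inference model loaded into the Ollama server",
    ),
    "SAFETY_MODEL": (
        "meta-llama/Llama-Guard-3-1B",
        "Safety model loaded into the Ollama server",
    ),
}


def render_env_vars(env_vars):
    """Render one markdown bullet per variable, in the template's line format."""
    lines = []
    # Same unpacking as the template's for-loop: name -> (default, description).
    for var, (default_value, description) in env_vars.items():
        lines.append(f"- `{var}`: {description} (default: `{default_value}`)")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_env_vars(RUN_CONFIG_ENV_VARS))
```

One point this makes visible: the documented default for `OLLAMA_URL` is now `http://127.0.0.1:11434`, so the Docker commands in the template have to pass `--env OLLAMA_URL=http://host.docker.internal:11434` explicitly, which they do.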