
Ollama Distribution

The llamastack/distribution-ollama distribution consists of the following provider configurations.

API         | Inference      | Agents         | Memory                           | Safety         | Telemetry
Provider(s) | remote::ollama | meta-reference | remote::pgvector, remote::chroma | remote::ollama | meta-reference
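
Each API above is backed by a provider entry in the distribution's run.yaml. As an illustration, a memory provider backed by remote::pgvector might be configured roughly as follows; this is a hypothetical sketch, and the provider_id and connection field names are assumptions that may differ by Llama Stack version:

memory:
  - provider_id: pgvector0           # illustrative name, not taken from this repo
    provider_type: remote::pgvector
    config:                          # connection fields are assumptions; check your version's provider schema
      host: 127.0.0.1
      port: 5432
      db: llamastack
      user: postgres
      password: <your-password>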

Start a Distribution (Single Node GPU)

Note

This assumes you have access to a GPU so that the Ollama server can run with GPU acceleration.

$ cd llama-stack/distributions/ollama/gpu
$ ls
compose.yaml  run.yaml
$ docker compose up

You should see output similar to the following:

[ollama]               | [GIN] 2024/10/18 - 21:19:41 | 200 |     226.841µs |             ::1 | GET      "/api/ps"
[ollama]               | [GIN] 2024/10/18 - 21:19:42 | 200 |      60.908µs |             ::1 | GET      "/api/ps"
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
[llamastack] | Resolved 12 providers
[llamastack] |  inner-inference => ollama0
[llamastack] |  models => __routing_table__
[llamastack] |  inference => __autorouted__
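
Once both containers are up, you can sanity-check the Ollama side from the host. This assumes the compose file publishes Ollama's default port 11434 on the host; /api/ps is the same endpoint seen in the logs above and lists the models currently loaded:

# List models currently loaded by Ollama (assumes port 11434 is published to the host)
curl http://localhost:11434/api/ps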

To kill the server

docker compose down

Start the Distribution (Single Node CPU)

Note

This starts an Ollama server in CPU-only mode; please see the Ollama documentation for details on serving models on CPU.

$ cd llama-stack/distributions/ollama/cpu
$ ls
compose.yaml  run.yaml
$ docker compose up
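
Before sending inference requests, the Ollama container needs a local copy of the model you plan to serve. A minimal sketch, assuming the compose service is named ollama and using an illustrative model tag (pick whichever tag matches your run.yaml):

# Pull a model inside the running Ollama container.
# "ollama" is the assumed compose service name; the model tag below is only an example.
docker compose exec ollama ollama pull llama3.1:8b-instruct-fp16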

(Alternative) ollama run + llama stack run

If you wish to spin up an Ollama server separately and connect it to Llama Stack, you may use the following commands.

Start Ollama server.

Via Docker

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
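
To confirm the container is serving, you can hit Ollama's HTTP API on the published port; /api/tags lists the models available locally:

# Verify the Ollama server is reachable and list locally available models
curl http://localhost:11434/api/tags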

Via CLI

ollama run <model_id>
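
For example, with an illustrative model tag (replace it with whichever model your Llama Stack configuration expects):

# Example only: the tag is illustrative, not mandated by this distribution
ollama run llama3.1:8b-instruct-fp16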

Start Llama Stack server pointing to Ollama server

Via Docker

docker run --network host -it -p 5000:5000 \
  -v ~/.llama:/root/.llama \
  -v ./gpu/run.yaml:/root/llamastack-run-ollama.yaml \
  --gpus=all \
  distribution-ollama \
  --yaml_config /root/llamastack-run-ollama.yaml
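
If the host has no GPU, a sketch of the same command for the CPU configuration simply drops --gpus=all and mounts ./cpu/run.yaml instead (otherwise unchanged):

docker run --network host -it -p 5000:5000 \
  -v ~/.llama:/root/.llama \
  -v ./cpu/run.yaml:/root/llamastack-run-ollama.yaml \
  distribution-ollama \
  --yaml_config /root/llamastack-run-ollama.yaml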

Make sure that in your run.yaml file, the inference provider points to the correct Ollama endpoint, e.g.:

inference:
  - provider_id: ollama0
    provider_type: remote::ollama
    config:
      url: http://127.0.0.1:11434

Via Conda

llama stack build --config ./build.yaml
llama stack run ./gpu/run.yaml
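
However you start it (Docker or Conda), the Llama Stack server listens on port 5000, as shown in the Uvicorn log above. A quick way to confirm it is up is to probe that port; the exact API routes vary by version, so any HTTP response indicates the server is listening:

# Probe the Llama Stack server port; -v shows the HTTP status even if the root path is not a valid route
curl -v http://localhost:5000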