adjusted based on latest feedback

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>

@@ -10,9 +10,12 @@ Llama Stack is a stateful service with REST APIs to support seamless transition
In this guide, we'll walk through how to build a RAG agent locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](../providers/index.md#inference) for a Llama Model.
```{admonition} Note
:class: tip
These instructions are for a local setup of Llama Stack using Ollama as the inference provider.
```
## Step 1: Installation and Setup
### i. Install and Start Ollama for Inference
Install Ollama by following the instructions on the [Ollama website](https://ollama.com/download).
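Once Ollama is installed, it helps to pull the model before starting the stack so the first request doesn't stall on a download. A minimal sketch, assuming the `llama3.2:3b` tag used later in this guide:
```bash
# Pull the model used later in this guide (tag assumed to match INFERENCE_MODEL below)
ollama pull llama3.2:3b

# Optionally keep the model loaded in memory while you work through the guide;
# type /bye to leave the interactive prompt (the model stays loaded for 60 minutes)
ollama run llama3.2:3b --keepalive 60m
```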
@@ -60,11 +63,14 @@ uv pip install llama-stack
```
Note the Llama Stack Server includes the client SDK as well.
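As a quick sanity check, you can confirm that both the server CLI and the bundled client SDK are available in the environment you just installed into (a minimal sketch, assuming the install above succeeded):
```bash
# The `llama` CLI is installed by the llama-stack package
llama stack --help

# The client SDK ships alongside it
python -c "import llama_stack_client; print('llama-stack-client is available')"
```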
## Step 3: Build and Run the Llama Stack Server
Llama Stack uses a [configuration file](../distributions/configuration.md) to define the stack.
The config file is a YAML file that specifies the providers and their configurations.
### i. Build and Run the Llama Stack Config for Ollama
::::{tab-set}
:::{tab-item} Using Python
```bash
INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
```
@@ -73,18 +79,94 @@ You will see output like below:
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
:::
:::{tab-item} Using a Container
To get started quickly, we provide various container images for the server component that work with different inference
providers out of the box. For this guide, we will use `llamastack/distribution-ollama` as the container image. If you'd
like to build your own image or customize the configurations, please check out [this guide](../references/index.md).
Let's set up some environment variables and create a local directory to mount into the container's file system.
```bash
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
export LLAMA_STACK_PORT=8321
mkdir -p ~/.llama
```
Then start the server using the container tool of your choice. For example, if you are running Docker, you can use the following command:
```bash
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
```
As another example, to start the container with Podman, you can do the same but replace `docker` at the start of the command with `podman`. If you are using `podman` older than `4.7.0`, please also replace `host.docker.internal` in the `OLLAMA_URL` with `host.containers.internal`.
Configuration for this is available at `distributions/ollama/run.yaml`.
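If you want to customize that configuration, one approach is to mount your edited `run.yaml` into the container and point the server at it. This is only a sketch: the mount path and the `--yaml-config` flag are assumptions that may differ between image versions, so check the image's `--help` output first.
```bash
# Hypothetical example: run the container against a locally edited run.yaml
# (the --yaml-config flag and /root/my-run.yaml path are assumptions; verify with --help)
docker run -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -v ./run.yaml:/root/my-run.yaml \
  llamastack/distribution-ollama \
  --yaml-config /root/my-run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://host.docker.internal:11434
```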
```{admonition} Note
:class: note
Docker containers run in their own isolated network namespaces on Linux. To allow the container to communicate with services running on the host via `localhost`, you need `--network=host`. This makes the container use the host's network directly so it can connect to Ollama running on `localhost:11434`.
```
Linux users having issues running the above command should instead try the following:
```bash
docker run -it \
--pull always \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--network=host \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
```
:::
::::
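However you started the server, it's worth confirming it is reachable before moving on. A minimal check, assuming the default port above and that the health endpoint is served at `/v1/health`:
```bash
# Should return a small JSON payload (e.g. {"status": "OK"}) once the server is ready
curl http://localhost:8321/v1/health
```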
### ii. Using the Llama Stack Client
Now you can use the Llama Stack client to run inference and build agents!
You can reuse the server setup or use the [Llama Stack Client](https://github.com/meta-llama/llama-stack-client-python/).
Note that the client package is already included in the `llama-stack` package.
Open a new terminal and navigate to the same directory you started the server from. Then either activate the
virtual environment you used for the server or set up a new one.
::::{tab-set}
:::{tab-item} Reuse the Server Setup
```bash
source .venv/bin/activate
```
:::
:::{tab-item} Install the Llama Stack Client (venv)
```bash
uv venv client --python 3.10
source client/bin/activate
uv pip install llama-stack-client
```
:::
:::{tab-item} Install the Llama Stack Client (conda)
```bash
yes | conda create -n stack-client python=3.10
conda activate stack-client
pip install llama-stack-client
```
:::
::::
Now let's use the `llama-stack-client` CLI to check the connectivity to the server.
```bash
@@ -95,7 +177,7 @@ You will see the below:
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
```
#### iii. List Available Models
List the models
```
llama-stack-client models list
@@ -163,6 +245,7 @@ response = client.inference.chat_completion(
)
print(response.completion_message.content)
```
### ii. Run the Script
Let's run the script using `uv`
```bash
@@ -432,7 +515,8 @@ Let's run the script using `uv`
```bash
uv run python rag_agent.py
```
:::{dropdown} `👋 Click here to see the sample output`
```
user> what is torchtune
inference> [knowledge_search(query='TorchTune')]
@@ -446,6 +530,7 @@ PyTorch Tune provides a recipe for LoRA (Low-Rank Adaptation) finetuning, which
...
Overall, DORA is a powerful reinforcement learning algorithm that can learn complex tasks from human demonstrations. However, it requires careful consideration of the challenges and limitations to achieve optimal results.
```
:::
Congrats! 🥳 Now you're ready to build your own Llama Stack applications! 🚀