Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Francisco Javier Arceo 2025-04-06 22:54:40 -04:00
parent f2b0c282ed
commit 1639fd8b75


@@ -1,100 +1,110 @@
# Quick Start

In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple RAG agent.
A Llama Stack agent is a simple integrated system that can perform tasks by combining a Llama model for reasoning with
tools (e.g., RAG, web search, code execution, etc.) for taking actions.

In Llama Stack, we provide a server exposing multiple APIs. These APIs are backed by implementations from different providers.
Llama Stack is a stateful service with REST APIs to support seamless transition of AI applications across different environments. The server can be run in a variety of ways, including as a standalone binary, Docker container, or hosted service. You can build and test using a local server first and deploy to a hosted endpoint for production.

We'll build the RAG agent locally using Llama Stack with [Ollama](https://ollama.com/)
as the inference [provider](../providers/index.md#inference) for a Llama Model.
## Step 1: Installation and Setup
### i. Install and Start Ollama for Inference

Install Ollama by following the instructions on the [Ollama website](https://ollama.com/download).

To start Ollama, run:

```bash
ollama run llama3.2:3b --keepalive 60m
```

By default, Ollama keeps the model loaded in memory for 5 minutes, which can be too short. We set the `--keepalive` flag to 60 minutes to ensure the model remains loaded for some time.
### ii. Install `uv` to Manage your Python Packages

Install [uv](https://docs.astral.sh/uv/) to set up your virtual environment.

::::{tab-set}

:::{tab-item} macOS and Linux
Use `curl` to download the script and execute it with `sh`:
```console
curl -LsSf https://astral.sh/uv/install.sh | sh
```
:::

:::{tab-item} Windows
Use `irm` to download the script and execute it with `iex`:
```console
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
:::
::::
### iii. Set up your Virtual Environment

```bash
uv venv --python 3.10
source .venv/bin/activate
```

## Step 2: Install Llama Stack
Llama Stack is a server that exposes multiple APIs; you connect to it using the Llama Stack client SDK.
### Install the Llama Stack Server

```bash
uv pip install llama-stack
```

### Install the Llama Stack Client
```bash
uv pip install llama-stack-client
```
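To see how the two packages fit together: `llama-stack` provides the server and the `llama` CLI used below, while `llama-stack-client` provides the `llama-stack-client` CLI and the Python SDK your application code uses. A minimal sketch of what connecting from Python looks like (it assumes the server from Step 3 is already running on its default port, 8321):

```python
# Minimal sketch: assumes the Llama Stack server from Step 3 is running on port 8321.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
# The client exposes the server's APIs, e.g. client.models, client.inference, client.agents.
```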
## Step 3: Build and Run Llama Stack
Llama Stack uses a [configuration file](../distributions/configuration.md) to define the stack.
The config file is a YAML file that specifies the providers and their configurations.
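Since the config is plain YAML, you can open it in an editor or inspect it programmatically. A minimal sketch (the path is a placeholder; `llama stack build` typically writes the run config under `~/.llama/distributions/<name>/`):

```python
# Illustrative sketch: inspect which APIs and providers a run config declares.
# Replace "run.yaml" with the path printed by `llama stack build` on your machine.
import yaml

with open("run.yaml") as f:
    config = yaml.safe_load(f)

print("APIs:", config.get("apis", []))
print("Providers:", list(config.get("providers", {}).keys()))
```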
### i. Build and Run the Llama Stack Config for Ollama
```bash
INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
```

You will see output like the following:

```
...
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
### ii. Using the Llama Stack Client
Now you can use the Llama Stack client to run inference and build agents!
:::{dropdown} You can reuse the server's virtual environment for the client

Open a new terminal and navigate to the same directory you started the server from.

Set up the venv (llama-stack already includes the client package):

```bash
source .venv/bin/activate
```
:::

Let's use the `llama-stack-client` CLI to check connectivity to the server:

```bash
llama-stack-client configure --endpoint http://localhost:8321 --api-key none
```

You will see output like the following:

```
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
```
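The same connectivity check can be done from Python with the SDK installed in Step 2; the Python client is pointed at the server URL directly. A minimal sketch, again assuming the default port 8321:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List the models registered with the server; this should mirror the CLI output below.
for model in client.models.list():
    print(model.identifier)
```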
#### iii. List available models

List the models registered with the server:

```bash
llama-stack-client models list
```

```
Available Models
@@ -110,7 +120,9 @@ Total models: 2
```
## Step 4: Run Inference with Llama Stack
You can test basic Llama inference completion using the CLI too.
```bash
llama-stack-client inference chat-completion --message "tell me a joke"
```
@@ -132,19 +144,6 @@ ChatCompletionResponse(
)
```
#### 4.1 Basic Inference

Create a file `inference.py` and add the following code:

```python
@@ -170,11 +169,11 @@ response = client.inference.chat_completion(
)
print(response.completion_message.content)
```

Let's run the script using `uv`:

```bash
uv run python inference.py
```

Which will produce output like:

```
Model: llama3.2:3b-instruct-fp16
Here is a haiku about coding:
@@ -226,9 +225,9 @@ for event in AgentEventLogger().log(stream):
    event.print()
```

Let's run the script using `uv`:

```bash
uv run python agent.py
```

:::{dropdown} `Sample output`
@@ -419,19 +418,23 @@ ragagent = Agent(
s_id = ragagent.create_session(session_name=f"s{uuid.uuid4().hex}")

user_prompts = [
    "How to optimize memory usage in torchtune? use the knowledge_search tool to get information.",
]

# Run the agent loop by calling the `create_turn` method
for prompt in user_prompts:
    cprint(f"User> {prompt}", "green")
    response = ragagent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=s_id,
    )
    for event in AgentEventLogger().log(response):
        event.print()
```
Let's run the script using `uv`:

```bash
uv run python rag_agent.py
```
:::{dropdown} `Sample output`
```
@@ -451,5 +454,7 @@ Overall, DORA is a powerful reinforcement learning algorithm that can learn comp
## Next Steps

- Go through the [Getting Started Notebook](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)
- Check out more [Notebooks on GitHub](https://github.com/meta-llama/llama-stack/tree/main/docs/notebooks)
- Learn more about Llama Stack [Concepts](../concepts/index.md)
- Learn how to [Build Llama Stacks](../distributions/index.md)
- See [References](../references/index.md) for more details about the `llama` CLI and Python SDK
- For example applications and more detailed tutorials, visit our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository.