Sumanth Kamenani 577ec382e1
fix(docs): update Agents101 notebook for builtin websearch (#2591)
- Switch from BRAVE_SEARCH_API_KEY to TAVILY_SEARCH_API_KEY
- Add provider_data to LlamaStackClient for API key passing
- Use builtin::websearch toolgroup instead of manual tool config
- Fix message types to use UserMessage instead of plain dict
- Add streaming support with proper type casting
- Remove async from EventLogger loop (bug fix)

Fixes websearch functionality in agents tutorial by properly configuring
Tavily search provider integration.
# What does this PR do?

Fixes the Agents101 tutorial notebook to work with the current Llama
Stack websearch implementation. The tutorial was using outdated Brave
Search configuration that no longer works with the current server setup.

**Key Changes:**
- **Switch API provider**: Change from `BRAVE_SEARCH_API_KEY` to
`TAVILY_SEARCH_API_KEY` to match server configuration
- **Fix client setup**: Add `provider_data` to `LlamaStackClient` to
properly pass API keys to server
- **Modernize tool usage**: Replace manual tool configuration with
`tools=["builtin::websearch"]`
- **Fix type safety**: Use `UserMessage` type instead of plain
dictionaries for messages
- **Fix streaming**: Add proper streaming support with `stream=True` and
type casting
- **Fix EventLogger**: Remove incorrect `async for` usage (should be
`for`)

**Why needed:** Users following the tutorial were getting 401
Unauthorized errors because the notebook wasn't properly configured for
the Tavily search provider that the server actually uses.
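
For reference, the updated agent cell follows roughly the pattern below. This is a minimal sketch, assuming the current `llama-stack-client` Agent SDK and that the server reads the Tavily key from `provider_data` under a `tavily_search_api_key` field; adjust names to match your setup.

```python
import os

from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types import UserMessage

# Pass the Tavily API key via provider_data so the server-side websearch
# provider can authenticate (field name assumed from the server config).
client = LlamaStackClient(
    base_url="http://localhost:8321",
    provider_data={"tavily_search_api_key": os.environ["TAVILY_SEARCH_API_KEY"]},
)

agent = Agent(
    client,
    model=os.environ["INFERENCE_MODEL"],
    instructions="You are a helpful assistant that can search the web.",
    tools=["builtin::websearch"],  # replaces the old manual tool configuration
)

session_id = agent.create_session("websearch-demo")

response = agent.create_turn(
    messages=[UserMessage(role="user", content="What are the top 3 places to visit in Switzerland?")],
    session_id=session_id,
    stream=True,
)

# EventLogger.log() returns a regular generator, so use a plain for loop
# (not async for).
for log in EventLogger().log(response):
    log.print()
```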

## Test Plan

**Prerequisites:**
1. Start Llama Stack server with Ollama template and
`TAVILY_SEARCH_API_KEY` environment variable
2. Set `TAVILY_SEARCH_API_KEY` in your `.env` file

**Testing Steps:**
1. **Clone and setup:**
   ```bash
   git checkout fix-2558-update-agents101
   cd docs/zero_to_hero_guide/
   ```

2. **Start server with API key:**
   ```bash
   export TAVILY_SEARCH_API_KEY="your_tavily_api_key"
   podman run -it --network=host -v ~/.llama:/root/.llama:Z \
     --env INFERENCE_MODEL=$INFERENCE_MODEL \
     --env OLLAMA_URL=http://localhost:11434 \
     --env TAVILY_SEARCH_API_KEY=$TAVILY_SEARCH_API_KEY \
     llamastack/distribution-ollama --port $LLAMA_STACK_PORT
   ```

3. **Run the notebook:**
   - Open `07_Agents101.ipynb` in Jupyter
   - Execute all cells in order
   - Cell 5 should run without errors and show successful web search results

**Expected Results:**
- No 401 Unauthorized errors
- Agent successfully calls `brave_search.call()` with web results
- Switzerland travel recommendations appear in output
- Follow-up questions work correctly

**Before this fix:** Users got `401 Unauthorized` errors and tutorial
failed
**After this fix:** Tutorial works end-to-end with proper web search
functionality

**Tested with:**
- Tavily API key (free tier)
- Ollama distribution template  
- Llama-3.2-3B-Instruct model
2025-07-03 11:14:51 +02:00
| File | Last commit | Date |
|------|-------------|------|
| .env.template | Docs improvement v3 (#433) | 2024-11-22 15:43:31 -08:00 |
| 00_Inference101.ipynb | fix: update zero-to-hero guide for modern llama stack (#2555) | 2025-06-30 18:09:33 -07:00 |
| 01_Local_Cloud_Inference101.ipynb | fix: update zero-to-hero guide for modern llama stack (#2555) | 2025-06-30 18:09:33 -07:00 |
| 02_Prompt_Engineering101.ipynb | fix: specify nbformat version in nb (#2023) | 2025-04-25 10:10:37 +02:00 |
| 03_Image_Chat101.ipynb | fix: update zero-to-hero guide for modern llama stack (#2555) | 2025-06-30 18:09:33 -07:00 |
| 04_Tool_Calling101.ipynb | fix: update zero_to_hero package and README (#2578) | 2025-07-01 11:08:55 -07:00 |
| 05_Memory101.ipynb | fix: update zero-to-hero guide for modern llama stack (#2555) | 2025-06-30 18:09:33 -07:00 |
| 06_Safety101.ipynb | fix: specify nbformat version in nb (#2023) | 2025-04-25 10:10:37 +02:00 |
| 07_Agents101.ipynb | fix(docs): update Agents101 notebook for builtin websearch (#2591) | 2025-07-03 11:14:51 +02:00 |
| README.md | fix: update zero_to_hero package and README (#2578) | 2025-07-01 11:08:55 -07:00 |
| Tool_Calling101_Using_Together_Llama_Stack_Server.ipynb | docs: update test_agents to use new Agent SDK API (#1402) | 2025-03-06 15:21:12 -08:00 |

Llama Stack: from Zero to Hero

Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented as interoperable APIs, with a broad set of Providers supplying their implementations. They are assembled into Distributions that make it easy for developers to go from zero to production.

This guide will walk you through an end-to-end workflow with Llama Stack, using Ollama as the inference provider and ChromaDB as the VectorIO provider. Please note that the steps for configuring your provider and distribution will vary depending on the services you use; the user experience, however, remains the same - this is the power of Llama Stack.

If you're looking for more specific topics, we have a Zero to Hero Guide that covers everything from 'Tool Calling' to 'Agents' in detail. Feel free to skip to the end to explore the advanced topics you're interested in.

If you'd prefer not to set up a local server, explore our notebook on tool calling with the Together API. This notebook will show you how to leverage together.ai's Llama Stack Server API, allowing you to get started with Llama Stack without the need for a locally built and running server.

Table of Contents

  1. Setup and run ollama
  2. Install Dependencies and Set Up Environment
  3. Build, Configure, and Run Llama Stack
  4. Test with llama-stack-client CLI
  5. Test with curl
  6. Test with Python
  7. Next Steps

Setup and run ollama

  1. Download Ollama App:

    • Go to https://ollama.com/download.
    • Follow instructions based on the OS you are on. For example, if you are on a Mac, download and unzip Ollama-darwin.zip.
    • Run the Ollama application.
  2. Download the Ollama CLI: Ensure you have the ollama command line tool by downloading and installing it from the same website.

  3. Start ollama server: Open the terminal and run:

    ollama serve
    
  4. Run the model: Open the terminal and run:

    ollama run llama3.2:3b-instruct-fp16 --keepalive -1m
    

    Note:

    • The models currently supported by Llama Stack are listed here
    • --keepalive -1m keeps the model loaded in memory indefinitely. Otherwise, Ollama unloads it after a period of inactivity and you would have to run ollama run again.
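
Before moving on, you can confirm that Ollama is serving and the model has been pulled. A quick check from Python (a sketch that assumes Ollama is on its default port 11434):

import json
import urllib.request

# Ollama lists locally available models at /api/tags.
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)["models"]

# Expect to see llama3.2:3b-instruct-fp16 (or the model you pulled) here.
print([m["name"] for m in models])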

Install Dependencies and Set Up Environment

  1. Create a Conda Environment: Create a new Conda environment with Python 3.12:

    conda create -n ollama python=3.12
    

    Activate the environment:

    conda activate ollama
    
  2. Install ChromaDB: Install chromadb using pip:

    pip install chromadb
    
  3. Run ChromaDB: Start the ChromaDB server:

    chroma run --host localhost --port 8000 --path ./my_chroma_data
    
  4. Install Llama Stack: Open a new terminal and install llama-stack:

    conda activate ollama
    pip install -U llama-stack
    

Build, Configure, and Run Llama Stack

  1. Build the Llama Stack: Build the Llama Stack using the ollama template:

    llama stack build --template ollama --image-type conda
    

    Expected Output:

    ...
    Build Successful!
    You can find the newly-built template here: ~/.llama/distributions/ollama/ollama-run.yaml
    You can run the new Llama Stack Distro via: llama stack run ~/.llama/distributions/ollama/ollama-run.yaml --image-type conda
    
  2. Set the ENV variables by exporting them to the terminal:

    export OLLAMA_URL="http://localhost:11434"
    export LLAMA_STACK_PORT=8321
    export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
    export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
    
  3. Run the Llama Stack: Run the stack using the command shown in the build output above:

    llama stack run ollama \
       --port $LLAMA_STACK_PORT \
       --env INFERENCE_MODEL=$INFERENCE_MODEL \
       --env SAFETY_MODEL=$SAFETY_MODEL \
       --env OLLAMA_URL=$OLLAMA_URL
    

    Note: Every time you run a new model with ollama run, you will need to restart the llama stack. Otherwise it won't see the new model.

The server will start and listen on http://localhost:8321.
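
Once the server is up, you can sanity-check it from Python with the client SDK. A small sketch (field names such as api and provider_id are assumed from the provider listing):

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Print the providers the running stack was configured with
# (inference, vector_io, safety, ...).
for provider in client.providers.list():
    print(provider.api, provider.provider_id)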


Test with llama-stack-client CLI

After setting up the server, open a new terminal window and configure the llama-stack-client.

  1. Configure the CLI to point to the llama-stack server.
    llama-stack-client configure --endpoint http://localhost:8321
    
    Expected Output:
    Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
    
  2. Test the CLI by running inference:
    llama-stack-client inference chat-completion --message "Write me a 2-sentence poem about the moon"
    
    Expected Output:
    ChatCompletionResponse(
        completion_message=CompletionMessage(
            content='Here is a 2-sentence poem about the moon:\n\nSilver crescent shining bright in the night,\nA beacon of wonder, full of gentle light.',
            role='assistant',
            stop_reason='end_of_turn',
            tool_calls=[]
        ),
        logprobs=None
    )
    

Test with curl

After setting up the server, open a new terminal window and verify it's working by sending a POST request using curl:

curl http://localhost:$LLAMA_STACK_PORT/alpha/inference/chat-completion \
-H "Content-Type: application/json" \
-d @- <<EOF
{
    "model_id": "$INFERENCE_MODEL",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"}
    ],
    "sampling_params": {
      "strategy": {
         "type": "top_p",
         "temperature": 0.7,
         "top_p": 0.95
      },
      "seed": 42,
      "max_tokens": 512
   }
}
EOF

You can check the available models with the command llama-stack-client models list.

Expected Output:

{
  "completion_message": {
    "role": "assistant",
    "content": "The moon glows softly in the midnight sky,\nA beacon of wonder, as it catches the eye.",
    "stop_reason": "out_of_tokens",
    "tool_calls": []
  },
  "logprobs": null
}
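
The model check mentioned above can also be done programmatically. A small sketch using the Python client, assuming the server from the previous steps:

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List the models registered with the running Llama Stack server.
for model in client.models.list():
    print(model.identifier)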

Test with Python

You can also interact with the Llama Stack server using a simple Python script. Below is an example:

1. Activate Conda Environment

conda activate ollama

2. Create Python Script (test_llama_stack.py)

touch test_llama_stack.py

3. Create a Chat Completion Request in Python

In test_llama_stack.py, write the following code:

import os
from llama_stack_client import LlamaStackClient

# Get the model ID from the environment variable
INFERENCE_MODEL = os.environ.get("INFERENCE_MODEL")

# Check if the environment variable is set
if INFERENCE_MODEL is None:
    raise ValueError("The environment variable 'INFERENCE_MODEL' is not set.")

# Initialize the client
client = LlamaStackClient(base_url="http://localhost:8321")

# Create a chat completion request
response = client.inference.chat_completion(
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."},
    ],
    model_id=INFERENCE_MODEL,
)

# Print the response
print(response.completion_message.content)

4. Run the Python Script

python test_llama_stack.py

Expected Output:

The moon glows softly in the midnight sky,
A beacon of wonder, as it catches the eye.
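
If you want the same sampling controls as in the curl example, chat_completion also accepts sampling parameters. A hedged sketch, with the parameter shapes assumed to mirror the REST payload shown earlier:

import os

from llama_stack_client import LlamaStackClient

INFERENCE_MODEL = os.environ["INFERENCE_MODEL"]
client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.chat_completion(
    model_id=INFERENCE_MODEL,
    messages=[
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"},
    ],
    # Mirrors the curl request: nucleus sampling with temperature 0.7.
    sampling_params={
        "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.95},
        "max_tokens": 512,
    },
)
print(response.completion_message.content)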

With these steps, you should have a functional Llama Stack setup capable of generating text using the specified model. For more detailed information and advanced configurations, refer to some of our documentation below.



Next Steps

Explore Other Guides: Dive deeper into specific topics by following these guides:

Explore Client SDKs: Utilize our client SDKs for various languages to integrate Llama Stack into your applications:

Advanced Configuration: Learn how to customize your Llama Stack distribution by referring to the Building a Llama Stack Distribution guide.

Explore Example Apps: Check out llama-stack-apps for example applications built using Llama Stack.