[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)

# Llama Stack - Building AI Applications

<img src="https://llama-stack.readthedocs.io/en/latest/_images/llama-stack.png" alt="drawing" width="500"/>

[Llama Stack](https://github.com/meta-llama/llama-stack) defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.

Read more about the project here: https://llama-stack.readthedocs.io/en/latest/index.html

In this guide, we will showcase how you can build LLM-powered agentic applications using Llama Stack.

**💡 Quick Start Option:** If you want a simpler and faster way to test out Llama Stack, check out the [quick_start.ipynb](quick_start.ipynb) notebook instead. It provides a streamlined experience for getting up and running in just a few steps.


## 1. Getting started with Llama Stack

### 1.1. Setup API keys


To run inference with Llama models, you need an inference provider. Llama Stack supports a number of inference [providers](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/remote/inference).


In this showcase, we will use [together.ai](https://www.together.ai/) as the inference provider, so first get an API key from Together if you don't have one already.

Steps [here](https://docs.google.com/document/d/1Vg998IjRW_uujAPnHdQ9jQWvtmkZFt74FldW2MblxPY/edit?usp=sharing).

You can also use Fireworks.ai or even Ollama if you prefer.



To set up the API keys for Together and Tavily Search, you will use Google Colab's user data secrets feature.

1. Click on the "🔑" icon in the left sidebar to open the secrets manager.
2. Add your `TOGETHER_API_KEY` and `TAVILY_SEARCH_API_KEY` as secrets.
3. The following code will then load these secrets as environment variables.

In [1]:
import os
import getpass
try:
    from google.colab import userdata
    os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')
    os.environ['TAVILY_SEARCH_API_KEY'] = userdata.get('TAVILY_SEARCH_API_KEY')
except ImportError:
    print("Not in Google Colab environment")

for key in ['TOGETHER_API_KEY', 'TAVILY_SEARCH_API_KEY']:
    try:
        api_key = os.environ[key]
        if not api_key:
            raise ValueError(f"{key} environment variable is empty")
    except KeyError:
        api_key = getpass.getpass(f"{key} environment variable is not set. Please enter your API key: ")
        os.environ[key] = api_key

### 1.1.1 Use Ollama instead (optional)

Optionally, you can use Ollama for local inference to avoid any API cost. To use Ollama as a model provider, you need to install and run Ollama and pull the desired models.

Here are the steps:

1. **Install Ollama:** Run the provided script to install Ollama.
2. **Start the Ollama server and pull the models:** Start the Ollama server and pull the `llama-guard3:1b` and `llama3.2:3b` models, which this notebook uses as the safety shield and the inference model, respectively.
3. **Set `OLLAMA_URL`:** Set the environment variable `OLLAMA_URL` to `http://localhost:11434` so that Llama Stack knows where to connect.

In [2]:
os.environ['OLLAMA_URL'] = 'http://localhost:11434'

#Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh

#Start Ollama server with llama-guard3:1b model and llama3.2:3b
!nohup ollama serve > ollama_server.log 2>&1 &
!ollama pull llama-guard3:1b
!ollama pull llama3.2:3b

>>> Cleaning up old version at /usr/local/lib/ollama
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.

In [3]:
# Double check ollama model running
!curl 127.0.0.1:11434/v1/models


{"object":"list","data":[{"id":"llama3.2:3b","object":"model","created":1759163051,"owned_by":"library"},{"id":"llama-guard3:1b","object":"model","created":1759163021,"owned_by":"library"}]}


### 1.2. Setup and Running a Llama Stack server

Llama Stack is architected as a collection of APIs that provide developers with the building blocks to build AI applications.

Llama Stack is typically available as a server with an endpoint that you can make calls to. Partners like Together and Fireworks offer their own Llama Stack compatible endpoints.

In this showcase, we will start a Llama Stack server that is running locally.


In [2]:
# Install UV if not available
!curl -LsSf https://astral.sh/uv/install.sh | sh
# Complete setup for Google Colab with custom directories
import os
!uv venv venv --clear
!source ./venv/bin/activate && uv run --with llama-stack llama stack build --distro starter --image-type venv
!nohup python -m llama_stack.core.server.server /root/.llama/distributions/starter/starter-run.yaml --port 8321 > llama_stack_server.log &
def wait_for_server_to_start():
    import requests
    from requests.exceptions import ConnectionError
    import time

    url = "http://0.0.0.0:8321/v1/health"
    max_retries = 30
    retry_interval = 1

    print("Waiting for server to start", end="")
    for _ in range(max_retries):
        try:
            response = requests.get(url)
            if response.status_code == 200:
                print("\nServer is ready!")
                return True
        except ConnectionError:
            pass
        # Wait before the next attempt, whether the request failed or returned a non-200 status
        print(".", end="", flush=True)
        time.sleep(retry_interval)

    print("\nServer failed to start after", max_retries * retry_interval, "seconds")
    return False
assert wait_for_server_to_start()
print("llama stack server hosted on localhost:8321")

downloading uv 0.8.22 x86_64-unknown-linux-gnu
no checksums to verify
installing to /usr/local/bin
  uv
  uvx
everything's installed!
Using CPython 3.12.11 interpreter at: /usr/bin/python3
Creating virtual environment at: venv
Activate with: source venv/bin/activate
Environment '/root/.cache/uv/builds-v0/.tmptHMEzy' already exists, re-using it.
Installing dependencies in system Python environment
Using Python 3.12.11 environment at: /usr
Audited 1 package in 1.34s
Installing pip dependencies
Using Python 3.12.11 environment at: /usr
Audited 50 packages in 288ms
Installing special provider module: torch torchvision torchao>=0.12.0 --extra-index-url https://download.pytorch.org/whl/cpu
Using Python 3.12.11 environment at: /usr
Audited 3 packages in 82ms
Installing special provider module: torch torchtune>=0.5.0 torchao>=0.12.0 --extra-index-url https://downloa

In [3]:
!cat llama_stack_server.log


INFO     2025-09-29 16:41:56,316 llama_stack.core.utils.config_resolution:45 core: Using file path:                                                   
         /root/.llama/distributions/starter/starter-run.yaml                                                                                          
INFO     2025-09-29 16:41:56,340 __main__:593 core::server: Run configuration:                                                                        
INFO     2025-09-29 16:41:56,349 __main__:596 core::server: apis:                                                                                     
         - agents                                                                                                                                     
         - batches                                                                                                                                    
         - datasetio                                                                          

### 1.4. Install and Configure the Client

Now that we have our Llama Stack server running locally, we need to install the client package to interact with it. The `llama-stack-client` provides a simple Python interface to access all the functionality of Llama Stack, including:

- Chat Completions (text and multimodal)
- Safety Shields
- Agent capabilities with tools such as web search and RAG, using the Response API

The client handles all the API communication with our local server, making it easy to integrate Llama Stack's capabilities into your applications.

In the next cells, we'll:

1. Install the client package
2. Set up API keys for external services (Together and Tavily Search)
3. Initialize the client to connect to our local server


In [4]:
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://0.0.0.0:8321",
    provider_data = {
        "tavily_search_api_key": os.environ['TAVILY_SEARCH_API_KEY'],
        "TOGETHER_API_KEY": os.environ['TOGETHER_API_KEY']
    }
)

Now that we have completed the setup and configuration, let's start exploring the capabilities of Llama Stack! We'll begin by checking what models and safety shields are available, and then move on to running some example chat completions.



### 1.5. Check available models and shields

All the models available in the provider are now programmatically accessible via the client.

In [5]:
from rich.pretty import pprint

print("Available models:")
for m in client.models.list():
    print(f"- {m.identifier}")



Available models:
- ollama/llama-guard3:1b
- ollama/llama3.2:3b
- bedrock/meta.llama3-1-8b-instruct-v1:0
- bedrock/meta.llama3-1-70b-instruct-v1:0
- bedrock/meta.llama3-1-405b-instruct-v1:0
- sentence-transformers/all-MiniLM-L6-v2
- together/Alibaba-NLP/gte-modernbert-base
- together/arcee-ai/AFM-4.5B
- together/arcee-ai/coder-large
- together/arcee-ai/maestro-reasoning
- together/arcee-ai/virtuoso-large
- together/arcee_ai/arcee-spotlight
- together/arize-ai/qwen-2-1.5b-instruct
- together/BAAI/bge-base-en-v1.5
- together/BAAI/bge-large-en-v1.5
- together/black-forest-labs/FLUX.1-dev
- together/black-forest-labs/FLUX.1-dev-lora
- together/black-forest-labs/FLUX.1-kontext-dev
- together/black-forest-labs/FLUX.1-kontext-max
- together/black-forest-labs/FLUX.1-kontext-pro
- together/black-forest-labs/FLUX.1-krea-dev
- together/black-forest-labs/FLUX.1-pro
- together/black-forest-labs/FLUX.1-schnell
- together/black-forest-labs/FLUX.1-schnell-Free
- together/black-forest-labs/FLUX.1.1-pro
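
The identifiers in the listing above appear to follow a `provider/model` pattern. As a rough sketch under that assumption, the listing can be grouped by provider to make a long list easier to scan. The helper name `group_by_provider` is hypothetical and not part of the client; the sample identifiers are copied from the output above.

```python
from collections import defaultdict


def group_by_provider(identifiers):
    """Group model identifiers by the text before the first '/'.

    Assumes identifiers look like 'provider/model-name'; this is inferred
    from the printed listing, not from a documented guarantee.
    """
    groups = defaultdict(list)
    for identifier in identifiers:
        provider, _, model_name = identifier.partition("/")
        groups[provider].append(model_name)
    return dict(groups)


# Sample identifiers taken from the listing above
sample = [
    "ollama/llama-guard3:1b",
    "ollama/llama3.2:3b",
    "sentence-transformers/all-MiniLM-L6-v2",
    "together/BAAI/bge-base-en-v1.5",
]

for provider, names in group_by_provider(sample).items():
    print(f"{provider}: {names}")
```

With a running server, the same helper could be fed `[m.identifier for m in client.models.list()]`.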

### 1.6. Run a simple chat completion with one of the models

We will test the client by doing a simple chat completion.

In [7]:
model_id = "together/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
# To use Ollama instead, uncomment the following line
# model_id = "ollama/llama3.2:3b"
response = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."},
    ],
    stream=False
)

print(response.choices[0].message.content)


Here is a two-sentence poem about llamas:

Softly steps the llama's gentle pace, with fur so soft and a gentle face. In the Andes' high and misty space, the llama roams with a peaceful grace.


### 1.7. Have a conversation

Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session.
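
To see the accumulation pattern in isolation, here is a minimal offline sketch. It replaces the real model with a hypothetical stand-in, `fake_model`, which only reports how many user turns it received; `ask` is likewise an illustrative helper, not a client method. The point is that every call passes the full history, so later turns can refer back to earlier ones.

```python
def fake_model(messages):
    """Hypothetical stand-in for a model call: reports the user turns it can see."""
    user_turns = [m for m in messages if m["role"] == "user"]
    return f"I can see {len(user_turns)} user message(s)."


def ask(history, user_input, reply_fn=fake_model):
    """Append the user turn, get a reply for the full history, then append the reply."""
    history.append({"role": "user", "content": user_input})
    reply = reply_fn(history)
    history.append({"role": "assistant", "content": reply})
    return reply


history = []
first_reply = ask(history, "Who was the UK prime minister during World War II?")
second_reply = ask(history, "What was his most famous quote?")

print(first_reply)
print(second_reply)
print([m["role"] for m in history])
```

Swapping `reply_fn` for a function that calls `client.chat.completions.create(messages=history, model=model_id)` gives the behavior of the notebook cell that follows.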

In [8]:
from termcolor import cprint

questions = [
    "Who was the most famous PM of England during world war 2 ?",
    "What was his most famous quote ?"
]


def chat_loop():
    conversation_history = []
    while len(questions) > 0:
        user_input = questions.pop(0)
        if user_input.lower() in ["exit", "quit", "bye"]:
            cprint("Ending conversation. Goodbye!", "yellow")
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.chat.completions.create(
            messages=conversation_history,
            model=model_id,
        )
        cprint(f"> Response: {response.choices[0].message.content}", "cyan")

        assistant_message = {
            "role": "assistant",
            "content": response.choices[0].message.content,
            "finish_reason": response.choices[0].finish_reason,
        }
        conversation_history.append(assistant_message)


chat_loop()


> Response: The most famous Prime Minister of England during World War II was Winston Churchill. He served as the Prime Minister of the United Kingdom from 1940 to 1945, and again from 1951 to 1955. Churchill is widely regarded as one of the greatest wartime leaders in history, known for his leadership, oratory skills, and unwavering resolve during the war.

Churchill played a crucial role in rallying the British people during the war, and his speeches, such as the "We shall fight on the beaches" and "Their finest hour" speeches, are still remembered and celebrated today. He worked closely with other Allied leaders, including US President Franklin D. Roosevelt and Soviet leader Joseph Stalin, to coordinate the war effort and ultimately secure the defeat of Nazi Germany and the Axis powers.

Churchill's leadership and legacy continue to be celebrated and studied around the world, and he remains one of the most iconic and influential leaders of the 20th century.
> Response: Winston Churc

Here is an example for you to try a conversation yourself.
Remember to type `quit` or `exit` after you are done chatting.

In [None]:
# NBVAL_SKIP
from termcolor import cprint

def chat_loop():
    conversation_history = []
    while True:
        user_input = input("User> ")
        if user_input.lower() in ["exit", "quit", "bye"]:
            cprint("Ending conversation. Goodbye!", "yellow")
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.chat.completions.create(
            messages=conversation_history,
            model=model_id,
        )
        cprint(f"> Response: {response.choices[0].message.content}", "cyan")

        assistant_message = {
            "role": "assistant",
            "content": response.choices[0].message.content,
            "finish_reason": response.choices[0].finish_reason,
        }
        conversation_history.append(assistant_message)


chat_loop()


User> who are you?
> Response: I'm an AI assistant designed by Meta. I'm here to answer your questions, share interesting ideas and maybe even surprise you with a fresh perspective. What's on your mind?
User> how can you help me?
> Response: I can help you with a wide range of things, such as answering questions, providing information, generating text or images, summarizing content, or just having a chat. I can also help with creative tasks like brainstorming or coming up with ideas. What do you need help with today?
User> bye
Ending conversation. Goodbye!


### 1.9 Multimodal inference

In [14]:
vision_model_id = "together/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
response = client.chat.completions.create(
    model=vision_model_id,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://raw.githubusercontent.com/meta-llama/llama-models/refs/heads/main/Llama_Repo.jpeg",
                },
            },
        ],
    }],
)

print(response.choices[0].message.content)

The image depicts three llamas standing behind a table, with one of them wearing a party hat. The scene is set in a barn or stable.

*   **Llamas**
    *   There are three llamas in the image.
    *   The llama on the left is white.
    *   The middle llama is purple.
    *   The llama on the right is white and wearing a blue party hat.
    *   All three llamas have their ears perked up and are looking directly at the camera.
*   **Table**
    *   The table is made of light-colored wood.
    *   It has a few scattered items on it, including what appears to be hay or straw.
    *   A glass containing an amber-colored liquid sits on the table.
*   **Background**
    *   The background is a wooden wall or fence.
    *   The wall is made up of vertical planks of wood.

The image appears to be a playful and whimsical depiction of llamas celebrating a special occasion, possibly a birthday.


### 1.10. Streaming output

You can pass `stream=True` to stream responses from the model. You can then loop through the responses.
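
The loop over streamed chunks can be tried without a server. The sketch below builds hypothetical chunk objects with `SimpleNamespace` that mimic the `chunk.choices[0].delta.content` shape used in this section; `make_chunk` and `collect_stream` are illustrative names, not part of the client. It prints each piece as it arrives and also returns the assembled text.

```python
from types import SimpleNamespace


def make_chunk(text):
    """Build a hypothetical chunk shaped like chunk.choices[0].delta.content."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(chunks):
    """Print each content piece as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        # Skip chunks with no choices or with an empty delta
        if chunk.choices and chunk.choices[0].delta.content is not None:
            piece = chunk.choices[0].delta.content
            print(piece, end="", flush=True)
            parts.append(piece)
    print()
    return "".join(parts)


# Simulated stream, including an empty delta and a chunk without choices
fake_stream = [
    make_chunk("Llamas "),
    make_chunk(None),
    SimpleNamespace(choices=[]),
    make_chunk("hum."),
]
full_text = collect_stream(fake_stream)
```

Keeping the joined text is useful when the streamed reply needs to be appended to a conversation history afterwards.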

In [10]:
from llama_stack_client import InferenceEventLogger

message = {"role": "user", "content": "Write me a sonnet about llama"}
print(f'User> {message["content"]}')

response = client.chat.completions.create(
    messages=[message],
    model=model_id,
    stream=True,  # <-----------
)

for chunk in response:
    # Each chunk contains a delta with the content
    if len(chunk.choices) > 0 and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)


User> Write me a sonnet about llama
Here is a sonnet about llamas:

In Andean highlands, llamas roam with pride,
Their soft, woolly coats a gentle, fuzzy hue.
Their large, dark eyes, like pools of liquid inside,
Reflect a calm and gentle spirit anew.

Their ears, so long and pointed, perk with ease,
As they survey their surroundings with quiet peace.
Their steps, deliberate and slow, release
A soothing calm that troubles cannot cease.

Their gentle humming fills the mountain air,
A soothing sound that's both serene and rare.
Their soft, padded feet, a quiet tread impart,
As they move with gentle steps, a peaceful start.

And when they look at you with curious stare,
You feel a sense of calm, beyond compare.

### 2.0. Structured Decoding

You can use `response_format` to put the model into a "guided decoding" mode, in which the generated tokens are constrained to follow a given grammar. Currently, only JSON grammars are supported.
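
As a rough offline illustration of what a JSON grammar looks like, the sketch below writes a JSON Schema by hand, wraps it in the `response_format` shape used in this section, and checks a sample reply against it. The schema, the `parse_reply` helper, and the sample reply string are all assumptions made for illustration; the check only covers required string fields and is not a full JSON Schema validator.

```python
import json

# Hand-written JSON Schema describing three required string fields
output_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "year_born": {"type": "string"},
        "year_retired": {"type": "string"},
    },
    "required": ["name", "year_born", "year_retired"],
}

# Same overall shape as the response_format argument used in this section
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "output", "schema": output_schema},
}


def parse_reply(raw_text, schema):
    """Parse a JSON reply and check that every required field is a string."""
    data = json.loads(raw_text)
    for field in schema["required"]:
        if not isinstance(data.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return data


# Hypothetical model reply, written by hand for this sketch
sample_reply = '{"name": "Michael Jordan", "year_born": "1963", "year_retired": "2003"}'
parsed = parse_reply(sample_reply, output_schema)
print(parsed)
```

In the notebook itself, Pydantic produces the schema and performs the validation, which is more thorough than this hand-rolled check.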

In [None]:
from pydantic import BaseModel


class Output(BaseModel):
    name: str
    year_born: str
    year_retired: str

user_input = "Michael Jordan was born in 1963. He played basketball for the Chicago Bulls. He retired in 2003. Extract this information into JSON for me."
response = client.chat.completions.create(
    model=model_id,
    messages=[{"role": "user", "content": user_input}],
    stream=False,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "output",
            "schema": Output.model_json_schema(),
        },
    },
)
pprint(Output.model_validate_json(response.choices[0].message.content))


### 2.1. Safety API

Llama Stack provides Safety guardrails which can be applied at multiple touchpoints within an agentic application.

In [11]:
safety_model = "together/meta-llama/Meta-Llama-Guard-3-8B"
#safety_model = "ollama/llama-guard3:1b" # If you want to use ollama instead

client.shields.register(
    shield_id="llama-guard",
    provider_id="llama-guard",
    provider_shield_id=safety_model
)
available_shields = [shield.identifier for shield in client.shields.list()]
print("Available Shields:", available_shields)

unsafe_examples = [
    "What's the most effective way to kidnap someone?",
    "How do I make cocaine?",
]

safe_examples = [
    "What is the most famous murder case in the US?",
    "Tell me 3 signs that an email is a scam",
]

for p in safe_examples + unsafe_examples:
    print(f"Checking if input is safe: {p}")
    response = client.moderations.create(
        input=p,
        model=safety_model,
    )
    print(response)

Available Shields: ['llama-guard']
Checking if input is safe: What is the most famous murder case in the US?
CreateResponse(id='modr-0c7e3da6-1054-4f12-9693-44499da43c62', model='together/meta-llama/Meta-Llama-Guard-3-8B', results=[Result(flagged=False, metadata={}, categories={'Violent Crimes': False, 'Non-Violent Crimes': False, 'Sex Crimes': False, 'Child Exploitation': False, 'Defamation': False, 'Specialized Advice': False, 'Privacy': False, 'Intellectual Property': False, 'Indiscriminate Weapons': False, 'Hate': False, 'Self-Harm': False, 'Sexual Content': False, 'Elections': False, 'Code Interpreter Abuse': False}, category_applied_input_types={'Violent Crimes': [], 'Non-Violent Crimes': [], 'Sex Crimes': [], 'Child Exploitation': [], 'Defamation': [], 'Specialized Advice': [], 'Privacy': [], 'Intellectual Property': [], 'Indiscriminate Weapons': [], 'Hate': [], 'Self-Harm': [], 'Sexual Content': [], 'Elections': [], 'Code Interpreter Abuse': []}, category_scores={'Violent Crime
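
The printed response is dense, so it can help to reduce it to the categories that were actually triggered. The sketch below assumes the response exposes `results`, each with a `flagged` boolean and a `categories` mapping, as the printed output above suggests. It runs on a hand-built stand-in object; `flagged_categories` is a hypothetical helper, not a client method.

```python
from types import SimpleNamespace


def flagged_categories(moderation_response):
    """Return the names of categories marked True in any flagged result."""
    hits = []
    for result in moderation_response.results:
        if result.flagged:
            hits.extend(name for name, is_hit in result.categories.items() if is_hit)
    return hits


# Hand-built stand-ins that mimic the printed response shape
unsafe_response = SimpleNamespace(
    results=[
        SimpleNamespace(
            flagged=True,
            categories={"Violent Crimes": True, "Hate": False},
        )
    ]
)
safe_response = SimpleNamespace(
    results=[
        SimpleNamespace(
            flagged=False,
            categories={"Violent Crimes": False, "Hate": False},
        )
    ]
)

print(flagged_categories(unsafe_response))
print(flagged_categories(safe_response))
```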

## 2. Llama Stack Agents

Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide will walk you through how to use these components effectively.




<img src="https://github.com/meta-llama/llama-stack/blob/main/docs/resources/agentic-system.png?raw=true" alt="drawing" width="800"/>


Agents are characterized by having access to

1. Memory - for RAG
2. Tool calling - ability to call tools like search and code execution
3. Tool call + Inference loop - the LLM used in the agent can perform multiple iterations of tool calls
4. Shields - safety checks that are executed every time the agent interacts with external systems, including user prompts
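
The tool call + inference loop from the list above can be sketched as a toy, fully offline loop. Everything here is hypothetical: `toy_model` stands in for the LLM and requests one tool call before answering, `toy_search` stands in for a search tool, and the message format is simplified. It is meant to show the control flow only, not how Llama Stack implements agents.

```python
def toy_model(messages):
    """Hypothetical model: requests one search, then answers from the tool result."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        query = messages[-1]["content"]
        return {"tool_call": {"name": "search", "arguments": {"query": query}}}
    return {"content": f"Answer based on: {tool_results[-1]['content']}"}


def toy_search(query):
    """Hypothetical search tool that returns a canned string."""
    return f"stub search result for '{query}'"


def run_agent(user_input, model=toy_model, tools=None, max_iterations=5):
    """Alternate between model inference and tool execution until a final answer."""
    tools = tools or {"search": toy_search}
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_iterations):
        step = model(messages)
        if "tool_call" not in step:
            return step["content"]  # The model produced a final answer
        call = step["tool_call"]
        result = tools[call["name"]](**call["arguments"])
        # Feed the tool result back so the next inference step can use it
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_iterations")


answer = run_agent("Where do llamas live?")
print(answer)
```

The `max_iterations` cap is a design choice for the sketch: it keeps a misbehaving model from looping on tool calls indefinitely.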

### 2.1. List available tool groups on the provider

In [None]:
from rich.pretty import pprint
for toolgroup in client.toolgroups.list():
    pprint(toolgroup)

### 2.2. Search agent

In this example, we will show how the model can invoke search to be able to answer questions. We will first have to set the API key of the search tool.

Let's make sure we set up a web search tool for the model to call in its agentic loop. In this tutorial, we will use [Tavily](https://tavily.com) as our search provider. Note that the "type" of the tool is still "brave_search" since Llama models have been trained with brave search as a builtin tool. Tavily is just being used in lieu of Brave search.

See steps [here](https://docs.google.com/document/d/1Vg998IjRW_uujAPnHdQ9jQWvtmkZFt74FldW2MblxPY/edit?tab=t.0#heading=h.xx02wojfl2f9).

In [None]:
web_search_response = client.responses.create(
    model=model_id,
    input="Which teams played in the NBA western conference finals of 2024",
    tools=[
        {
            "type": "web_search",
        },
    ],  # Web search for current information
)
print(f"Web search results: {web_search_response.output[-1].content[0].text}")

Web search results: The teams that played in the 2024 NBA Western Conference Finals were the Dallas Mavericks and the Minnesota Timberwolves. The Mavericks won the series 4-1.


### 2.3. RAG Agent

In this example, we will index some documentation and ask questions about that documentation.

The tool we use is the memory tool. Given a list of memory banks, the tool can help the agent query and retrieve relevant chunks. In this example, we first create a memory bank and add some documents to it, then configure the agent to use the memory tool. The difference from the web search example is that we pass the memory bank as an argument to the tool. A toolgroup can be provided to the agent as just a plain name, or as a dict with both the name and the arguments needed for the toolgroup. These arguments are injected by the agent into every tool call made for the corresponding toolgroup.
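
To build intuition for what "query and retrieve relevant chunks" means, here is a toy retrieval sketch using only the standard library. It scores documents by word-count cosine similarity, which is a deliberately simple stand-in for the embedding-based search a real vector store performs. The helper names are hypothetical, and each document's title is prepended to its text so that the query shares a word with the relevant document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, top_k=1):
    """Return the top_k (score, document) pairs, highest score first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(doc))), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


documents = [
    "Shipping Policy: Acme ships globally in 3-5 business days.",
    "Returns Policy: Returns are accepted within 30 days of purchase.",
    "Support: Support is available 24/7 via chat and email.",
]

results = retrieve("How long does shipping take?", documents)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

Word overlap misses paraphrases (for example "ships" versus "shipping"), which is one reason real systems use embedding models instead.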

In [None]:
from io import BytesIO


#delete any existing vector store
vector_stores_to_delete = [v.id for v in client.vector_stores.list()]
for del_vs_id in vector_stores_to_delete:
    client.vector_stores.delete(vector_store_id=del_vs_id)
print('Deleted all exisitng vector store')

docs = [
    ("Acme ships globally in 3-5 business days.", {"title": "Shipping Policy"}),
    ("Returns are accepted within 30 days of purchase.", {"title": "Returns Policy"}),
    ("Support is available 24/7 via chat and email.", {"title": "Support"}),
]
query = "How long does shipping take?"
file_ids = []
for content, metadata in docs:
    with BytesIO(content.encode()) as file_buffer:
        file_buffer.name = f"{metadata['title'].replace(' ', '_').lower()}.txt"
        create_file_response = client.files.create(file=file_buffer, purpose="assistants")
        print(create_file_response)
        file_ids.append(create_file_response.id)

# Create vector store with files
vector_store = client.vector_stores.create(
    name="acme_docs",
    file_ids=file_ids,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="faiss"
)
print("Listing available vector stores:")
vector_stores = client.vector_stores.list()
for vs in vector_stores:
    print(f"- {vs.name} (ID: {vs.id})")
    files_in_store = client.vector_stores.files.list(vector_store_id=vs.id)
    if files_in_store:
        print(f"  - Files in vector store '{vs.name}' (ID: {vs.id}):")
        for file in files_in_store:
            print(f"- {file.id}")
print("Searching Vector_store with query")
file_search_response = client.responses.create(
    model=model_id,
    input=query,
    tools=[
        {  # Using Responses API built-in tools
            "type": "file_search",
            "vector_store_ids": [vector_store.id],  # Vector store containing uploaded files
        },
    ],
)
print(file_search_response)
print(f"File search results: {file_search_response.output[-1].content[0].text}")


Deleted all exisitng vector store
File(id='file-354f3e6b09974322b5ad0007d5ece533', bytes=41, created_at=1758228715, expires_at=1789764715, filename='shipping_policy.txt', object='file', purpose='assistants')
File(id='file-94933acc81c043c9984d912736235294', bytes=48, created_at=1758228715, expires_at=1789764715, filename='returns_policy.txt', object='file', purpose='assistants')
File(id='file-540a598305114c1b90f68142cae56dc8', bytes=45, created_at=1758228715, expires_at=1789764715, filename='support.txt', object='file', purpose='assistants')
Listing available vector stores:
- acme_docs (ID: vs_4fba2b6a-0123-40c2-9dcf-61b6c50ec8c9)
  - Files in vector store 'acme_docs' (ID: vs_4fba2b6a-0123-40c2-9dcf-61b6c50ec8c9):
- file-354f3e6b09974322b5ad0007d5ece533
- file-94933acc81c043c9984d912736235294
- file-540a598305114c1b90f68142cae56dc8
Searching Vector_store with query
ResponseObject(id='resp-543f47fd-5bda-459d-8d61-39383a34bcf0', created_at=1758228715, model='groq/llama-3.1-8b-instant', ob

### 2.4. Using Model Context Protocol

In this example, we will show how tools hosted in an MCP server can be configured to be used by the model.

In the following steps, we will use the [filesystem tool](https://github.com/modelcontextprotocol/servers/tree/main/src/filesystem) to explore the files and folders available in the /content directory

Use xterm module to start a shell to run the MCP server using the `supergateway` tool which can start an MCP tool and serve it over HTTP.

This section demonstrates how to use the Model Context Protocol (MCP) with Llama Stack to interact with external tools hosted on an MCP server.


- This example demonstrates how to use the Llama Stack client to interact with a remote MCP tool.
- In this specific example, it connects to a remote Cloudflare documentation MCP server (`https://docs.mcp.cloudflare.com/sse`).
- The `client.responses.create` method is used with the `mcp` tool type, specifying the server details and the user input ("what is cloudflare").


**Key Concepts:**

- **Model Context Protocol (MCP):** A protocol that allows language models to interact with external tools and services.
- **MCP Tool:** A specific tool (like filesystem or a dice roller) that adheres to the MCP and can be interacted with by an MCP-enabled agent.
- **`client.responses.create`:** The Llama Stack client method used to create a response from a model, which can include tool calls to MCP tools.

This setup provides a flexible way to extend the capabilities of your Llama Stack agents by integrating with various external services and tools via the Model Context Protocol.
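
When several MCP servers are involved, it can be convenient to build the tool entries with a small helper. The sketch below is only an illustration: `mcp_tool` is a hypothetical function, and the dictionary keys are the ones that appear in this section's example (`type`, `server_label`, `server_url`, `require_approval`, `server_description`).

```python
def mcp_tool(server_label, server_url, require_approval="never", server_description=None):
    """Build one MCP tool entry for the tools list (hypothetical helper)."""
    tool = {
        "type": "mcp",
        "server_label": server_label,
        "server_url": server_url,
        "require_approval": require_approval,
    }
    # Only include the description when one is given
    if server_description:
        tool["server_description"] = server_description
    return tool


tools = [
    mcp_tool(
        "cloudflare_docs",
        "https://docs.mcp.cloudflare.com/sse",
        server_description="An MCP server for Cloudflare documentation.",
    )
]
print(tools)
```

The resulting list could then be passed as the `tools` argument of `client.responses.create`.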

In [None]:
# NBVAL_SKIP
resp = client.responses.create(
    model=model_id,
    tools=[
        {
            "type": "mcp",
            "server_label": "cloudflare_docs",
            "server_description": "A MCP server for cloudflare documentation.",
            "server_url": "https://docs.mcp.cloudflare.com/sse",
            "require_approval": "never",
        },
    ],
    input="what is cloudflare",
)

print(resp.output_text)

Cloudflare is a cloud-based service that provides a range of features to help protect and improve the performance, security, and reliability of websites, applications, and other online services. It is one of the world's largest connectivity cloud networks, powering Internet requests for millions of websites and serving 55 million HTTP requests per second on average.

Some of the key things Cloudflare does include:

1. Content Delivery Network (CDN): caching website content across a network of servers worldwide to reduce load times.
2. DDoS Protection: protecting against Distributed Denial-of-Service attacks by filtering out malicious traffic.
3. Firewall: acting as an additional layer of security, filtering out hacking attempts and malicious traffic.
4. SSL Encryption: providing free SSL encryption to secure sensitive information.
5. Bot Protection: identifying and blocking bots trying to exploit vulnerabilities or scrape content.
6. Analytics: providing insights into website traffic t

### 2.5 Response API Branching

The Llama Stack Response API supports branching, allowing you to explore different conversational paths or tool interactions based on a previous response. This is useful for scenarios where you want to try alternative approaches or gather information from different sources without losing the context of the initial interaction.

To branch from a previous response, you use the `previous_response_id` parameter in the `client.responses.create` method. This parameter takes the `id` of the response you want to branch from.

Here's how it works:

1. **Initial Response:** You make an initial call to `client.responses.create` to get a response. This response will have a unique `id`.

2. **Branching Response:** You make a subsequent call to `client.responses.create` for your branching query. In this call, you set the `previous_response_id` to the `id` of the initial response.

The new response will be generated in the context of the previous response, but you can specify different tools, inputs, or other parameters to explore a different path.

**Example:**

Let's say you made an initial web search about a topic and got `response1`. You can then branch from `response1` to perform a file search on the same topic by setting `previous_response_id=response1.id` in the second `client.responses.create` call.
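
Branching can be pictured as a tree in which each response points back to its parent. The toy store below is a hypothetical, in-memory model of that idea: it is not how Llama Stack stores responses, and `ToyResponseStore` and its methods are invented for illustration. It shows that two branches created from the same `previous_response_id` share the initial context but not each other's turns.

```python
import uuid


class ToyResponseStore:
    """Hypothetical in-memory store; each response remembers its parent's id."""

    def __init__(self):
        self._responses = {}

    def create(self, input_text, previous_response_id=None):
        """Record a response and return its new id."""
        if previous_response_id is not None and previous_response_id not in self._responses:
            raise KeyError(f"unknown response id: {previous_response_id}")
        response_id = f"resp-{uuid.uuid4()}"
        self._responses[response_id] = {
            "input": input_text,
            "previous": previous_response_id,
        }
        return response_id

    def context(self, response_id):
        """Walk parent links back to the root and return the inputs oldest-first."""
        chain = []
        while response_id is not None:
            record = self._responses[response_id]
            chain.append(record["input"])
            response_id = record["previous"]
        return list(reversed(chain))


store = ToyResponseStore()
root = store.create("Search the web for sorting algorithms")
branch_a = store.create("Now search my uploaded files", previous_response_id=root)
branch_b = store.create("Summarize only the web results", previous_response_id=root)

print(store.context(branch_a))
print(store.context(branch_b))
```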

In [None]:
from io import BytesIO
import uuid

# delete any existing vector store
vector_stores_to_delete = [v.id for v in client.vector_stores.list()]
for del_vs_id in vector_stores_to_delete:
    client.vector_stores.delete(vector_store_id=del_vs_id)
print('Deleted all existing vector stores')

# Create a dummy file for the file search
dummy_file_content = "Popular sorting implementations include quicksort, mergesort, heapsort, and insertion sort. Bubble sort and selection sort are used for small or simple datasets. Counting sort, radix sort, and bucket sort handle special numeric cases efficiently without comparisons. Timsort, a hybrid of merge and insertion sort, is widely used in Python and Java. Shell sort, comb sort, cocktail sort, and others are less common but exist for special scenarios."
with BytesIO(dummy_file_content.encode()) as file_buffer:
    file_buffer.name = "sorting_algorithms.txt"
    create_file_response = client.files.create(file=file_buffer, purpose="assistants")
    print(create_file_response)
    file_id = create_file_response.id

# Create a vector store with the dummy file
vector_store = client.vector_stores.create(
    name="sorting_docs",
    file_ids=[file_id],
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_dimension=384,  # This should match the embedding model
    provider_id="faiss",
)
print("Listing available vector stores:")
vector_stores = client.vector_stores.list()
for vs in vector_stores:
    print(f"- {vs.name} (ID: {vs.id})")

# First response: Use web search for latest algorithms
response1 = client.responses.create(
    model=model_id,
    input="Search for the latest efficient sorting algorithms and their performance comparisons",
    tools=[
        {
            "type": "web_search",
        },
    ],  # Web search for current information
)
print(f"Web search results: {response1.output[-1].content[0].text}")

# Continue conversation: Switch to file search for local docs
response2 = client.responses.create(
    model=model_id,
    input="Now search my uploaded files for existing sorting implementations",
    tools=[
        {  # Using Responses API built-in tools
            "type": "file_search",
            "vector_store_ids": [vector_store.id],  # Use the created vector store ID
        },
    ],
    previous_response_id=response1.id,
)

# Branch from first response: try a different search approach
# response3 = client.responses.create(
#     model=model_id,
#     input="Instead, search the web for Python-specific sorting best practices",
#     tools=[{"type": "web_search"}],  # Different web search query
#     previous_response_id=response1.id,  # Branch from response1
# )

# Responses API benefits:
# ✅ Dynamic tool switching (web search ↔ file search per call)
# ✅ OpenAI-compatible tool patterns (web_search, file_search)
# ✅ Branch conversations to explore different information sources
# ✅ Model flexibility per search type
# print(f"Web search results: {response1.output_text}")
# print(f"File search results: {response2.output_text}")
# print(f"Alternative web search: {response3.output_text}")

Deleted all existing vector stores
File(id='file-10ececb2f1234dce803436ba78a718fe', bytes=446, created_at=1758326763, expires_at=1789862763, filename='sorting_algorithms.txt', object='file', purpose='assistants')
Listing available vector stores:
- sorting_docs (ID: vs_69afb313-1c2d-4115-a9f4-8d31f4ff1ef3)
Web search results: The latest efficient sorting algorithms include Quicksort, Merge Sort, and Heap Sort, which have been compared in various studies for their performance. Quicksort is considered one of the fastest in-place sorting algorithms with good cache performance. Other algorithms like Bubble Sort, Selection Sort, and Insertion Sort are generally slower. For big data environments, several efficient sorting algorithms have been analyzed to improve processing speed. Some sources comparing the performance of these algorithms include Codemotion, Medium, Quora, ScienceDirect, and Built In.


InternalServerError: Error code: 500 - {'detail': 'Internal server error: An unexpected error occurred.'}

### Cleaning up the server

To stop the Llama Stack server and remove any created files and configurations, you can use the following code. This is useful for resetting your environment or before running the notebook again.

1. **Stop the server:** The code includes a helper function `kill_llama_stack_server()` that finds and terminates the running server process.
2. **Remove distribution files:** It also removes the distribution files located in `~/.llama/distributions/*`, which contain the server configuration and data.
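If a forced kill is more than you need, a gentler variant is possible. The following is a sketch under stated assumptions, not part of Llama Stack: the names `find_server_pids` and `stop_llama_stack_server` are hypothetical, and it assumes a Unix-like system where `ps aux` prints the PID in the second column. It matches the same `llama_stack.core.server.server` pattern as the cell in this section, sends `SIGTERM` by default, and does nothing when no server is running.

```python
import os
import signal
import subprocess

# Same process pattern that the cleanup cell in this section greps for.
SERVER_PATTERN = "llama_stack.core.server.server"


def find_server_pids(ps_output, pattern=SERVER_PATTERN):
    """Return the PIDs (second column of `ps aux`) of lines containing `pattern`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if pattern in line and len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


def stop_llama_stack_server(sig=signal.SIGTERM):
    """Send `sig` to every matching process and return the PIDs signalled."""
    ps_output = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    stopped = []
    for pid in find_server_pids(ps_output):
        try:
            os.kill(pid, sig)
            stopped.append(pid)
        except ProcessLookupError:
            pass  # The process exited between listing and signalling
    return stopped


# Usage (commented out so that defining the helpers has no side effects):
# stop_llama_stack_server()
```

Passing `sig=signal.SIGKILL` reproduces the forced behavior of `kill -9`.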

In [14]:
import os


# Use this helper if needed to kill the server
def kill_llama_stack_server():
    # Force-kill (SIGKILL) any existing Llama Stack server processes
    os.system("ps aux | grep -v grep | grep llama_stack.core.server.server | awk '{print $2}' | xargs kill -9")


# Stop the server first, then remove the distribution files
kill_llama_stack_server()

!rm -rf ~/.llama/distributions/*