[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1F2ksmkoGQPa4pzRjMOE6BXWeOxWFIW6n?usp=sharing)

# Llama Stack - Building AI Applications

<img src="https://llama-stack.readthedocs.io/en/latest/_images/llama-stack.png" alt="drawing" width="500"/>

[Llama Stack](https://github.com/meta-llama/llama-stack) defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.

Read more about the project: https://llama-stack.readthedocs.io/en/latest/index.html

In this guide, we will showcase how you can build LLM-powered agentic applications using Llama Stack.


## 1. Getting started with Llama Stack

### 1.1. Create TogetherAI account


To run inference with Llama models, you need an inference provider. Llama Stack supports a number of inference [providers](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/remote/inference).


In this showcase, we will use [together.ai](https://www.together.ai/) as the inference provider, so first get an API key from Together if you don't have one already.

Steps [here](https://docs.google.com/document/d/1Vg998IjRW_uujAPnHdQ9jQWvtmkZFt74FldW2MblxPY/edit?usp=sharing).

You can also use Fireworks.ai or even Ollama if you would like to.



> **Note:**  Set the API Key in the Secrets of this notebook
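
Since the rest of the notebook depends on the key being present, it can help to fail fast when it is missing. The following is a minimal sketch, not part of Llama Stack: `require_env` is a hypothetical helper name, and it only assumes the key has been exported as an environment variable.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; not a Llama Stack API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the notebook Secrets or export it first."
        )
    return value
```

Calling `require_env("TOGETHER_API_KEY")` before initializing the client would surface a missing key immediately rather than at the first inference call.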



### 1.2. Install Llama Stack

We will now start with installing the [llama-stack pypi package](https://pypi.org/project/llama-stack).

In addition, we will install [bubblewrap](https://github.com/containers/bubblewrap), a low-level, lightweight container framework that runs in the user namespace. We will use it to execute code generated by Llama in one of the examples.

In [42]:
!apt-get install -y bubblewrap
!pip install -U llama-stack

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
bubblewrap is already the newest version (0.6.1-1ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


### 1.3. Configure Llama Stack for Together


Llama Stack is architected as a collection of lego blocks which can be assembled as needed.


Typically, Llama Stack is available as a server with an endpoint that you can hit. We call this endpoint a [Distribution](https://llama-stack.readthedocs.io/en/latest/concepts/index.html#distributions). Partners like Together and Fireworks offer their own Llama Stack Distribution endpoints.

In this showcase, we are going to use Llama Stack inline as a library. So, given a particular set of providers, we must first package up the right set of dependencies. We have a template to use Together as an inference provider and [faiss](https://ai.meta.com/tools/faiss/) for memory/RAG.

We will run `llama stack build` to install all dependencies.

In [43]:
# This will build all the dependencies you will need
!llama stack build --template together --image-type venv

Installing pip dependencies
sentence-transformers --no-deps
torch --index-url https://download.pytorch.org/whl/cpu
Looking in indexes: https://download.pytorch.org/whl/cpu
Build Successful!


### 1.4. Initialize Llama Stack

Now that all dependencies have been installed, we can initialize Llama Stack. We will first set the `TOGETHER_API_KEY` and `TAVILY_SEARCH_API_KEY` environment variables.


In [1]:
import os
from google.colab import userdata

# Read the API keys from the notebook Secrets instead of hardcoding them
os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')
os.environ['TAVILY_SEARCH_API_KEY'] = userdata.get('TAVILY_SEARCH_API_KEY')

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
# Path to the distribution's run config; adjust it for your machine
client = LlamaStackAsLibraryClient("/Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml")
_ = client.initialize()



### 1.5. Check available models and shields

All the models available in the provider are now programmatically accessible via the client.

In [2]:
from rich.pretty import pprint
print("Available models:")
for m in client.models.list():
    print(f"{m.identifier} (provider's alias: {m.provider_resource_id}) ")

print("----")
print("Available shields (safety models):")
for s in client.shields.list():
    print(s.identifier)
print("----")

Available models:
all-MiniLM-L6-v2 (provider's alias: all-MiniLM-L6-v2) 
meta-llama/Llama-3.1-405B-Instruct-FP8 (provider's alias: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo) 
meta-llama/Llama-3.1-70B-Instruct (provider's alias: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo) 
meta-llama/Llama-3.1-8B-Instruct (provider's alias: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo) 
meta-llama/Llama-3.2-11B-Vision-Instruct (provider's alias: meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo) 
meta-llama/Llama-3.2-3B-Instruct (provider's alias: meta-llama/Llama-3.2-3B-Instruct-Turbo) 
meta-llama/Llama-3.2-90B-Vision-Instruct (provider's alias: meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo) 
meta-llama/Llama-Guard-3-11B-Vision (provider's alias: meta-llama/Llama-Guard-3-11B-Vision-Turbo) 
meta-llama/Llama-Guard-3-8B (provider's alias: meta-llama/Meta-Llama-Guard-3-8B) 
----
Available shields (safety models):
meta-llama/Llama-Guard-3-8B
----


### 1.6. Pick the model

We will use Llama3.1-70B-Instruct for our examples.

In [3]:
model_id = "meta-llama/Llama-3.1-70B-Instruct"

model_id

'meta-llama/Llama-3.1-70B-Instruct'

### 1.7. Run a simple chat completion

We will test the client by doing a simple chat completion.

In [4]:
response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
)

print(response.completion_message.content)

Softly walks the gentle llama, 
Gracing fields with gentle drama.


### 1.8. Have a conversation

Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session.

Remember to type `quit`, `exit`, or `bye` when you are done chatting.

In [7]:
from termcolor import cprint

def chat_loop():
    conversation_history = []
    while True:
        user_input = input('User> ')
        if user_input.lower() in ['exit', 'quit', 'bye']:
            cprint('Ending conversation. Goodbye!', 'yellow')
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.inference.chat_completion(
            messages=conversation_history,
            model_id=model_id,
        )
        cprint(f'> Response: {response.completion_message.content}', 'cyan')

        assistant_message = {
            "role": "assistant",
            "content": response.completion_message.content,
        }
        conversation_history.append(assistant_message)

chat_loop()
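
To make the history mechanics concrete without needing an API key, here is a small sketch that swaps the real model call for a stand-in function. `run_turn` and `fake_complete` are hypothetical names introduced for illustration; only the message shape (`role` and `content`) comes from the cell above.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_turn(
    history: List[Message],
    user_text: str,
    complete: Callable[[List[Message]], str],
) -> str:
    """Append the user message, call the model on the full history, store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_complete(messages: List[Message]) -> str:
    # Stand-in for a model call: reports how many messages it was shown.
    return f"I can see {len(messages)} message(s)."


history: List[Message] = []
first = run_turn(history, "Hi", fake_complete)
second = run_turn(history, "What did I just say?", fake_complete)
print(first)
print(second)
```

The second call sees three messages (user, assistant, user), which is what lets a real model refer back to earlier turns.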


### 1.9. Streaming output

You can pass `stream=True` to stream responses from the model. You can then loop through the responses.

In [5]:
from llama_stack_client.lib.inference.event_logger import EventLogger
from termcolor import cprint

message = {
    "role": "user",
    "content": 'Write me a sonnet about llama'
}
cprint(f'User> {message["content"]}', 'green')

response = client.inference.chat_completion(
    messages=[message],
    model_id=model_id,
    stream=True,   # <-----------
)

# Print the tokens while they are received
for log in EventLogger().log(response):
    log.print()

User> Write me a sonnet about llama
Assistant> In Andean highlands, where the air is thin,
A gentle creature roams with soft design,
The llama, with its coat of varied skin,
A quiet beauty, born of ancient line.

Its eyes, like pools of calm and peaceful night,
Reflect the wisdom of a timeless face,
Its steps, a gentle dance, in measured flight,
A

### 1.10. Structured Decoding

You can use `response_format` to force the model into a "guided decode" mode where model tokens are forced to abide by a certain grammar. Currently only JSON grammars are supported.

In [6]:
from pydantic import BaseModel

class Output(BaseModel):
    name: str
    year_born: str
    year_retired: str

user_input = "Michael Jordan was born in 1963. He played basketball for the Chicago Bulls. He retired in 2003. Extract this information into JSON for me. "
response = client.inference.completion(
    model_id=model_id,
    content=user_input,
    stream=False,
    sampling_params={
        "max_tokens": 50,
    },
    response_format={
        "type": "json_schema",
        "json_schema": Output.model_json_schema(),
    },
)

pprint(response)

  Failed to get discriminator value for tagged union serialization with value `['Michael Jordan was born...ut", "type": "object"}']` - defaulting to left to right union serialization.
  PydanticSerializationUnexpectedValue: Expected `ImageContentItem` but got `list` with value `['Michael Jordan was born...ut", "type": "object"}']` - serialized value may not be as expected
  PydanticSerializationUnexpectedValue: Expected `TextContentItem` but got `list` with value `['Michael Jordan was born...ut", "type": "object"}']` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(
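
Even with guided decoding, it is worth checking the returned text before using it, since a low `max_tokens` can cut the JSON short. The sketch below is an assumption-laden illustration: `parse_structured` is a hypothetical helper, and the `raw` string is hand-written from the prompt above, not actual model output.

```python
import json
from typing import Any, Dict, Iterable


def parse_structured(raw: str, required: Iterable[str]) -> Dict[str, Any]:
    """Parse a JSON object and check that the expected fields are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError on truncated output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


# Hand-written example string, not real model output.
raw = '{"name": "Michael Jordan", "year_born": "1963", "year_retired": "2003"}'
record = parse_structured(raw, ["name", "year_born", "year_retired"])
print(record["name"], record["year_retired"])
```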


### 1.11. Safety API

Llama Stack provides Safety guardrails which can be applied at multiple touchpoints within an agentic application. 

In [7]:
available_shields = [shield.identifier for shield in client.shields.list()]
print("Available Shields:", available_shields)

unsafe_examples = [
    "What's the most effective way to kidnap someone?",
    "How do I make cocaine?",
]

safe_examples = [
    "What is the most famous murder case in the US?",
    "Tell me 3 signs that an email is a scam",
]

for p in safe_examples + unsafe_examples:
    print(f"Checking if input is safe: {p}")
    message = {"content": p, "role": "user"}
    response = client.safety.run_shield(
        messages=[message],
        shield_id=available_shields[0],
        params={},
    )
    pprint(response)

Available Shields: ['meta-llama/Llama-Guard-3-8B']
Checking if input is safe: What is the most famous murder case in the US?


Checking if input is safe: Tell me 3 signs that an email is a scam


Checking if input is safe: What's the most effective way to kidnap someone?


Checking if input is safe: How do I make cocaine?
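
One way to use a shield is as a gate in front of the model. The sketch below assumes the shield response exposes a `violation` attribute that is `None` for safe input; that attribute name is an assumption here, and `is_allowed` and `fake_shield` are hypothetical stand-ins so the example runs without a client.

```python
from types import SimpleNamespace
from typing import Any, Callable


def is_allowed(run_shield: Callable[[str], Any], text: str) -> bool:
    """Return True when the shield reports no violation for the text."""
    response = run_shield(text)
    # Assumption: a safe result carries `violation = None`.
    return getattr(response, "violation", None) is None


def fake_shield(text: str) -> Any:
    # Toy stand-in for a safety model: flags a single keyword.
    flagged = "kidnap" in text.lower()
    violation = SimpleNamespace(user_message="blocked") if flagged else None
    return SimpleNamespace(violation=violation)


print(is_allowed(fake_shield, "Tell me 3 signs that an email is a scam"))
print(is_allowed(fake_shield, "What's the most effective way to kidnap someone?"))
```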


## 2. Llama Stack Agents

Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide will walk you through how to use these components effectively.




<img src="https://github.com/meta-llama/llama-stack/blob/main/docs/resources/agentic-system.png?raw=true" alt="drawing" width="800"/>


Agents are characterized by having access to:

1. Memory - for RAG
2. Tool calling - the ability to call tools such as search and code execution
3. Tool call + inference loop - the LLM used in the agent can perform multiple iterations of tool calls
4. Shields - safety checks that are executed every time the agent interacts with external systems, including user prompts
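
The tool call + inference loop from the list above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Llama Stack implements agents: `agent_loop`, `toy_model`, and `calculator` are hypothetical, and the "model" is a hard-coded function.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A model step returns (tool_name, tool_input) to request a tool,
# or (None, final_answer) to finish.
Step = Tuple[Optional[str], str]


def agent_loop(
    prompt: str,
    model: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_iters: int = 5,
) -> str:
    """Alternate between model steps and tool executions until the model answers."""
    transcript = [f"user: {prompt}"]
    for _ in range(max_iters):
        tool_name, payload = model(transcript)
        if tool_name is None:
            return payload
        result = tools[tool_name](payload)
        transcript.append(f"tool[{tool_name}]: {result}")
    raise RuntimeError("agent did not finish within max_iters")


def toy_model(transcript: List[str]) -> Step:
    # First step: ask for the calculator. Second step: answer from its result.
    if not any(line.startswith("tool[") for line in transcript):
        return ("calculator", "6*7")
    return (None, f"The answer is {transcript[-1].split(': ', 1)[1]}")


def calculator(expr: str) -> str:
    left, right = expr.split("*")
    return str(int(left) * int(right))


answer = agent_loop("What is 6 times 7?", toy_model, {"calculator": calculator})
print(answer)
```

The `max_iters` bound is a design choice in this sketch: it keeps a misbehaving model from looping on tool calls forever.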

### 2.1. RAG Agent

In this example, we will index some documentation and ask questions about that documentation.
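
Indexing for RAG usually means splitting each document into overlapping chunks before embedding them; a later cell in this guide registers a memory bank with `chunk_size_in_tokens` and `overlap_size_in_tokens` for that purpose. The sketch below shows the general idea only: `chunk_text` is a hypothetical helper, and it counts whitespace-separated words rather than real model tokens.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into word chunks where consecutive chunks share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()  # assumption: whitespace words stand in for tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


chunks = chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1)
print(chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.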

In [4]:
from llama_stack_client.lib.agents.agent import Agent, AugmentConfigWithMemoryTool
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from termcolor import cprint
from llama_stack_client.types.memory_insert_params import Document

urls = ["chat.rst", "llama3.rst", "datasets.rst", "lora_finetune.rst"]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
)

memory_bank_id = AugmentConfigWithMemoryTool(agent_config, client)
rag_agent = Agent(client, agent_config)
client.memory.insert(
    bank_id=memory_bank_id,
    documents=documents,
)
session_id = rag_agent.create_session("test-session")
user_prompts = [
        "What are the top 5 topics that were explained? Only list succinct bullet points.",
]
for prompt in user_prompts:
    cprint(f'User> {prompt}', 'green')
    response = rag_agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
        tools=[
            {
                "name": "memory",
                "args": {
                    "memory_bank_id": memory_bank_id,
                },
            }
        ],
    )
    for log in EventLogger().log(response):
        log.print()

User> What are the top 5 topics that were explained? Only list succinct bullet points.
tools_for_turn: [AgentToolWithArgs(name='memory', args={'memory_bank_id': 'memory_bank_1d984362-ef6c-468e-b5eb-a12b0d782783'})]
tools_for_turn_set: {'memory'}
tool_name: memory
[30m[0mtool_def: identifier='memory' provider_resource_id='memory' provider_id='memory-runtime' type='tool' tool_group='memory_group' tool_host=<ToolHost.distribution: 'distribution'> description='Memory tool to retrieve memory from a memory bank based on context of the input messages and attachments' parameters=[ToolParameter(name='input_messages', parameter_type='list', description='Input messages for which to retrieve memory', required=True, default=None)] built_in_type=None metadata={'config': {'memory_bank_configs': [{'bank_id': 'memory_bank_1d984362-ef6c-468e-b5eb-a12b0d782783', 'type': 'vector'}]}} tool_prompt_format=<ToolPromptFormat.json: 'json'>
tool_defs: {'memory': ToolDefinition(tool_name='memory', desc

tool_execution> Tool:memory Args:{'query': '{"role":"user","content":"What are the top 5 topics that were explained? Only list succinct bullet points.","context":null}', 'memory_bank_id': 'memory_bank_1d984362-ef6c-468e-b5eb-a12b0d782783'}
tool_execution> fetched 10237 bytes from memory
inference> 

  Failed to get discriminator value for tagged union serialization with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - defaulting to left to right union serialization.
  PydanticSerializationUnexpectedValue: Expected `ImageContentItem` but got `list` with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - serialized value may not be as expected
  PydanticSerializationUnexpectedValue: Expected `TextContentItem` but got `list` with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_json(
  Failed to get discriminator value for tagged union serialization with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - defaulting to left to right union serialization.
  PydanticSerializationUnexpectedValue: Expected `ImageContentItem` but got `list` with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - serialized value may not be as expected
  PydanticSer

* Llama2 vs Llama3
* Prompt templates
* Tokenization
* Special tokens
* Multiturn conversations


### 2.2. Search agent

In this example, we will show how the model can invoke web search to answer questions. This requires an API key for the search tool.

In this tutorial, we will use [Tavily](https://tavily.com) as the search provider that the model calls in its agentic loop. Note that the "type" of the tool is still "brave_search", since Llama models have been trained with Brave Search as a built-in tool. Tavily is simply used in place of Brave Search.

See steps [here](https://docs.google.com/document/d/1Vg998IjRW_uujAPnHdQ9jQWvtmkZFt74FldW2MblxPY/edit?tab=t.0#heading=h.xx02wojfl2f9).

In [9]:
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    tools=["brave_search"],
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Hello",
    "Which teams played in the NBA western conference finals of 2024",
]

session_id = agent.create_session("test-session")
for prompt in user_prompts:
    cprint(f'User> {prompt}', 'green')
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()


User> Hello
inference> Hello. How can I assist you today?
User> Which teams played in the NBA western conference finals of 2024
inference> brave_search.call(query="NBA Western Conference Finals 2024 teams")
tool_execution> Tool:brave_search Args:{'query': 'NBA Western Conference Finals 2024 teams'}
tool_execution> Tool:brave_search Response:{"query": "NBA Western Conference Finals 2024 teams", "top_k": [{"title": "2024 Playoffs: West Finals | Timberwolves (3) vs. Mavericks (5)", "url": "https://www.nba.com/playoffs/2024/west-final", "content": "The Dallas Mavericks and Minnesota Timberwolves have advanced to the 2024 Western Conference Finals during the NBA playo

### 2.3. Code Execution Agent

In this example, we will show how multiple tools can be called by the model - including web search and code execution. It will use bubblewrap that we installed earlier to execute the generated code.
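
To illustrate the general idea of executing generated code outside the main process, here is a minimal sketch that runs a snippet in a child Python interpreter with a timeout. This is a swapped-in simplification, not the bubblewrap mechanism: a plain subprocess provides no filesystem or network isolation, and `run_generated_code` is a hypothetical helper.

```python
import subprocess
import sys


def run_generated_code(code: str, timeout_s: float = 10.0) -> str:
    """Run a Python snippet in a child interpreter and return its stdout.

    No sandboxing is applied here; a tool like bubblewrap would add isolation.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired if exceeded
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout


output = run_generated_code("print(sum(range(5)))")
print(output)
```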

In [6]:
agent_config = AgentConfig(
    sampling_params = {
        "max_tokens" : 4096,
        "temperature": 0.0
    },
    model=model_id,
    instructions="You are a helpful assistant",
    tools=[
        "brave_search",
        "code_interpreter",
    ],
    tool_choice="required",
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)

memory_bank_id = "inflation_data_memory_bank"
client.memory_banks.register(
    memory_bank_id=memory_bank_id,
    params={
        "memory_bank_type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size_in_tokens": 512,
        "overlap_size_in_tokens": 64,
    },
)
AugmentConfigWithMemoryTool(agent_config, client)
codex_agent = Agent(client, agent_config)
session_id = codex_agent.create_session("test-session")

client.memory.insert(
    bank_id=memory_bank_id,
    documents=[
        Document(
            document_id="inflation",
            content="https://raw.githubusercontent.com/meta-llama/llama-stack-apps/main/examples/resources/inflation.csv",
            mime_type="text/csv",
            metadata={},
        )
    ],
)

user_prompts = [
    {"prompt": "Can you describe the data in the context?", "tools": [{"name": "memory", "args": {"memory_bank_id": memory_bank_id}}]},
    {"prompt": "Plot average yearly inflation as a time series", "tools": [{"name": "memory", "args": {"memory_bank_id": memory_bank_id}}, "code_interpreter"]},
]

# `turn` avoids shadowing the built-in `input()` function
for turn in user_prompts:
    cprint(f'User> {turn["prompt"]}', 'green')
    response = codex_agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": turn["prompt"],
            }
        ],
        session_id=session_id,
        tools=turn["tools"],
    )

    for log in EventLogger().log(response):
        log.print()


User> Can you describe the data in the context?

tools_for_turn: [AgentToolWithArgs(name='memory', args={'memory_bank_id': 'inflation_data_memory_bank'})]
tools_for_turn_set: {'memory'}
tool_name: memory
tool_def: identifier='memory' provider_resource_id='memory' provider_id='memory-runtime' type='tool' tool_group='memory_group' tool_host=<ToolHost.distribution: 'distribution'> description='Memory tool to retrieve memory from a memory bank based on context of the input messages and attachments' parameters=[ToolParameter(name='input_messages', parameter_type='list', description='Input messages for which to retrieve memory', required=True, default=None)] built_in_type=None metadata={'config': {'memory_bank_configs': [{'bank_id': 'memory_bank_1d984362-ef6c-468e-b5eb-a12b0d782783', 'type': 'vector'}]}} tool_prompt_format=<ToolPromptFormat.json: 'json'>
tool_name: code_interpreter
tool_name: brave_search
tool_defs: {'memory': ToolDefinition(tool_name='memory', description='Memory tool to retrieve memory from a memory bank based on context

tool_execution> Tool:memory Args:{'query': '{"role":"user","content":"Can you describe the data in the context?","context":null}', 'memory_bank_id': 'inflation_data_memory_bank'}
tool_execution> fetched 3079 bytes from memory
inference> The data provided appears to be a list of inflation rates for a specific country or region, organized by year and month. The data spans from January 2014 to June 2023.

The format is a comma-separated values (CSV) table with the following columns:

  Failed to get discriminator value for tagged union serialization with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - defaulting to left to right union serialization.
  PydanticSerializationUnexpectedValue: Expected `ImageContentItem` but got `list` with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - serialized value may not be as expected
  PydanticSerializationUnexpectedValue: Expected `TextContentItem` but got `list` with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_json(
  Failed to get discriminator value for tagged union serialization with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - defaulting to left to right union serialization.
  PydanticSerializationUnexpectedValue: Expected `ImageContentItem` but got `list` with value `[TextContentItem(type='te...TRIEVED-CONTEXT ===\n')]` - serialized value may not be as expected
  PydanticSer

tools_for_turn: [AgentToolWithArgs(name='memory', args={'memory_bank_id': 'inflation_data_memory_bank'}), 'code_interpreter']
tools_for_turn_set: {'memory', 'code_interpreter'}
tool_name: memory
tool_def: identifier='memory' provider_resource_id='memory' provider_id='memory-runtime' type='tool' tool_group='memory_group' tool_host=<ToolHost.distribution: 'distribution'> description='Memory tool to retrieve memory from a memory bank based on context of the input messages and attachments' parameters=[ToolParameter(name='input_messages', parameter_type='list', description='Input messages for which to retrieve memory', required=True, default=None)] built_in_type=None metadata={'config': {'memory_bank_configs': [{'bank_id': 'memory_bank_1d984362-ef6c-468e-b5eb-a12b0d782783', 'type': 'vector'}]}} tool_prompt_format=<ToolPromptFormat.json: 'json'>
tool_name: code_interpreter
tool_def: identifier='code_interpreter' provider_resource_id='code_interpreter' provider_id='code-interpreter' type='too


tool_execution> Tool:memory Args:{'query': '{"role":"user","content":"Plot average yearly inflation as a time series","context":null}', 'memory_bank_id': 'inflation_data_memory_bank'}
tool_execution> fetched 3079 bytes from memory
inference> import pandas as pd

# Define the data
data = {
    "Year": [2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023],
    "Jan": [1.6

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


The error message indicates that the system cannot find the 'bwrap' file, which is required for the plot to be displayed. This issue is likely due to a missing or incorrect installation of the 'bwrap' package.

To fix this issue, you can try reinstalling the 'bwrap' package using pip:

pip install bwrap

If the issue persists,

- Now, run the code generated by the agent to view the plot.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the CSV file (the same file that was inserted into the memory bank earlier)
df = pd.read_csv('https://raw.githubusercontent.com/meta-llama/llama-stack-apps/main/examples/resources/inflation.csv')

# Extract the year and inflation rate from the CSV file
df['Year'] = pd.to_datetime(df['Year'], format='%Y')
df = df.rename(columns={'Jan': 'Jan Rate', 'Feb': 'Feb Rate', 'Mar': 'Mar Rate', 'Apr': 'Apr Rate', 'May': 'May Rate', 'Jun': 'Jun Rate', 'Jul': 'Jul Rate', 'Aug': 'Aug Rate', 'Sep': 'Sep Rate', 'Oct': 'Oct Rate', 'Nov': 'Nov Rate', 'Dec': 'Dec Rate'})

# Calculate the average yearly inflation rate
df['Yearly Inflation'] = df[['Jan Rate', 'Feb Rate', 'Mar Rate', 'Apr Rate', 'May Rate', 'Jun Rate', 'Jul Rate', 'Aug Rate', 'Sep Rate', 'Oct Rate', 'Nov Rate', 'Dec Rate']].mean(axis=1)

# Plot the average yearly inflation rate as a time series
plt.figure(figsize=(10, 6))
plt.plot(df['Year'], df['Yearly Inflation'], marker='o')
plt.title('Average Yearly Inflation Rate')
plt.xlabel('Year')
plt.ylabel('Inflation Rate (%)')
plt.grid(True)
plt.show()

## 3. Llama Stack Agent Evaluations


#### 3.1. Online Evaluation Dataset Collection Using Telemetry

- Llama Stack offers built-in telemetry to collect traces and data about your agentic application.
- In this example, we will show how to build an Agent with Llama Stack, and query the agent's traces into an online dataset that can be used for evaluation.  
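
As a rough picture of what "querying traces into a dataset" can produce, the sketch below turns recorded question/answer turns into evaluation rows. Everything here is illustrative: `build_eval_rows` is a hypothetical helper, the row field names are assumptions rather than a documented schema, and the turn records are hand-written instead of being read from telemetry.

```python
from typing import Dict, List


def build_eval_rows(
    turns: List[Dict[str, str]],
    expected: Dict[str, str],
) -> List[Dict[str, str]]:
    """Pair each recorded turn with a reference answer, skipping turns without one."""
    rows = []
    for turn in turns:
        question = turn["input"]
        if question not in expected:
            continue  # no reference answer available for this question
        rows.append(
            {
                "input_query": question,
                "generated_answer": turn["output"],
                "expected_answer": expected[question],
            }
        )
    return rows


# Hand-written records standing in for agent traces.
turns = [
    {
        "input": "Which teams played in the NBA western conference finals of 2024",
        "output": "The Dallas Mavericks and the Minnesota Timberwolves.",
    },
    {"input": "Hello", "output": "Hello. How can I assist you today?"},
]
expected = {
    "Which teams played in the NBA western conference finals of 2024":
        "Dallas Mavericks and Minnesota Timberwolves",
}
rows = build_eval_rows(turns, expected)
print(rows)
```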

##### 🚧 Patches 🚧
- The following cells are temporary patches to get `telemetry` working.

In [None]:
# need to install on latest main
!pip uninstall -y llama-stack
!pip install git+https://github.com/meta-llama/llama-stack.git@main

Found existing installation: llama_stack 0.0.61
Uninstalling llama_stack-0.0.61:
  Would remove:
    /usr/local/bin/install-wheel-from-presigned
    /usr/local/bin/llama
    /usr/local/lib/python3.10/dist-packages/llama_stack-0.0.61.dist-info/*
    /usr/local/lib/python3.10/dist-packages/llama_stack/*
Proceed (Y/n)? Y
  Successfully uninstalled llama_stack-0.0.61
Collecting git+https://github.com/meta-llama/llama-stack.git@main
  Cloning https://github.com/meta-llama/llama-stack.git (to revision main) to /tmp/pip-req-build-oryyzdm1
  Running command git clone --filter=blob:none --quiet https://github.com/meta-llama/llama-stack.git /tmp/pip-req-build-oryyzdm1
  Resolved https://github.com/meta-llama/llama-stack.git to commit 53b3a1e345c46d7d37c1af3d675092a4cbfe85f9
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ...

In [None]:
# disable logging for clean server logs
import logging
def remove_root_handlers():
    root_logger = logging.getLogger()
    for handler in root_logger.handlers[:]:
        root_logger.removeHandler(handler)
        print(f"Removed handler {handler.__class__.__name__} from root logger")


remove_root_handlers()

Removed handler StreamHandler from root logger


##### 3.1.1. Building a Search Agent

In [None]:
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from google.colab import userdata

agent_config = AgentConfig(
    model="meta-llama/Llama-3.1-405B-Instruct",
    instructions="You are a helpful assistant. Use search tool to answer the questions. ",
    tools=(
        [
            {
                "type": "brave_search",
                "engine": "tavily",
                "api_key": userdata.get("TAVILY_SEARCH_API_KEY")
            }
        ]
    ),
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Which teams played in the NBA western conference finals of 2024",
    "In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title.",
    "What is the British-American kickboxer Andrew Tate's kickboxing name?",
]

session_id = agent.create_session("test-session")

for prompt in user_prompts:
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )

    for log in EventLogger().log(response):
        log.print()

inference> Let me check the latest sports news.
inference> bravy_search.call(query="Bill Cosby South Park episode")
CustomTool> Unknown tool `bravy_search` was called.
inference> brave_search.call(query="Andrew Tate kickboxing name")
tool_execution> Tool:brave_search Args:{'query': 'Andrew Tate kickboxing name'}
tool_execution> Tool:brave_search Response:{"query": "Andrew Tate kickboxing name", "top_k": [{"title": "Andrew Tate kickboxing record: How many championships ... - FirstSportz", "url": "https://firstsportz.com/mma-how-many-championships-does-andrew-tate-have/", "content": "Andrew Tate's Kickboxing career. During his kickboxing career, he used the nickname \"King Cobra,\" which he currently uses as his Twitter name. Tate had an unorthodox style of movement inside the ring. He kept his hands down most of the time and relied on quick jabs and an overhand right to land significant strikes.", "score": 0.9996244, "raw_content": null}, {"title": "Andrew Tate: Kickboxing Record, Facts

##### 3.1.2. Query Telemetry

In [None]:
import json
from rich.pretty import pprint

print(f"Getting traces for session_id={session_id}")

agent_logs = []

# query the spans recorded for this session, returning only their input/output attributes
for span in client.telemetry.query_spans(
    attribute_filters=[
        {"key": "session_id", "op": "eq", "value": session_id},
    ],
    attributes_to_return=["input", "output"],
):
    # skip spans whose output is the "no shields" placeholder
    if span.attributes["output"] != "no shields":
        agent_logs.append(span.attributes)

pprint(agent_logs)

Getting traces for session_id=ac651ce8-2281-47f2-8814-ef947c066e40


##### 3.1.3. Post-Process Telemetry Results & Evaluate

- Now, we want to run an evaluation to assert that our search agent successfully calls brave_search, based on the online traces.
- We will first post-process the agent's telemetry logs and then run the evaluation.
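
As a minimal sketch of the post-processing idea, the snippet below filters a list of hypothetical log entries down to those whose last input message comes from the user. It assumes each entry's `input` is a list of JSON-encoded message strings. The sample data and the helper names `is_user_message` and `build_eval_rows` are illustrative assumptions, not actual telemetry output or Llama Stack APIs. Parsing each message with `json.loads` avoids depending on the exact whitespace of the serialized JSON.

```python
import json

# Hypothetical log entries; real span attributes may be shaped differently.
sample_logs = [
    {
        "input": ['{"role": "user", "content": "Who won the match?"}'],
        "output": 'brave_search.call(query="match winner")',
    },
    {
        "input": ['{"role":"ipython","content":"search results"}'],
        "output": "The home team won.",
    },
]


def is_user_message(raw_message: str) -> bool:
    """Return True if a JSON-encoded message has role 'user'."""
    try:
        message = json.loads(raw_message)
    except json.JSONDecodeError:
        return False
    return isinstance(message, dict) and message.get("role") == "user"


def build_eval_rows(logs, expected_tool="brave_search"):
    """Keep only logs whose last input message was sent by the user."""
    rows = []
    for log in logs:
        last_msg = log["input"][-1]
        if is_user_message(last_msg):
            rows.append(
                {
                    "input_query": last_msg,
                    "generated_answer": log["output"],
                    "expected_answer": expected_tool,
                }
            )
    return rows


sample_rows = build_eval_rows(sample_logs)
print(len(sample_rows))
```

Only the first hypothetical entry ends with a user message, so it is the only one turned into an evaluation row.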

In [None]:
# post-process telemetry spans and prepare data for eval
# in this case, we want to assert that every user prompt is followed by a tool call
import ast
import json

eval_rows = []

for log in agent_logs:
    last_msg = log['input'][-1]
    if "\"role\":\"user\"" in last_msg:
        eval_rows.append(
            {
                "input_query": last_msg,
                "generated_answer": log["output"],
                # check if generated_answer uses the brave_search tool
                "expected_answer": "brave_search",
            },
        )

pprint(eval_rows)
scoring_params = {
    "basic::subset_of": None,
}
scoring_response = client.scoring.score(input_rows=eval_rows, scoring_functions=scoring_params)
pprint(scoring_response)
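
To sanity-check rows before sending them to the scoring API, the check can be approximated locally. This sketch assumes that `basic::subset_of` awards 1.0 when `expected_answer` appears as a substring of `generated_answer`; that behavior is an assumption about the scoring function. `local_subset_score` is a hypothetical helper and not part of Llama Stack.

```python
def local_subset_score(row: dict) -> float:
    """Approximate a subset-of check: 1.0 if the expected text occurs in the answer."""
    return 1.0 if row["expected_answer"] in row["generated_answer"] else 0.0


# Hypothetical rows mirroring the two tool calls seen in the agent output above.
demo_rows = [
    {
        "generated_answer": 'brave_search.call(query="Andrew Tate kickboxing name")',
        "expected_answer": "brave_search",
    },
    {
        "generated_answer": 'bravy_search.call(query="Bill Cosby South Park episode")',
        "expected_answer": "brave_search",
    },
]

demo_scores = [local_subset_score(row) for row in demo_rows]
print(demo_scores)  # prints [1.0, 0.0]
```

The misspelled `bravy_search` call scores 0.0 under this check, which is the kind of failure the evaluation is meant to surface.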

#### 3.2. Agentic Application Dataset Scoring
- Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.

- In this example, we will work with an example RAG dataset you have built previously, label it with an annotation, and use LLM-As-Judge with a custom judge prompt for scoring. Please check out our [Llama Stack Playground](https://llama-stack.readthedocs.io/en/latest/playground/index.html) for an interactive interface to upload datasets and run scoring.

In [None]:
import rich
from rich.pretty import pprint

judge_model_id = "meta-llama/Llama-3.1-405B-Instruct-FP8"

JUDGE_PROMPT = """
Given a QUESTION and GENERATED_RESPONSE and EXPECTED_RESPONSE.

Compare the factual content of the GENERATED_RESPONSE with the EXPECTED_RESPONSE. Ignore any differences in style, grammar, or punctuation.
  The GENERATED_RESPONSE may either be a subset or superset of the EXPECTED_RESPONSE, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
  (A) The GENERATED_RESPONSE is a subset of the EXPECTED_RESPONSE and is fully consistent with it.
  (B) The GENERATED_RESPONSE is a superset of the EXPECTED_RESPONSE and is fully consistent with it.
  (C) The GENERATED_RESPONSE contains all the same details as the EXPECTED_RESPONSE.
  (D) There is a disagreement between the GENERATED_RESPONSE and the EXPECTED_RESPONSE.
  (E) The answers differ, but these differences don't matter from the perspective of factuality.

Give your answer in the format "Answer: One of ABCDE, Explanation: ".

Your actual task:

QUESTION: {input_query}
GENERATED_RESPONSE: {generated_answer}
EXPECTED_RESPONSE: {expected_answer}
"""

input_query = "What are the top 5 topics that were explained? Only list succinct bullet points."
generated_answer = """
Here are the top 5 topics that were explained in the documentation for Torchtune:

* What is LoRA and how does it work?
* Fine-tuning with LoRA: memory savings and parameter-efficient finetuning
* Running a LoRA finetune with Torchtune: overview and recipe
* Experimenting with different LoRA configurations: rank, alpha, and attention modules
* LoRA finetuning
"""
expected_answer = """LoRA"""

rows = [
    {
        "input_query": input_query,
        "generated_answer": generated_answer,
        "expected_answer": expected_answer,
    },
]

scoring_params = {
    "llm-as-judge::base": {
        "judge_model": judge_model_id,
        "prompt_template": JUDGE_PROMPT,
        "type": "llm_as_judge",
        "judge_score_regexes": ["Answer: (A|B|C|D|E)"],
    },
    "basic::subset_of": None,
}

response = client.scoring.score(input_rows=rows, scoring_functions=scoring_params)
pprint(response)
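
The following sketch illustrates, under stated assumptions, how a judge prompt template and a score regex like the ones above could fit together: the template placeholders are filled from a row, and the first capture group of a matching regex is taken as the grade. The judge output string is made up for illustration, and `extract_grade` is a hypothetical helper; how the scoring provider actually applies `judge_score_regexes` may differ.

```python
import re

# Shortened stand-in for the judge prompt; placeholders match the row keys.
demo_template = (
    "QUESTION: {input_query}\n"
    "GENERATED_RESPONSE: {generated_answer}\n"
    "EXPECTED_RESPONSE: {expected_answer}\n"
)

demo_row = {
    "input_query": "What topic was explained?",
    "generated_answer": "LoRA finetuning with Torchtune",
    "expected_answer": "LoRA",
}

# Fill the placeholders with the row's fields.
filled_prompt = demo_template.format(**demo_row)


def extract_grade(judge_output, patterns):
    """Return the first capture group of the first matching pattern, else None."""
    for pattern in patterns:
        match = re.search(pattern, judge_output)
        if match:
            return match.group(1)
    return None


# Made-up judge reply following the "Answer: ..., Explanation: ..." format.
hypothetical_judge_output = "Answer: B, Explanation: the response adds consistent detail."
grade = extract_grade(hypothetical_judge_output, ["Answer: (A|B|C|D|E)"])
print(grade)  # prints B
```

If the judge model strays from the requested format, no pattern matches and the helper returns `None`, so it is worth keeping the answer format in the prompt and the regex in sync.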