# Llama Stack Showcase


<img src="https://llama-stack.readthedocs.io/en/latest/_images/llama-stack.png" alt="drawing" width="500"/>

[Llama Stack](https://github.com/meta-llama/llama-stack) defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.

Read more about the project: https://llama-stack.readthedocs.io/en/latest/index.html

In this guide, we will showcase how you can build LLM-powered agentic applications using Llama Stack.


## 1. Getting started with Llama Stack

### 1.1. Create TogetherAI account


To run inference with the Llama models, you need an inference provider. Llama Stack supports a number of inference [providers](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/remote/inference).


In this showcase, we will use [together.ai](https://www.together.ai/) as the inference provider, so first get an API key from Together if you don't have one already.

Follow the steps [here](https://docs.google.com/document/d/1Vg998IjRW_uujAPnHdQ9jQWvtmkZFt74FldW2MblxPY/edit?usp=sharing).

You can also use Fireworks.ai or even Ollama if you would like to.



> **Note:**  Set the API Key in the Secrets of this notebook



### 1.2. Install Llama Stack

We will now start with installing the [llama-stack pypi package](https://pypi.org/project/llama-stack).

In addition, we will install [bubblewrap](https://github.com/containers/bubblewrap), a low-level, lightweight container framework that runs in the user namespace. We will use it to execute code generated by Llama in one of the examples.

In [None]:
!apt-get install -y bubblewrap
!pip install -U llama-stack

### 1.3. Configure Llama Stack for Together


Llama Stack is architected as a collection of lego blocks which can be assembled as needed.


Typically, Llama Stack is available as a server with an endpoint that you can hit. We call this endpoint a [Distribution](https://llama-stack.readthedocs.io/en/latest/concepts/index.html#distributions). Partners like Together and Fireworks offer their own Llama Stack Distribution endpoints.

In this showcase, we are going to use Llama Stack inline as a library. So, given a particular set of providers, we must first package up the right set of dependencies. We have a template to use Together as an inference provider and [faiss](https://ai.meta.com/tools/faiss/) for memory/RAG.

We will run `llama stack build` to install all dependencies.

In [None]:
# This will build all the dependencies you will need
!llama stack build --template together --image-type venv

### 1.4. Initialize Llama Stack

Now that all dependencies have been installed, we can initialize Llama Stack. We will first set the `TOGETHER_API_KEY` environment variable.


In [None]:
import os
from google.colab import userdata

os.environ['TOGETHER_API_KEY'] = userdata.get('TOGETHER_API_KEY')

from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("together")
_ = client.initialize()
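
The cell above reads the key from Colab's `userdata`, which is not available outside Colab. A minimal sketch of the same step for a local run, assuming the key has already been exported in the shell; the helper name `require_env` and the `DEMO_API_KEY` variable are hypothetical and not part of Llama Stack:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before initializing the client")
    return value


# Demonstration with a throwaway variable name (hypothetical, not a real key).
os.environ["DEMO_API_KEY"] = "demo-value"
print(require_env("DEMO_API_KEY"))  # prints demo-value
```

In a local script you would call `require_env("TOGETHER_API_KEY")` before creating the client, so a missing key fails early with a readable error.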

### 1.5. Check available models and shields

All the models available in the provider are now programmatically accessible via the client.

In [None]:
print("Available models:")
for m in client.models.list():
    print(f"{m.identifier} (provider's alias: {m.provider_resource_id}) ")

print("----")
print("Available shields (safety models):")
for s in client.shields.list():
    print(s.identifier)
print("----")

Available models:
meta-llama/Llama-3.1-8B-Instruct (provider's alias: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo) 
meta-llama/Llama-3.1-70B-Instruct (provider's alias: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo) 
meta-llama/Llama-3.1-405B-Instruct-FP8 (provider's alias: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo) 
meta-llama/Llama-3.2-3B-Instruct (provider's alias: meta-llama/Llama-3.2-3B-Instruct-Turbo) 
meta-llama/Llama-3.2-11B-Vision-Instruct (provider's alias: meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo) 
meta-llama/Llama-3.2-90B-Vision-Instruct (provider's alias: meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo) 
meta-llama/Llama-Guard-3-8B (provider's alias: meta-llama/Meta-Llama-Guard-3-8B) 
meta-llama/Llama-Guard-3-11B-Vision (provider's alias: meta-llama/Llama-Guard-3-11B-Vision-Turbo) 
----
Available shields (safety models):
meta-llama/Llama-Guard-3-8B
----


### 1.6. Pick the model

We will use Llama3.2-3B-Instruct for our examples.

In [None]:
model_id = "meta-llama/Llama-3.2-3B-Instruct"
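
Since the identifier is a plain string, a typo only surfaces at the first inference call. A small sketch that checks the choice against the listing first; `pick_model` is a hypothetical helper, and the list is copied from the output in section 1.5 rather than fetched from a live client:

```python
def pick_model(available: list[str], preferred: str) -> str:
    """Return `preferred` if the provider lists it, else raise with the valid choices."""
    if preferred in available:
        return preferred
    choices = ", ".join(sorted(available))
    raise ValueError(f"{preferred!r} is not available; choose one of: {choices}")


# Identifiers copied from the listing in section 1.5. With a live client this
# list could be built as [m.identifier for m in client.models.list()].
available = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "meta-llama/Llama-3.1-70B-Instruct",
    "meta-llama/Llama-3.1-405B-Instruct-FP8",
    "meta-llama/Llama-3.2-3B-Instruct",
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "meta-llama/Llama-3.2-90B-Vision-Instruct",
    "meta-llama/Llama-Guard-3-8B",
    "meta-llama/Llama-Guard-3-11B-Vision",
]

model_id = pick_model(available, "meta-llama/Llama-3.2-3B-Instruct")
print(model_id)
```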

### 1.7. Run a simple chat completion

We will test the client by doing a simple chat completion with the model selected above.

In [None]:
response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a friendly assistant."},
        {"role": "user", "content": "Write a two-sentence poem about llama."}
    ],
)

print(response.completion_message.content)

Here is a two-sentence poem about llamas:

With soft fur and gentle eyes, the llama wanders by,
A quiet companion, roaming the Andean sky.


### 1.8. Have a conversation

Maintaining a conversation history allows the model to retain context from previous interactions. Use a list to accumulate messages, enabling continuity throughout the chat session.

Remember to type `quit`, `exit`, or `bye` when you are done chatting.

In [None]:
from termcolor import cprint

async def chat_loop():
    conversation_history = []
    while True:
        user_input = input('User> ')
        if user_input.lower() in ['exit', 'quit', 'bye']:
            cprint('Ending conversation. Goodbye!', 'yellow')
            break

        user_message = {"role": "user", "content": user_input}
        conversation_history.append(user_message)

        response = client.inference.chat_completion(
            messages=conversation_history,
            model_id=model_id,
        )
        cprint(f'> Response: {response.completion_message.content}', 'cyan')

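        # Store the reply as an assistant turn so the model can tell its own
        # answers apart from the user's messages.
        assistant_message = {
            "role": "assistant",
            "content": response.completion_message.content,
            "stop_reason": response.completion_message.stop_reason,
        }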

await chat_loop()
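
The history list grows with every exchange, so a long chat eventually exceeds the model's context window. A minimal sketch of one way to bound it, keeping only the most recent messages; `add_turn` and `max_messages` are hypothetical names, and the message dicts are simplified to `role` and `content`:

```python
def add_turn(history, user_text, assistant_text, max_messages=6):
    """Append one user/assistant exchange and keep only the most recent messages."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    # Drop the oldest messages in pairs so the history always starts with a user turn.
    overflow = len(history) - max_messages
    if overflow > 0:
        del history[: overflow + (overflow % 2)]
    return history


history = []
for i in range(5):
    add_turn(history, f"question {i}", f"answer {i}", max_messages=4)

print([m["content"] for m in history])
# prints ['question 3', 'answer 3', 'question 4', 'answer 4']
```

Dropping whole exchanges is the simplest policy; summarizing older turns is another option when early context matters.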


### 1.9. Streaming output

You can pass `stream=True` to stream responses from the model. You can then loop through the responses.

In [None]:
from llama_stack_client.lib.inference.event_logger import EventLogger

message = {
    "role": "user",
    "content": 'Write me a sonnet about llama'
}
cprint(f'User> {message["content"]}', 'green')

response = client.inference.chat_completion(
    messages=[message],
    model_id=model_id,
    stream=True,   # <-----------
)

# Print the tokens while they are received
for log in EventLogger().log(response):
    log.print()
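
If you also need the full text after streaming, the usual pattern is to print each piece as it arrives and join the pieces at the end. A self-contained sketch of that pattern; `fake_stream` is a stand-in generator written for this example, not the Llama Stack streaming response:

```python
from typing import Iterable, Iterator


def fake_stream(text: str, chunk_size: int = 8) -> Iterator[str]:
    """Stand-in for a streaming response: yield the text in small pieces."""
    for start in range(0, len(text), chunk_size):
        yield text[start : start + chunk_size]


def consume(chunks: Iterable[str]) -> str:
    """Print each chunk as it arrives and return the assembled text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()
    return "".join(parts)


full_text = consume(fake_stream("Llamas graze where the high winds blow."))
```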

## 2. Llama Stack Agents

Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide will walk you through how to use these components effectively.




<img src="https://github.com/meta-llama/llama-stack/blob/main/docs/resources/agentic-system.png?raw=true" alt="drawing" width="800"/>


Agents are characterized by having access to

1. Memory - for RAG
2. Tool calling - ability to call tools like search and code execution
3. Tool call + Inference loop - the LLM used in the agent can run multiple rounds of tool calls and inference before producing its final answer
4. Shields - for safety checks that are executed every time the agent interacts with external systems, including user prompts
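
The tool call + inference loop from item 3 can be sketched in a few lines. This is a toy illustration only: `stub_model`, `calculator`, and `run_agent` are hypothetical stand-ins written for this example and do not reflect how Llama Stack implements its agents.

```python
def calculator(expression: str) -> str:
    """Toy tool: add the integers in a string such as '2+3+4'."""
    return str(sum(int(part) for part in expression.split("+")))


TOOLS = {"calculator": calculator}


def stub_model(messages):
    """Stand-in for the LLM: request a tool once, then answer from its result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"type": "answer", "content": f"The result is {last['content']}."}
    return {"type": "tool_call", "tool": "calculator", "args": {"expression": "2+3+4"}}


def run_agent(user_prompt: str, max_iterations: int = 5) -> str:
    """Alternate between model steps and tool executions until an answer appears."""
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_iterations):
        step = stub_model(messages)
        if step["type"] == "answer":
            return step["content"]
        # Execute the requested tool and feed the result back to the model.
        result = TOOLS[step["tool"]](**step["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("no final answer within the iteration limit")


print(run_agent("What is 2+3+4?"))  # prints: The result is 9.
```

The iteration limit matters in practice: without it, a model that keeps requesting tools would loop forever.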

### 2.1. RAG Agent

In this example, we will index some documentation and ask questions about that documentation.

In [None]:
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.types import Attachment

urls = ["chat.rst", "llama3.rst", "datasets.rst", "lora_finetune.rst"]
attachments = [
    Attachment(
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
    )
    for url in urls
]

agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    tools=[{"type": "memory"}],  # enable Memory aka RAG
    enable_session_persistence=False,
)

rag_agent = Agent(client, agent_config)
session_id = rag_agent.create_session("test-session")
user_prompts = [
    (
        "I am attaching documentation for Torchtune. Help me answer questions I will ask next.",
        attachments,
    ),
    (
        "What are the top 5 topics that were explained? Only list succinct bullet points.",
        None,
    ),
]
for prompt, attachments in user_prompts:
    cprint(f'User> {prompt}', 'green')
    response = rag_agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        attachments=attachments,
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()
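
Conceptually, RAG retrieves the document chunks most relevant to the question and adds them to the prompt. A toy sketch of that retrieval step using word overlap; this is an illustrative stand-in, not how Llama Stack or faiss rank chunks (faiss searches dense vector embeddings), and the sample chunks are made up for the example:

```python
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by how many query words they share and return the best ones."""
    query_words = tokenize(query)
    scored = []
    for chunk in chunks:
        overlap = sum((tokenize(chunk) & query_words).values())
        scored.append((overlap, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for overlap, chunk in scored[:top_k] if overlap > 0]


# Made-up chunks for illustration; not taken from the Torchtune docs.
chunks = [
    "LoRA adds small trainable low-rank matrices to frozen weights.",
    "Datasets can be loaded from local files or from a hub.",
    "Chat templates format multi-turn conversations for the model.",
]

print(retrieve("How does LoRA train low-rank matrices?", chunks, top_k=1))
```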

### 2.2. Search agent

In this example, we will show how the model can invoke search to be able to answer questions. We will first have to set the API key of the search tool.

Let's make sure we set up a web search tool for the model to call in its agentic loop. In this tutorial, we will use [Tavily](https://tavily.com) as our search provider.

See steps [here](https://docs.google.com/document/d/1Vg998IjRW_uujAPnHdQ9jQWvtmkZFt74FldW2MblxPY/edit?tab=t.0#heading=h.xx02wojfl2f9).

In [None]:
search_tool = {
    "type": "brave_search",
    "engine": "tavily",
    "api_key": userdata.get("TAVILY_SEARCH_API_KEY")
}
search_tool

In [None]:
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    tools=[search_tool],
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Hello",
    "Which teams played in the NBA western conference finals of 2024",
]

session_id = agent.create_session("test-session")
for prompt in user_prompts:
    cprint(f'User> {prompt}', 'green')
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()


User> Hello
inference> Hello! How can I assist you today?
User> Which teams played in the NBA western conference finals of 2024
inference> brave_search.call(query="NBA Western Conference Finals 2024 teams")
tool_execution> Tool:brave_search Args:{'query': 'NBA Western Conference Finals 2024 teams'}
tool_execution> Tool:brave_search Response:{"query": "NBA Western Conference Finals 2024 teams", "top_k": [{"title": "2024 NBA Western Conference Finals - Basketball-Reference.com", "url": "https://www.basketball-reference.com/playoffs/2024-nba-western-conference-finals-mavericks-vs-timberwolves.html", "content": "2024 NBA Western Conference Finals Mavericks vs. Timberwolves League Champion: Boston Celtics. Finals MVP: Jaylen Brown (20.8 / 5.4 / 5.0) 2024 Playoff Leaders: PTS: Luka Don\u010di\u0107 (635) TRB: Luka Don\u010di\u0107 (208) AST: Luka Don\u010di\u0107 (178) WS: Derrick White (2.9) More playoffs info", "score": 0.9982658, "raw_content": null}, {"title": "2024 Playoffs: West Finals
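
The tool response in the log above is a JSON object with a `top_k` list of scored results. A short sketch of reading such a payload directly; only the first result from the truncated output is reproduced, with its `content` field shortened, and `best_result` is a hypothetical helper:

```python
import json

# Only the first result from the truncated output above is reproduced here,
# with its content field shortened.
raw_response = json.dumps({
    "query": "NBA Western Conference Finals 2024 teams",
    "top_k": [
        {
            "title": "2024 NBA Western Conference Finals - Basketball-Reference.com",
            "url": "https://www.basketball-reference.com/playoffs/2024-nba-western-conference-finals-mavericks-vs-timberwolves.html",
            "content": "2024 NBA Western Conference Finals Mavericks vs. Timberwolves",
            "score": 0.9982658,
            "raw_content": None,
        }
    ],
})


def best_result(response_text: str) -> dict:
    """Return the highest-scoring entry of a search response's top_k list."""
    payload = json.loads(response_text)
    return max(payload["top_k"], key=lambda item: item["score"])


top = best_result(raw_response)
print(top["title"])
print(top["url"])
```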

### 2.3. Code Execution Agent

In this example, we will show how the model can call multiple tools, including web search and code execution. The agent uses bubblewrap, which we installed earlier, to execute the generated code.

In [None]:
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    tools=[
        search_tool,
        {
            "type": "code_interpreter",
        }
    ],
    tool_choice="required",
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)

codex_agent = Agent(client, agent_config)
session_id = codex_agent.create_session("test-session")

user_prompts = [
    (
        "Here is a csv, can you describe it ?",
        [
            Attachment(
                content="https://raw.githubusercontent.com/meta-llama/llama-stack-apps/main/examples/resources/inflation.csv",
                mime_type="text/csv",
            )
        ],
    ),
    ("Which year ended with the highest inflation ?", None),
    (
        "What macro economic situations that led to such high inflation in that period?",
        None,
    ),
    ("Plot average yearly inflation as a time series", None),
]

for prompt in user_prompts:
    cprint(f'User> {prompt}', 'green')
    response = codex_agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt[0],
            }
        ],
        attachments=prompt[1],
        session_id=session_id,
    )
    # for chunk in response:
    #     print(chunk)

    for log in EventLogger().log(response):
        log.print()


User> ('Here is a csv, can you describe it ?', [Attachment(content='https://raw.githubusercontent.com/meta-llama/llama-stack-apps/main/examples/resources/inflation.csv', mime_type='test/csv')])


INFO:httpx:HTTP Request: GET https://raw.githubusercontent.com/meta-llama/llama-stack-apps/main/examples/resources/inflation.csv "HTTP/1.1 200 OK"


inference> import pandas as pd

df = pd.read_csv('/tmp/tmpxpfv53gh/3Dv9UVZ0inflation.csv')

print(df.head())
tool_execution> Tool:code_interpreter Args:{'code': "import pandas as pd\n\ndf = pd.read_csv('/tmp/tmpxpfv53gh/3Dv9UVZ0inflation.csv')\n\nprint(df.head())"}
tool_execution> Tool:code_interpreter Response:completed
[stdout]
Year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
0  2014  1.6  1.6  1.7  1.8  2.0  1.9  1.9  1.7  1.7  1.8  1.7  1.6
1  2015  1.6  1.7  1.8  1.8  1.7  1.8  1.8  1.8  1.9  1.9  2.0  2.1
2  2016  2.2  2.3  2.2  2.1  2.2  2.2  2.2  2.3  2.2  2.1  2.1  2.2
3  2017  2.3  2.2  2.0  1.9  1.7  1.7  1.7  1.7  1.7  1.8  1.7  1.8
4  2018  1.8  1.8  2.1  2.1  2.2  2.3  2.4  2.2  2.2  2.1  2.2  2.2
[/stdout]
shield_call> No Violation
inference> The CSV file contains information about inflation rates for each month from 2014 to 2018. The data is organized into a table with 4 columns (Year, Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec) and 5 row
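
For reference, the yearly averages requested in the last prompt can be checked by hand. A sketch using only the five rows printed by `df.head()` above (the full file may contain more years), with the standard `csv` module in place of pandas:

```python
import csv
import io
from statistics import mean

# The five rows shown in the stdout above, rewritten as CSV text.
CSV_TEXT = """Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
2014,1.6,1.6,1.7,1.8,2.0,1.9,1.9,1.7,1.7,1.8,1.7,1.6
2015,1.6,1.7,1.8,1.8,1.7,1.8,1.8,1.8,1.9,1.9,2.0,2.1
2016,2.2,2.3,2.2,2.1,2.2,2.2,2.2,2.3,2.2,2.1,2.1,2.2
2017,2.3,2.2,2.0,1.9,1.7,1.7,1.7,1.7,1.7,1.8,1.7,1.8
2018,1.8,1.8,2.1,2.1,2.2,2.3,2.4,2.2,2.2,2.1,2.2,2.2
"""


def yearly_averages(csv_text: str) -> dict:
    """Map each year to the mean of its twelve monthly values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    averages = {}
    for row in reader:
        year = int(row.pop("Year"))
        averages[year] = mean(float(value) for value in row.values())
    return averages


averages = yearly_averages(CSV_TEXT)
for year, value in averages.items():
    print(year, round(value, 2))

highest = max(averages, key=averages.get)
print("highest average among these rows:", highest)
```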

## 3. Llama Stack Agent Evaluations

Llama Stack offers `/eval` and `/scoring` APIs, which allow you to run evaluations on generated responses.

In [None]:
import rich
from rich.pretty import pprint

judge_model_id = "meta-llama/Llama-3.1-405B-Instruct-FP8"

JUDGE_PROMPT = """
Given a QUESTION and GENERATED_RESPONSE and EXPECTED_RESPONSE.

Compare the factual content of the GENERATED_RESPONSE with the EXPECTED_RESPONSE. Ignore any differences in style, grammar, or punctuation.
  The GENERATED_RESPONSE may either be a subset or superset of the EXPECTED_RESPONSE, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
  (A) The GENERATED_RESPONSE is a subset of the EXPECTED_RESPONSE and is fully consistent with it.
  (B) The GENERATED_RESPONSE is a superset of the EXPECTED_RESPONSE and is fully consistent with it.
  (C) The GENERATED_RESPONSE contains all the same details as the EXPECTED_RESPONSE.
  (D) There is a disagreement between the GENERATED_RESPONSE and the EXPECTED_RESPONSE.
  (E) The answers differ, but these differences don't matter from the perspective of factuality.

Give your answer in the format "Answer: One of ABCDE, Explanation: ".

Your actual task:

QUESTION: {input_query}
GENERATED_RESPONSE: {generated_answer}
EXPECTED_RESPONSE: {expected_answer}
"""

input_query = "What are the top 5 topics that were explained? Only list succinct bullet points."
generated_answer = """
Here are the top 5 topics that were explained in the documentation for Torchtune:

* What is LoRA and how does it work?
* Fine-tuning with LoRA: memory savings and parameter-efficient finetuning
* Running a LoRA finetune with Torchtune: overview and recipe
* Experimenting with different LoRA configurations: rank, alpha, and attention modules
* LoRA finetuning
"""
expected_answer = """LoRA"""

rows = [
    {
        "input_query": input_query,
        "generated_answer": generated_answer,
        "expected_answer": expected_answer,
    },
]

scoring_params = {
    "llm-as-judge::base": {
        "judge_model": judge_model_id,
        "prompt_template": JUDGE_PROMPT,
        "type": "llm_as_judge",
        "judge_score_regexes": ["Answer: (A|B|C|D|E)"],
    },
    "basic::subset_of": None,
}

response = client.scoring.score(input_rows=rows, scoring_functions=scoring_params)
pprint(response)
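
The `judge_score_regexes` entry suggests that the letter grade is pulled out of the judge model's reply with a regular expression. A sketch of that extraction with Python's `re` module; how Llama Stack applies the pattern internally is an assumption here, and the sample reply is hypothetical, written only to exercise the pattern:

```python
import re

# Same pattern as in scoring_params above.
JUDGE_SCORE_REGEX = r"Answer: (A|B|C|D|E)"


def extract_score(judge_output: str):
    """Return the letter captured by the pattern, or None when nothing matches."""
    match = re.search(JUDGE_SCORE_REGEX, judge_output)
    return match.group(1) if match else None


# Hypothetical judge reply, not real model output.
sample = "Answer: B, Explanation: The generated response covers LoRA and adds detail."
print(extract_score(sample))  # prints B
print(extract_score("No verdict given"))  # prints None
```

This is why the prompt pins the reply format to `Answer: One of ABCDE, Explanation: `: a reply that deviates from it yields no score.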