mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-12-31 11:03:54 +00:00
pre-commit fixes
This commit is contained in:
parent
967dd0aa08
commit
7e211f8553
314 changed files with 5574 additions and 11369 deletions
|
|
@ -14,18 +14,16 @@ Agents are configured using the `AgentConfig` class, which includes:
|
|||
- **Safety Shields**: Guardrails to ensure responsible AI behavior
|
||||
|
||||
```python
|
||||
from llama_stack_client.types.agent_create_params import AgentConfig
|
||||
from llama_stack_client.lib.agents.agent import Agent
|
||||
|
||||
# Configure an agent
|
||||
agent_config = AgentConfig(
|
||||
model="meta-llama/Llama-3-70b-chat",
|
||||
instructions="You are a helpful assistant that can use tools to answer questions.",
|
||||
toolgroups=["builtin::code_interpreter", "builtin::rag/knowledge_search"],
|
||||
)
|
||||
|
||||
# Create the agent
|
||||
agent = Agent(llama_stack_client, agent_config)
|
||||
agent = Agent(
|
||||
llama_stack_client,
|
||||
model="meta-llama/Llama-3-70b-chat",
|
||||
instructions="You are a helpful assistant that can use tools to answer questions.",
|
||||
tools=["builtin::code_interpreter", "builtin::rag/knowledge_search"],
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Sessions
|
||||
|
|
|
|||
|
|
@ -70,18 +70,18 @@ Each step in this process can be monitored and controlled through configurations
|
|||
from llama_stack_client import LlamaStackClient
|
||||
from llama_stack_client.lib.agents.agent import Agent
|
||||
from llama_stack_client.lib.agents.event_logger import EventLogger
|
||||
from llama_stack_client.types.agent_create_params import AgentConfig
|
||||
from rich.pretty import pprint
|
||||
|
||||
# Replace host and port
|
||||
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
|
||||
|
||||
agent_config = AgentConfig(
|
||||
agent = Agent(
|
||||
client,
|
||||
# Check with `llama-stack-client models list`
|
||||
model="Llama3.2-3B-Instruct",
|
||||
instructions="You are a helpful assistant",
|
||||
# Enable both RAG and tool usage
|
||||
toolgroups=[
|
||||
tools=[
|
||||
{
|
||||
"name": "builtin::rag/knowledge_search",
|
||||
"args": {"vector_db_ids": ["my_docs"]},
|
||||
|
|
@ -98,8 +98,6 @@ agent_config = AgentConfig(
|
|||
"max_tokens": 2048,
|
||||
},
|
||||
)
|
||||
|
||||
agent = Agent(client, agent_config)
|
||||
session_id = agent.create_session("monitored_session")
|
||||
|
||||
# Stream the agent's execution steps
|
||||
|
|
|
|||
|
|
@ -1,169 +1,127 @@
|
|||
# Evals
|
||||
# Evaluations
|
||||
|
||||
[](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing)
|
||||
The Llama Stack provides a set of APIs in Llama Stack for supporting running evaluations of LLM applications.
|
||||
- `/datasetio` + `/datasets` API
|
||||
- `/scoring` + `/scoring_functions` API
|
||||
- `/eval` + `/benchmarks` API
|
||||
|
||||
Llama Stack provides the building blocks needed to run benchmark and application evaluations. This guide will walk you through how to use these components to run open benchmark evaluations. Visit our [Evaluation Concepts](../concepts/evaluation_concepts.md) guide for more details on how evaluations work in Llama Stack, and our [Evaluation Reference](../references/evals_reference/index.md) guide for a comprehensive reference on the APIs.
|
||||
|
||||
### 1. Open Benchmark Model Evaluation
|
||||
|
||||
This first example walks you through how to evaluate a model candidate served by Llama Stack on open benchmarks. We will use the following benchmark:
|
||||
- [MMMU](https://arxiv.org/abs/2311.16502) (A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI): Benchmark designed to evaluate multimodal models.
|
||||
- [SimpleQA](https://openai.com/index/introducing-simpleqa/): Benchmark designed to access models to answer short, fact-seeking questions.
|
||||
This guides walks you through the process of evaluating an LLM application built using Llama Stack. Checkout the [Evaluation Reference](../references/evals_reference/index.md) guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for benchmark and application use cases. Checkout our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
|
||||
|
||||
#### 1.1 Running MMMU
|
||||
- We will use a pre-processed MMMU dataset from [llamastack/mmmu](https://huggingface.co/datasets/llamastack/mmmu). The preprocessing code is shown in in this [Github Gist](https://gist.github.com/yanxi0830/118e9c560227d27132a7fd10e2c92840). The dataset is obtained by transforming the original [MMMU/MMMU](https://huggingface.co/datasets/MMMU/MMMU) dataset into correct format by `inference/chat-completion` API.
|
||||
|
||||
## Application Evaluation
|
||||
|
||||
[](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb)
|
||||
|
||||
Llama Stack offers a library of scoring functions and the `/scoring` API, allowing you to run evaluations on your pre-annotated AI application datasets.
|
||||
|
||||
In this example, we will show you how to:
|
||||
1. Build an Agent with Llama Stack
|
||||
2. Query the agent's sessions, turns, and steps
|
||||
3. Evaluate the results.
|
||||
|
||||
##### Building a Search Agent
|
||||
```python
|
||||
import datasets
|
||||
from llama_stack_client import LlamaStackClient
|
||||
from llama_stack_client.lib.agents.agent import Agent
|
||||
from llama_stack_client.lib.agents.event_logger import EventLogger
|
||||
|
||||
ds = datasets.load_dataset(path="llamastack/mmmu", name="Agriculture", split="dev")
|
||||
ds = ds.select_columns(["chat_completion_input", "input_query", "expected_answer"])
|
||||
eval_rows = ds.to_pandas().to_dict(orient="records")
|
||||
```
|
||||
client = LlamaStackClient(base_url=f"http://{HOST}:{PORT}")
|
||||
|
||||
- Next, we will run evaluation on an model candidate, we will need to:
|
||||
- Define a system prompt
|
||||
- Define an EvalCandidate
|
||||
- Run evaluate on the dataset
|
||||
|
||||
```python
|
||||
SYSTEM_PROMPT_TEMPLATE = """
|
||||
You are an expert in Agriculture whose job is to answer questions from the user using images.
|
||||
First, reason about the correct answer.
|
||||
Then write the answer in the following format where X is exactly one of A,B,C,D:
|
||||
Answer: X
|
||||
Make sure X is one of A,B,C,D.
|
||||
If you are uncertain of the correct answer, guess the most likely one.
|
||||
"""
|
||||
|
||||
system_message = {
|
||||
"role": "system",
|
||||
"content": SYSTEM_PROMPT_TEMPLATE,
|
||||
}
|
||||
|
||||
client.benchmarks.register(
|
||||
benchmark_id="meta-reference::mmmu",
|
||||
dataset_id=f"mmmu-{subset}-{split}",
|
||||
scoring_functions=["basic::regex_parser_multiple_choice_answer"],
|
||||
agent = Agent(
|
||||
client,
|
||||
model="meta-llama/Llama-3.3-70B-Instruct",
|
||||
instructions="You are a helpful assistant. Use search tool to answer the questions. ",
|
||||
tools=["builtin::websearch"],
|
||||
)
|
||||
user_prompts = [
|
||||
"Which teams played in the NBA Western Conference Finals of 2024. Search the web for the answer.",
|
||||
"In which episode and season of South Park does Bill Cosby (BSM-471) first appear? Give me the number and title. Search the web for the answer.",
|
||||
"What is the British-American kickboxer Andrew Tate's kickboxing name? Search the web for the answer.",
|
||||
]
|
||||
|
||||
response = client.eval.evaluate_rows(
|
||||
benchmark_id="meta-reference::mmmu",
|
||||
input_rows=eval_rows,
|
||||
scoring_functions=["basic::regex_parser_multiple_choice_answer"],
|
||||
task_config={
|
||||
"type": "benchmark",
|
||||
"eval_candidate": {
|
||||
"type": "model",
|
||||
"model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
|
||||
"sampling_params": {
|
||||
"strategy": {
|
||||
"type": "greedy",
|
||||
},
|
||||
"max_tokens": 4096,
|
||||
"repeat_penalty": 1.0,
|
||||
},
|
||||
"system_message": system_message,
|
||||
},
|
||||
},
|
||||
)
|
||||
```
|
||||
session_id = agent.create_session("test-session")
|
||||
|
||||
#### 1.2. Running SimpleQA
|
||||
- We will use a pre-processed SimpleQA dataset from [llamastack/evals](https://huggingface.co/datasets/llamastack/evals/viewer/evals__simpleqa) which is obtained by transforming the input query into correct format accepted by `inference/chat-completion` API.
|
||||
- Since we will be using this same dataset in our next example for Agentic evaluation, we will register it using the `/datasets` API, and interact with it through `/datasetio` API.
|
||||
for prompt in user_prompts:
|
||||
response = agent.create_turn(
|
||||
messages=[
|
||||
{
|
||||
"role": "user",
|
||||
"content": prompt,
|
||||
}
|
||||
],
|
||||
session_id=session_id,
|
||||
)
|
||||
|
||||
```python
|
||||
simpleqa_dataset_id = "huggingface::simpleqa"
|
||||
|
||||
_ = client.datasets.register(
|
||||
dataset_id=simpleqa_dataset_id,
|
||||
provider_id="huggingface",
|
||||
url={"uri": "https://huggingface.co/datasets/llamastack/evals"},
|
||||
metadata={
|
||||
"path": "llamastack/evals",
|
||||
"name": "evals__simpleqa",
|
||||
"split": "train",
|
||||
},
|
||||
dataset_schema={
|
||||
"input_query": {"type": "string"},
|
||||
"expected_answer": {"type": "string"},
|
||||
"chat_completion_input": {"type": "chat_completion_input"},
|
||||
},
|
||||
)
|
||||
|
||||
eval_rows = client.datasetio.get_rows_paginated(
|
||||
dataset_id=simpleqa_dataset_id,
|
||||
rows_in_page=5,
|
||||
)
|
||||
```
|
||||
|
||||
```python
|
||||
client.benchmarks.register(
|
||||
benchmark_id="meta-reference::simpleqa",
|
||||
dataset_id=simpleqa_dataset_id,
|
||||
scoring_functions=["llm-as-judge::405b-simpleqa"],
|
||||
)
|
||||
|
||||
response = client.eval.evaluate_rows(
|
||||
benchmark_id="meta-reference::simpleqa",
|
||||
input_rows=eval_rows.rows,
|
||||
scoring_functions=["llm-as-judge::405b-simpleqa"],
|
||||
task_config={
|
||||
"type": "benchmark",
|
||||
"eval_candidate": {
|
||||
"type": "model",
|
||||
"model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
|
||||
"sampling_params": {
|
||||
"strategy": {
|
||||
"type": "greedy",
|
||||
},
|
||||
"max_tokens": 4096,
|
||||
"repeat_penalty": 1.0,
|
||||
},
|
||||
},
|
||||
},
|
||||
)
|
||||
for log in EventLogger().log(response):
|
||||
log.print()
|
||||
```
|
||||
|
||||
|
||||
### 2. Agentic Evaluation
|
||||
- In this example, we will demonstrate how to evaluate a agent candidate served by Llama Stack via `/agent` API.
|
||||
- We will continue to use the SimpleQA dataset we used in previous example.
|
||||
- Instead of running evaluation on model, we will run the evaluation on a Search Agent with access to search tool. We will define our agent evaluation candidate through `AgentConfig`.
|
||||
##### Query Agent Execution Steps
|
||||
|
||||
Now, let's look deeper into the agent's execution steps and see if how well our agent performs.
|
||||
```python
|
||||
# query the agents session
|
||||
from rich.pretty import pprint
|
||||
|
||||
session_response = client.agents.session.retrieve(
|
||||
session_id=session_id,
|
||||
agent_id=agent.agent_id,
|
||||
)
|
||||
|
||||
pprint(session_response)
|
||||
```
|
||||
|
||||
As a sanity check, we will first check if all user prompts is followed by a tool call to `brave_search`.
|
||||
```python
|
||||
num_tool_call = 0
|
||||
for turn in session_response.turns:
|
||||
for step in turn.steps:
|
||||
if (
|
||||
step.step_type == "tool_execution"
|
||||
and step.tool_calls[0].tool_name == "brave_search"
|
||||
):
|
||||
num_tool_call += 1
|
||||
|
||||
print(
|
||||
f"{num_tool_call}/{len(session_response.turns)} user prompts are followed by a tool call to `brave_search`"
|
||||
)
|
||||
```
|
||||
|
||||
##### Evaluate Agent Responses
|
||||
Now, we want to evaluate the agent's responses to the user prompts.
|
||||
|
||||
1. First, we will process the agent's execution history into a list of rows that can be used for evaluation.
|
||||
2. Next, we will label the rows with the expected answer.
|
||||
3. Finally, we will use the `/scoring` API to score the agent's responses.
|
||||
|
||||
```python
|
||||
agent_config = {
|
||||
"model": "meta-llama/Llama-3.1-405B-Instruct",
|
||||
"instructions": "You are a helpful assistant",
|
||||
"sampling_params": {
|
||||
"strategy": {
|
||||
"type": "greedy",
|
||||
},
|
||||
},
|
||||
"tools": [
|
||||
eval_rows = []
|
||||
|
||||
expected_answers = [
|
||||
"Dallas Mavericks and the Minnesota Timberwolves",
|
||||
"Season 4, Episode 12",
|
||||
"King Cobra",
|
||||
]
|
||||
|
||||
for i, turn in enumerate(session_response.turns):
|
||||
eval_rows.append(
|
||||
{
|
||||
"type": "brave_search",
|
||||
"engine": "tavily",
|
||||
"api_key": userdata.get("TAVILY_SEARCH_API_KEY"),
|
||||
"input_query": turn.input_messages[0].content,
|
||||
"generated_answer": turn.output_message.content,
|
||||
"expected_answer": expected_answers[i],
|
||||
}
|
||||
],
|
||||
"tool_choice": "auto",
|
||||
"input_shields": [],
|
||||
"output_shields": [],
|
||||
"enable_session_persistence": False,
|
||||
}
|
||||
)
|
||||
|
||||
response = client.eval.evaluate_rows(
|
||||
benchmark_id="meta-reference::simpleqa",
|
||||
input_rows=eval_rows.rows,
|
||||
scoring_functions=["llm-as-judge::405b-simpleqa"],
|
||||
task_config={
|
||||
"type": "benchmark",
|
||||
"eval_candidate": {
|
||||
"type": "agent",
|
||||
"config": agent_config,
|
||||
},
|
||||
},
|
||||
pprint(eval_rows)
|
||||
|
||||
scoring_params = {
|
||||
"basic::subset_of": None,
|
||||
}
|
||||
scoring_response = client.scoring.score(
|
||||
input_rows=eval_rows, scoring_functions=scoring_params
|
||||
)
|
||||
pprint(scoring_response)
|
||||
```
|
||||
|
|
|
|||
|
|
@ -1,30 +0,0 @@
|
|||
## Testing & Evaluation
|
||||
|
||||
Llama Stack provides built-in tools for evaluating your applications:
|
||||
|
||||
1. **Benchmarking**: Test against standard datasets
|
||||
2. **Application Evaluation**: Score your application's outputs
|
||||
3. **Custom Metrics**: Define your own evaluation criteria
|
||||
|
||||
Here's how to set up basic evaluation:
|
||||
|
||||
```python
|
||||
# Create an evaluation task
|
||||
response = client.benchmarks.register(
|
||||
benchmark_id="my_eval",
|
||||
dataset_id="my_dataset",
|
||||
scoring_functions=["accuracy", "relevance"],
|
||||
)
|
||||
|
||||
# Run evaluation
|
||||
job = client.eval.run_eval(
|
||||
benchmark_id="my_eval",
|
||||
task_config={
|
||||
"type": "app",
|
||||
"eval_candidate": {"type": "agent", "config": agent_config},
|
||||
},
|
||||
)
|
||||
|
||||
# Get results
|
||||
result = client.eval.job_result(benchmark_id="my_eval", job_id=job.job_id)
|
||||
```
|
||||
|
|
@ -20,6 +20,11 @@ We may add more storage types like Graph IO in the future.
|
|||
Here's how to set up a vector database for RAG:
|
||||
|
||||
```python
|
||||
# Create http client
|
||||
from llama_stack_client import LlamaStackClient
|
||||
|
||||
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
|
||||
|
||||
# Register a vector db
|
||||
vector_db_id = "my_documents"
|
||||
response = client.vector_dbs.register(
|
||||
|
|
@ -81,15 +86,14 @@ results = client.tool_runtime.rag_tool.query(
|
|||
One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:
|
||||
|
||||
```python
|
||||
from llama_stack_client.types.agent_create_params import AgentConfig
|
||||
from llama_stack_client.lib.agents.agent import Agent
|
||||
|
||||
# Configure agent with memory
|
||||
agent_config = AgentConfig(
|
||||
# Create agent with memory
|
||||
agent = Agent(
|
||||
client,
|
||||
model="meta-llama/Llama-3.3-70B-Instruct",
|
||||
instructions="You are a helpful assistant",
|
||||
enable_session_persistence=False,
|
||||
toolgroups=[
|
||||
tools=[
|
||||
{
|
||||
"name": "builtin::rag/knowledge_search",
|
||||
"args": {
|
||||
|
|
@ -98,8 +102,6 @@ agent_config = AgentConfig(
|
|||
}
|
||||
],
|
||||
)
|
||||
|
||||
agent = Agent(client, agent_config)
|
||||
session_id = agent.create_session("rag_session")
|
||||
|
||||
|
||||
|
|
@ -122,7 +124,7 @@ response = agent.create_turn(
|
|||
],
|
||||
documents=[
|
||||
{
|
||||
"content": "https://raw.githubusercontent.com/example/doc.rst",
|
||||
"content": "https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/memory_optimizations.rst",
|
||||
"mime_type": "text/plain",
|
||||
}
|
||||
],
|
||||
|
|
@ -136,6 +138,14 @@ response = agent.create_turn(
|
|||
)
|
||||
```
|
||||
|
||||
You can print the response with below.
|
||||
```python
|
||||
from llama_stack_client.lib.agents.event_logger import EventLogger
|
||||
|
||||
for log in EventLogger().log(response):
|
||||
log.print()
|
||||
```
|
||||
|
||||
### Unregistering Vector DBs
|
||||
|
||||
If you need to clean up and unregister vector databases, you can do so as follows:
|
||||
|
|
|
|||
|
|
@ -5,7 +5,7 @@ An example of this would be a "db_access" tool group that contains tools for int
|
|||
|
||||
Tools are treated as any other resource in llama stack like models. You can register them, have providers for them etc.
|
||||
|
||||
When instatiating an agent, you can provide it a list of tool groups that it has access to. Agent gets the corresponding tool definitions for the specified tool groups and passes them along to the model.
|
||||
When instantiating an agent, you can provide it a list of tool groups that it has access to. Agent gets the corresponding tool definitions for the specified tool groups and passes them along to the model.
|
||||
|
||||
Refer to the [Building AI Applications](https://github.com/meta-llama/llama-stack/blob/main/docs/getting_started.ipynb) notebook for more examples on how to use tools.
|
||||
|
||||
|
|
@ -60,7 +60,7 @@ Features:
|
|||
- Disabled dangerous system operations
|
||||
- Configurable execution timeouts
|
||||
|
||||
> ⚠️ Important: The code interpreter tool can operate in a controlled enviroment locally or on Podman containers. To ensure proper functionality in containerised environments:
|
||||
> ⚠️ Important: The code interpreter tool can operate in a controlled environment locally or on Podman containers. To ensure proper functionality in containerized environments:
|
||||
> - The container requires privileged access (e.g., --privileged).
|
||||
> - Users without sufficient permissions may encounter permission errors. (`bwrap: Can't mount devpts on /newroot/dev/pts: Permission denied`)
|
||||
> - 🔒 Security Warning: Privileged mode grants elevated access and bypasses security restrictions. Use only in local, isolated, or controlled environments.
|
||||
|
|
@ -127,15 +127,11 @@ MCP tools require:
|
|||
|
||||
## Adding Custom Tools
|
||||
|
||||
When you want to use tools other than the built-in tools, you can implement a python function and decorate it with `@client_tool`.
|
||||
When you want to use tools other than the built-in tools, you just need to implement a python function with a docstring. The content of the docstring will be used to describe the tool and the parameters and passed
|
||||
along to the generative model.
|
||||
|
||||
To define a custom tool, you need to use the `@client_tool` decorator.
|
||||
```python
|
||||
from llama_stack_client.lib.agents.client_tool import client_tool
|
||||
|
||||
|
||||
# Example tool definition
|
||||
@client_tool
|
||||
def my_tool(input: int) -> int:
|
||||
"""
|
||||
Runs my awesome tool.
|
||||
|
|
@ -149,15 +145,7 @@ def my_tool(input: int) -> int:
|
|||
Once defined, simply pass the tool to the agent config. `Agent` will take care of the rest (calling the model with the tool definition, executing the tool, and returning the result to the model for the next iteration).
|
||||
```python
|
||||
# Example agent config with client provided tools
|
||||
client_tools = [
|
||||
my_tool,
|
||||
]
|
||||
|
||||
agent_config = AgentConfig(
|
||||
...,
|
||||
client_tools=[client_tool.get_tool_definition() for client_tool in client_tools],
|
||||
)
|
||||
agent = Agent(client, agent_config, client_tools)
|
||||
agent = Agent(client, ..., tools=[my_tool])
|
||||
```
|
||||
|
||||
Refer to [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py) for an example of how to use client provided tools.
|
||||
|
|
@ -194,10 +182,10 @@ group_tools = client.tools.list_tools(toolgroup_id="search_tools")
|
|||
|
||||
```python
|
||||
from llama_stack_client.lib.agents.agent import Agent
|
||||
from llama_stack_client.types.agent_create_params import AgentConfig
|
||||
|
||||
# Configure the AI agent with necessary parameters
|
||||
agent_config = AgentConfig(
|
||||
# Instantiate the AI agent with the given configuration
|
||||
agent = Agent(
|
||||
client,
|
||||
name="code-interpreter",
|
||||
description="A code interpreter agent for executing Python code snippets",
|
||||
instructions="""
|
||||
|
|
@ -205,14 +193,10 @@ agent_config = AgentConfig(
|
|||
Always show the generated code, never generate your own code, and never anticipate results.
|
||||
""",
|
||||
model="meta-llama/Llama-3.2-3B-Instruct",
|
||||
toolgroups=["builtin::code_interpreter"],
|
||||
tools=["builtin::code_interpreter"],
|
||||
max_infer_iters=5,
|
||||
enable_session_persistence=False,
|
||||
)
|
||||
|
||||
# Instantiate the AI agent with the given configuration
|
||||
agent = Agent(client, agent_config)
|
||||
|
||||
# Start a session
|
||||
session_id = agent.create_session("tool_session")
|
||||
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue