Commit graph

66 commits

Author SHA1 Message Date
ehhuang
a505bf45a3
feat(api): remove tool_name from ToolResponseMessage (#1599)
Summary:
This is not used anywhere.

closes #1421 

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct --record-responses
2025-03-12 19:41:48 -07:00
ehhuang
ed6caead72
chore: simplify _get_tool_defs (#1384)
Summary:

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-12 18:51:18 -07:00
ehhuang
41c9bca1aa
chore: refactor Agent toolgroup processing (#1381)
Summary:
Refactoring only.

Centralize logic to preprocess toolgroup to one place. 

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/api/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1381).
* #1384
* __->__ #1381
2025-03-12 18:48:03 -07:00
ehhuang
b7a9c45477
chore: deprecate ToolResponseMessage in agent.resume API (#1566)
# Summary:
closes #1431 

# Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-12 12:10:21 -07:00
Dinesh Yeduguru
ead9397e22
fix: tracing fixes for trace context propogation across coroutines (#1522)
# What does this PR do?
This PR has two fixes needed for correct trace context propagation
across asycnio boundary
Fix 1: Start using context vars to store the global trace context.
This is needed since we cannot use the same trace context across
coroutines since the state is shared. each coroutine
should have its own trace context so that each of it can start storing
its state correctly.
Fix 2: Start a new span for each new coroutines started for running
shields to keep the span tree clean


## Test Plan

### Integration tests with server
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run
~/.llama/distributions/together/together-run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
server logs:
https://gist.github.com/dineshyv/51ac5d9864ed031d0d89ce77352821fe
test logs:
https://gist.github.com/dineshyv/e66acc1c4648a42f1854600609c467f3
 
### Integration tests with library client
LLAMA_STACK_CONFIG=fireworks pytest -s --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct

logs: https://gist.github.com/dineshyv/ca160696a0b167223378673fb1dcefb8

### Apps test with server:
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
```
server logs:
https://gist.github.com/dineshyv/1717a572d8f7c14279c36123b79c5797
app logs:
https://gist.github.com/dineshyv/44167e9f57806a0ba3b710c32aec02f8
2025-03-11 07:12:48 -07:00
ehhuang
23e39cc3c4
fix: handle log errors (#1499)
Summary:
| File
"/Users/erichuang/projects/llama-stack/llama_stack/distribution/server/server.py",
line 213, in sse_generator
    |     logger.exception(f"Error in sse_generator: {e}")
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1864, in exception
    |     self.log(ERROR, msg, *args, exc_info=exc_info, **kwargs)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1879, in log
    |     self.logger.log(level, msg, *args, **kwargs)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1547, in log
    |     self._log(level, msg, args, **kwargs)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1624, in _log
    |     self.handle(record)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1634, in handle
    |     self.callHandlers(record)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1696, in callHandlers
    |     hdlr.handle(record)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 968, in handle
    |     self.emit(record)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py",
line 167, in emit
    |     message_renderable = self.render_message(record, message)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py",
line 193, in render_message
| message_text = Text.from_markup(message) if use_markup else
Text(message)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py",
line 287, in from_markup
| rendered_text = render(text, style, emoji=emoji,
emoji_variant=emoji_variant)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py",
line 167, in render
    |     raise MarkupError(
| rich.errors.MarkupError: closing tag '[/INST]' at position 105 doesn't
match any open tag


Test Plan:
reran failing rag_with_vector_db example
2025-03-07 15:58:26 -08:00
ehhuang
acbae66b9d
chore: escape tool output for logging (#1490)
Summary:

error:


llama_stack/providers/inline/agents/meta_reference/agent_instance.py:1032:
in execute_tool_call_maybe
    logger.info(f"tool call {name} completed with result: {result}")

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1841:
in info
    self.log(INFO, msg, *args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1879:
in log
    self.logger.log(level, msg, *args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1547:
in log
    self._log(level, msg, args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1624:
in _log
    self.handle(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1634:
in handle
    self.callHandlers(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1696:
in callHandlers
    hdlr.handle(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:968:
in handle
    self.emit(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:167:
in emit
    message_renderable = self.render_message(record, message)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:193:
in render_message
message_text = Text.from_markup(message) if use_markup else
Text(message)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py:287:
in from_markup
rendered_text = render(text, style, emoji=emoji,
emoji_variant=emoji_variant)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py:167:
in render
    raise MarkupError(
E rich.errors.MarkupError: closing tag '[/INST]' at position 3274
doesn't match any open tag

Test Plan:
2025-03-07 13:33:45 -08:00
Sébastien Han
7cf1e24c4e
feat(logging): implement category-based logging (#1362)
# What does this PR do?

This commit introduces a new logging system that allows loggers to be
assigned
a category while retaining the logger name based on the file name. The
log
format includes both the logger name and the category, producing output
like:

```
INFO     2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by
         tavily-search
```

Key features include:

- Category-based logging: Loggers can be assigned a category (e.g.,
  "core", "server") when programming. The logger can be loaded like
  this: `logger = get_logger(name=__name__, category="server")`
- Environment variable control: Log levels can be configured
per-category using the
  `LLAMA_STACK_LOGGING` environment variable. For example:
`LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for
the "server"
    and "core" categories.
- `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all
categories and
    third-party libraries.

This provides fine-grained control over logging levels while maintaining
a clean and
informative log format.

The formatter uses the rich library which provides nice colors better
stack traces like so:

```
ERROR    2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown
         task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at
         /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146>
         exception=UnboundLocalError("local variable 'loop' referenced before assignment")>
         ╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮
         │ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown                │
         │                                                                                                                │
         │   175 │   │   except asyncio.CancelledError:                                                                   │
         │   176 │   │   │   pass                                                                                         │
         │   177 │   │   finally:                                                                                         │
         │ ❱ 178 │   │   │   loop.stop()                                                                                  │
         │   179 │                                                                                                        │
         │   180 │   loop = asyncio.get_running_loop()                                                                    │
         │   181 │   loop.create_task(shutdown())                                                                         │
         ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
         UnboundLocalError: local variable 'loop' referenced before assignment
```

Co-authored-by: Ashwin Bharambe <@ashwinb>
Signed-off-by: Sébastien Han <seb@redhat.com>

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```
python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml
INFO     2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml           
INFO     2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration:                                                 
INFO     2025-03-03 21:55:35,928 __main__:380 [server]: apis:                                                              
         - agents                                                     
``` 
[//]: # (## Documentation)

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-07 11:34:30 -08:00
ehhuang
6cf79437b3
feat: support ClientTool output metadata (#1426)
# Summary:
Client side change in
https://github.com/meta-llama/llama-stack-client-python/pull/180
Changes the resume_turn API to accept `ToolResponse` instead of
`ToolResponseMessage`:
1. `ToolResponse` contains `metadata`
2. `ToolResponseMessage` is a concept for model inputs. Here we are just
submitting the outputs of tool execution.

# Test Plan:
Ran integration tests with newly added test using client tool with
metadata

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --record-responses
2025-03-05 14:30:27 -08:00
Xi Yan
78962be996
chore: refactor create_and_execute_turn and resume_turn (#1399)
# What does this PR do?
- Closes https://github.com/meta-llama/llama-stack/issues/1212

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/integration/agents/test_agents.py --inference-model "meta-llama/Llama-3.3-70B-Instruct"
```
<img width="1203" alt="image"
src="https://github.com/user-attachments/assets/35b60017-b3f2-4e98-87f2-2868730261bd"
/>

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/agents/test_agents.py::test_rag_and_code_agent --inference-model "meta-llama/Llama-3.3-70B-Instruct"
```

[//]: # (## Documentation)
2025-03-04 16:07:30 -08:00
ehhuang
fd8c991393
fix: rag as attachment bug (#1392)
Summary:

Test Plan:
added new test
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/api/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B
2025-03-04 13:08:16 -08:00
Xi Yan
158b6dc404
chore: deprecate allow_turn_resume (#1377)
# What does this PR do?

- Deprecate allow_turn_resume flag as this is used for staying backward
compat.
- Closes https://github.com/meta-llama/llama-stack/issues/1363

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/api/agents/test_agents.py --inference-model "meta-llama/Llama-3.3-70B-Instruct" --record-responses
```

<img width="1054" alt="image"
src="https://github.com/user-attachments/assets/d31de2d4-0953-41e1-a71a-7e1579fa351a"
/>


[//]: # (## Documentation)
2025-03-04 12:22:11 -08:00
ehhuang
07a992ef90
feat: deterministic tools ordering (#1380)
Summary:

1. The `tools` parameter we construct to pass the inference API is
non-deterministic. As a result, our recordable mocks is flaky as the
ordering change sometimes. This PR makes it so that `tools` ordering is
deterministic and aligned with the order user specified.
2. In recordable mock key generation, client tool's parameter type was
'str' and now is 'string' for some reason. I didn't dig into exactly
why, but just regenerated the fixtures.

Test Plan:
Regenerate mocks:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --record-responses
```

Rerun tests without  --record-responses:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B
```
2025-03-03 20:38:07 -08:00
Ashwin Bharambe
0a76ece249 feat: add more logs to agent_instance.py 2025-03-03 16:15:47 -08:00
Xi Yan
7d111c7510
feat: unify max_infer_iters in client/server agent loop (#1309)
# What does this PR do?

We currently use `max_infer_iters` in 2 different ways
1/ Server: track number of times 
2/ Client side: track number of times we send `resume_turn` request

This PR gets rid of the need of (2) and makes server track total number
of times we perform inference within a Turn

**NOTE**
The PR will assume StopReason is set to
- end_of_message: turn is not finished, we could be waiting for client
tool call responses
- end_of_turn: if the entire turn is finished and there's no more things
to be done.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py::test_custom_tool_infinite_loop --inference-model "meta-llama/Llama-3.3-70B-Instruct"
```

[//]: # (## Documentation)
2025-03-03 10:08:36 -08:00
ehhuang
21ec67356c
fix: RAG with documents (#1337)
Summary:
This was broken by
https://github.com/meta-llama/llama-stack/pull/1015/files#r1975394190

Test Plan:

added e2e test
2025-02-28 16:51:00 -08:00
Sébastien Han
6fa257b475
chore(lint): update Ruff ignores for project conventions and maintainability (#1184)
- Added new ignores from flake8-bugbear (`B007`, `B008`)
- Ignored `C901` (high function complexity) for now, pending review
- Maintained PyTorch conventions (`N812`, `N817`)
- Allowed `E731` (lambda assignments) for flexibility
- Consolidated existing ignores (`E402`, `E501`, `F405`, `C408`, `N812`)
- Documented rationale for each ignored rule

This keeps our linting aligned with project needs while tracking
potential fixes.

Signed-off-by: Sébastien Han <seb@redhat.com>

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-28 09:36:49 -08:00
Hardik Shah
8efa53daf1
fix: Agent telemetry inputs/outputs should be structured (#1302)
Original telemetry outputs for agent turns look like this. 
Note: how output was a `str(message)` making it difficult to read them
back for downstream tasks ( eg. building eval datasets )
```
{
│   │   'input': [
│   │   │   '{"role":"system","content":"You are a helpful assistant. Use search tool to answer the questions. "}',
│   │   │   '{"role":"user","content":"Which teams played in the NBA western conference finals of 2024","context":null}'
│   │   ],
│   │   'output': "content:  tool_calls: [ToolCall(call_id='8b7294ec-a83f-4798-ad8f-6bed662f08b6', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'NBA Western Conference Finals 2024 teams'})]"
│   },
``` 

Updated the outputs to be structured .

## Test 

```python
import uuid

from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant who will use the web search tools to help with answering questions.\nOnly provide final answer in short without writing full sentences. Use web search",
    toolgroups=["builtin::websearch"],
    enable_session_persistence=True,
)

agent = Agent(client, agent_config)

session_id = agent.create_session(uuid.uuid4().hex)
response = agent.create_turn(
    messages=[
        {
            "role": "user",
            "content": "latest news about llama stack",
        }
    ],
    session_id=session_id,
    stream=False,
)

pprint(response)
```
Output: 
```
Turn(
│   input_messages=[UserMessage(content='latest news about llama stack', role='user', context=None)],
│   output_message=CompletionMessage(
│   │   content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│   │   role='assistant',
│   │   stop_reason='end_of_turn',
│   │   tool_calls=[]
│   ),
│   session_id='77379546-4598-485a-b4f4-84e5da28c513',
│   started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915243, tzinfo=TzInfo(-08:00)),
│   steps=[
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[
│   │   │   │   │   ToolCall(
│   │   │   │   │   │   arguments={'query': 'latest news llama stack'},
│   │   │   │   │   │   call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│   │   │   │   │   │   tool_name='brave_search'
│   │   │   │   │   )
│   │   │   │   ]
│   │   │   ),
│   │   │   step_id='81c16bd3-eb00-4721-8edc-f386e07391a3',
│   │   │   step_type='inference',
│   │   │   turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 637149, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915831, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ToolExecutionStep(
│   │   │   step_id='4782d609-a62e-45f5-8d2a-25a43db46288',
│   │   │   step_type='tool_execution',
│   │   │   tool_calls=[
│   │   │   │   ToolCall(
│   │   │   │   │   arguments={'query': 'latest news llama stack'},
│   │   │   │   │   call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│   │   │   │   │   tool_name='brave_search'
│   │   │   │   )
│   │   │   ],
│   │   │   tool_responses=[
│   │   │   │   ToolResponse(
│   │   │   │   │   call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│   │   │   │   │   content='{"query": "latest news llama stack", "top_k": [{"title": "Llama 3.2: Revol. .......  Hacker News.", "score": 0.6186197, "raw_content": null}]}',
│   │   │   │   │   tool_name='brave_search',
│   │   │   │   │   metadata=None
│   │   │   │   )
│   │   │   ],
│   │   │   turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 272176, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 640743, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[]
│   │   │   ),
│   │   │   step_id='37994419-5da3-4e84-a010-8d9b85366262',
│   │   │   step_type='inference',
│   │   │   turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 961275, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 273168, tzinfo=TzInfo(-08:00))
│   │   )
│   ],
│   turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│   completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 962318, tzinfo=TzInfo(-08:00)),
│   output_attachments=[]
)

```

## Check for Telemetry 
```python 

agent_logs = []
for span in client.telemetry.query_spans(
    attribute_filters=[
      {"key": "session_id", "op": "eq", "value": session_id},
    ],
    attributes_to_return=['input', 'output'],
):
    agent_logs.append(span.attributes)

pprint(json.loads(agent_logs[-1]['output']))
```
```
{
│   'content': "The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│   'tool_calls': []
}
```
2025-02-27 23:06:37 -08:00
ehhuang
a34f3aafcf
fix: don't include tool args not in the function definition (#1307)
# Summary:
Right now we would include toolgroup args when we encode messages with
tool_calls, which is confusing the model since they not in the function
description (see test plan for example).

# Test Plan:
Add a print statement before raw prompt is sent to providers (no good
way to test this currently)

Before:
```
cated in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood", vector_db_ids=["829a68735d744dc3830409dcc782964a"])]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found 5 chunks:\nBEGIN of
```
Note the extra `vector_db_ids`

After
```
>user<|end_header_id|>\n\nAre the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood")]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found
```
2025-02-27 16:25:30 -08:00
Xi Yan
663c6b0537
fix: duplicate ToolResponseMessage in Turn message history (#1305)
# What does this PR do?

- Reproduce with:
https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py

- **Root cause**: when we have ToolResponseMessage as part of Turn, we
will create duplicate ToolResponseMessage in the conversation history
when getting messages from a Turn.
- Fix: avoid adding duplicate ToolResponseMessage from a turn's
input_messages.
- If it is part of a Turn's steps, only add it when processing the
steps.
   - If it is not part of a Turn's steps, add it. 

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py --inference-model meta-llama/Llama-3.1-8B-Instruct
```


```
python -m examples.agents.e2e_loop_with_client_tools localhost 8321 
```

```python
Turn(
│   input_messages=[
│   │   UserMessage(
│   │   │   content='What was the closing price of Google stock (ticker symbol GOOG) for 2023 ?',
│   │   │   role='user',
│   │   │   context=None
│   │   ),
│   │   ToolResponseMessage(
│   │   │   call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94',
│   │   │   content='"[{\\"(\'Year\', \'\')\\":2023,\\"(\'Close\', \'GOOG\')\\":140.4254302979}]"',
│   │   │   role='tool',
│   │   │   tool_name='get_ticker_data'
│   │   )
│   ],
│   output_message=CompletionMessage(
│   │   content='Note: The actual closing price for 2023 may not be available or may be different from the result obtained above. The result is based on a hypothetical call to the get_ticker_data function.',
│   │   role='assistant',
│   │   stop_reason='end_of_turn',
│   │   tool_calls=[]
│   ),
│   session_id='4c791107-f0d8-456e-a27f-aa2fdc72b871',
│   started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 412928, tzinfo=TzInfo(-08:00)),
│   steps=[
│   │   ShieldCallStep(
│   │   │   step_id='e0514587-b7d6-4bba-8609-8e05a3a46d8a',
│   │   │   step_type='shield_call',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 858382, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 425204, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[
│   │   │   │   │   ToolCall(
│   │   │   │   │   │   arguments={
│   │   │   │   │   │   │   'ticker_symbol': 'GOOG',
│   │   │   │   │   │   │   'start': '2023-01-01',
│   │   │   │   │   │   │   'end': '2023-12-31'
│   │   │   │   │   │   },
│   │   │   │   │   │   call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94',
│   │   │   │   │   │   tool_name='get_ticker_data'
│   │   │   │   │   )
│   │   │   │   ]
│   │   │   ),
│   │   │   step_id='a3ceec6a-f149-49d5-a1c2-db461e3f6e9f',
│   │   │   step_type='inference',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 910179, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 871130, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='f9339865-96ca-4425-af42-a87bab343e24',
│   │   │   step_type='shield_call',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 383013, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 944012, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   ToolExecutionStep(
│   │   │   step_id='e317b74a-c4f3-4845-99a3-7d93aa6ea6c8',
│   │   │   step_type='tool_execution',
│   │   │   tool_calls=[
│   │   │   │   ToolCall(
│   │   │   │   │   arguments={'ticker_symbol': 'GOOG', 'start': '2023-01-01', 'end': '2023-12-31'},
│   │   │   │   │   call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94',
│   │   │   │   │   tool_name='get_ticker_data'
│   │   │   │   )
│   │   │   ],
│   │   │   tool_responses=[
│   │   │   │   ToolResponse(
│   │   │   │   │   call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94',
│   │   │   │   │   content='"[{\\"(\'Year\', \'\')\\":2023,\\"(\'Close\', \'GOOG\')\\":140.4254302979}]"',
│   │   │   │   │   tool_name='get_ticker_data',
│   │   │   │   │   metadata=None
│   │   │   │   )
│   │   │   ],
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 718810, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 943792, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='c4236616-db89-4c04-ad04-f51cfb726385',
│   │   │   step_type='shield_call',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 958946, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 732680, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='Note: The actual closing price for 2023 may not be available or may be different from the result obtained above. The result is based on a hypothetical call to the get_ticker_data function.',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[]
│   │   │   ),
│   │   │   step_id='3386f896-2026-41e4-a60f-f6f3c3981cf6',
│   │   │   step_type='inference',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 74750, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 970724, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='bc57ac8c-f94e-4758-bf1a-0dd734eca1cf',
│   │   │   step_type='shield_call',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 443016, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 86726, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   )
│   ],
│   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 459456, tzinfo=TzInfo(-08:00)),
│   output_attachments=[]
)
```

```python
Turn(
│   input_messages=[
│   │   UserMessage(content='What is 40+30?', role='user', context=None),
│   │   ToolResponseMessage(
│   │   │   call_id='8e54aca9-244d-44ca-ada0-0365090e8622',
│   │   │   content='{"success": true, "result": 70.0}',
│   │   │   role='tool',
│   │   │   tool_name='calculator'
│   │   )
│   ],
│   output_message=CompletionMessage(
│   │   content='The result of the calculation is 70.',
│   │   role='assistant',
│   │   stop_reason='end_of_turn',
│   │   tool_calls=[]
│   ),
│   session_id='4c791107-f0d8-456e-a27f-aa2fdc72b871',
│   started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 156903, tzinfo=TzInfo(-08:00)),
│   steps=[
│   │   ShieldCallStep(
│   │   │   step_id='17b6b645-31cc-4be9-a758-a4f3b741ced9',
│   │   │   step_type='shield_call',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 780564, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 174515, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[
│   │   │   │   │   ToolCall(
│   │   │   │   │   │   arguments={'x': 40.0, 'y': 30.0, 'operation': 'add'},
│   │   │   │   │   │   call_id='8e54aca9-244d-44ca-ada0-0365090e8622',
│   │   │   │   │   │   tool_name='calculator'
│   │   │   │   │   )
│   │   │   │   ]
│   │   │   ),
│   │   │   step_id='f59e951a-2b75-497d-a075-ec9aad9aad12',
│   │   │   step_type='inference',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 141869, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 792047, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='efafa0cf-23b9-4a90-8350-3a186d80925d',
│   │   │   step_type='shield_call',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 766293, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 177473, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   ToolExecutionStep(
│   │   │   step_id='877cfbe7-57a8-4056-9c29-49aa38dd337c',
│   │   │   step_type='tool_execution',
│   │   │   tool_calls=[
│   │   │   │   ToolCall(
│   │   │   │   │   arguments={'x': 40.0, 'y': 30.0, 'operation': 'add'},
│   │   │   │   │   call_id='8e54aca9-244d-44ca-ada0-0365090e8622',
│   │   │   │   │   tool_name='calculator'
│   │   │   │   )
│   │   │   ],
│   │   │   tool_responses=[
│   │   │   │   ToolResponse(
│   │   │   │   │   call_id='8e54aca9-244d-44ca-ada0-0365090e8622',
│   │   │   │   │   content='{"success": true, "result": 70.0}',
│   │   │   │   │   tool_name='calculator',
│   │   │   │   │   metadata=None
│   │   │   │   )
│   │   │   ],
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 930899, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 177202, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='d47c6160-45d9-47c1-8e39-2faae65ee468',
│   │   │   step_type='shield_call',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 3, 510402, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 949433, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='The result of the calculation is 70.',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[]
│   │   │   ),
│   │   │   step_id='660ba1cc-770e-471c-bf6e-11e103d74443',
│   │   │   step_type='inference',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 4, 814944, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 3, 521309, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='4dab8bb0-7d38-4465-ae1a-10069de2b3d1',
│   │   │   step_type='shield_call',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 5, 428561, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 4, 825970, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   )
│   ],
│   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 5, 462823, tzinfo=TzInfo(-08:00)),
│   output_attachments=[]
)
```


[//]: # (## Documentation)
2025-02-27 15:06:47 -08:00
Xi Yan
fc5aff3ccf
feat: ability to retrieve agents session, turn, step by ids (#1286)
# What does this PR do?

- Fix up rotten implementation for retrieving agent's Session, Turn,
Step with actual working implementation.

- Update `getting_started` notebook with retrieving by agent session_id.
https://github.com/meta-llama/llama-stack/blob/export_agent_dataset/docs/getting_started.ipynb

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Test with script:
https://gist.github.com/yanxi0830/657cecee8f1f0e39d322963d9c0f598e

<img width="503" alt="image"
src="https://github.com/user-attachments/assets/5ea9bc33-83d1-40bc-98e1-b68393158387"
/>


[//]: # (## Documentation)
2025-02-27 09:45:14 -08:00
ehhuang
0762c61402
feat: don't silently ignore incorrect toolgroup (#1285) 2025-02-27 08:11:09 -05:00
ehhuang
c8a20b8ed0
feat: allow specifying specific tool within toolgroup (#1239)
Summary:

E.g. `builtin::rag::knowledge_search`

Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/ --safety-shield meta-llama/Llama-Guard-3-8B
```
2025-02-26 14:07:05 -08:00
ehhuang
fca84db5b0
fix: time logging format (#1281)
Summary:
missed in last PR

Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py::test_create_turn_response --safety-shield meta-llama/Llama-Guard-3-8B
```
2025-02-26 13:51:33 -08:00
ehhuang
bb2690f176
feat: remove special handling of builtin::rag tool (#1015)
Summary:

Lets the model decide which tool it needs to call to respond to a query.

Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/ --safety-shield meta-llama/Llama-Guard-3-8B
```

Also evaluated on a small benchmark with 20 questions from HotpotQA.
With this PR and some prompting, the performance is 77% recall compared
to 50% currently.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1015).
* #1268
* #1239
* __->__ #1015
2025-02-26 13:04:52 -08:00
Ben Browning
c64f0d5888
fix: Get builtin tool calling working in remote-vllm (#1236)
# What does this PR do?

This PR makes a couple of changes required to get the test
`tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search`
passing on the remote-vllm provider.

First, we adjust agent_instance to also pass in the description and
parameters of builtin tools. We need these parameters so we can pass the
tool's expected parameters into vLLM. The meta-reference implementations
may not have needed these for builtin tools, as they are able to take
advantage of the Llama-model specific support for certain builtin tools.
However, with vLLM, our server-side chat templates for tool calling
treat all tools the same and don't separate out Llama builtin vs custom
tools. So, we need to pass the full set of parameter definitions and
list of required parameters for builtin tools as well.

Next, we adjust the vllm streaming chat completion code to fix up some
edge cases where it was returning an extra ChatCompletionResponseEvent
with an empty ToolCall with empty string call_id, tool_name, and
arguments properties. This is a bug discovered after the above fix,
where after a successful tool invocation we were sending extra chunks
back to the client with these empty ToolCalls.

## Test Plan

With these changes, the following test that previously failed now
passes:

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v \
tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search \
--inference-model "meta-llama/Llama-3.2-3B-Instruct"
```

Additionally, I ran the remote-vllm client-sdk and provider inference
tests as below to ensure they all still passed with this change:

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v \
tests/client-sdk/inference/test_text_inference.py \
--inference-model "meta-llama/Llama-3.2-3B-Instruct"
```

```
VLLM_URL="http://localhost:8000/v1" \
python -m pytest -s -v \
llama_stack/providers/tests/inference/test_text_inference.py \
--providers "inference=vllm_remote"
```


[//]: # (## Documentation)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-02-26 15:25:47 -05:00
Yuan Tang
1a044ef894
fix: Raise exception when tool call result is None (#1253)
# What does this PR do?

When there are issues with the tool call function, an exception is
raised but the error message is not informative. This adds a clearer
message to tell users to check their functions.
```
Traceback (most recent call last):
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/distribution/server/server.py", line 208, in sse_generator
    async for item in event_gen:
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agents.py", line 165, in _create_agent_turn_streaming
    async for event in agent.create_and_execute_turn(request):
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 197, in create_and_execute_turn
    async for chunk in self.run(
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 389, in run
    async for res in self._run(
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 811, in _run
    content=tool_result.content,
AttributeError: 'NoneType' object has no attribute 'content'
```

## Test Plan

Ran the same script and exception is raised with clearer error message.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-25 13:10:50 -05:00
ehhuang
dc3c881ffe
fix: include timezone in Agent steps' timestamps (#1247)
Summary:

kotlin SDK expects this format

Test Plan:

python prints the expected format
>>> str(datetime.now().astimezone())
'2025-02-24 22:02:58.729763-08:00'
2025-02-25 09:49:25 -08:00
ehhuang
25fddccfd8
feat: tool outputs metadata (#1155)
Summary:

Allows tools to output metadata. This is useful for evaluating tool
outputs, e.g. RAG tool will output document IDs, which can be used to
score recall.

Will need to make a similar change on the client side to support
ClientTool outputting metadata.

Test Plan:

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/client-sdk/agents/test_agents.py
2025-02-21 13:15:31 -08:00
Xi Yan
0fe071764f
feat(1/n): api: unify agents for handling server & client tools (#1178)
# Problem

Our current Agent framework has discrepancies in definition on how we
handle server side and client side tools.

1. Server Tools: a single Turn is returned including `ToolExecutionStep`
in agenst
2. Client Tools: `create_agent_turn` is called in loop with client agent
lib yielding the agent chunk

ad6ffc63df/src/llama_stack_client/lib/agents/agent.py (L186-L211)

This makes it inconsistent to work with server & client tools. It also
complicates the logs to telemetry to get information about agents turn /
history for observability.

#### Principle
The same `turn_id` should be used to represent the steps required to
complete a user message including client tools.

## Solution

1. `AgentTurnResponseEventType.turn_awaiting_input` status to indicate
that the current turn is not completed, and awaiting tool input
2. `continue_agent_turn` endpoint to update agent turn with client's
tool response.


# What does this PR do?
- Skeleton API as example

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

- Just API update, no functionality change
```
llama stack run + client-sdk test
```

<img width="842" alt="image"
src="https://github.com/user-attachments/assets/7ac56b5f-f424-4632-9476-7e0f57555bc3"
/>


[//]: # (## Documentation)
2025-02-21 11:48:27 -08:00
ehhuang
ab2b46e528
feat: log start, complete time to Agent steps (#1116) 2025-02-14 17:48:06 -08:00
Ashwin Bharambe
314ee09ae3
chore: move all Llama Stack types from llama-models to llama-stack (#1098)
llama-models should have extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the llama
models and document the prompt formats, etc.

This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279

## Test Plan

Ensure all `llama` CLI `model` sub-commands work:

```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```

Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```

Create a fresh venv `uv venv && source .venv/bin/activate` and run
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv` <-- the server runs

Also checked that the OpenAPI generator can run and there is no change
in the generated files as a result.

```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```
2025-02-14 09:10:59 -08:00
Sébastien Han
e4a1579e63
build: format codebase imports using ruff linter (#1028)
# What does this PR do?

- Configured ruff linter to automatically fix import sorting issues.
- Set --exit-non-zero-on-fix to ensure non-zero exit code when fixes are
applied.
- Enabled the 'I' selection to focus on import-related linting rules.
- Ran the linter, and formatted all codebase imports accordingly.
- Removed the black dep from the "dev" group since we use ruff

Signed-off-by: Sébastien Han <seb@redhat.com>

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-13 10:06:21 -08:00
Xi Yan
66d7e15c93
perf: ensure ToolCall in ChatCompletionResponse is subset of ChatCompletionRequest.tools (#1041)
# What does this PR do?

**Problem**
- Using script:
https://gist.github.com/thoraxe/6163b2145ce7b1c24c6026b64cf90085

- This hits an issue on server with `code_interpreter` not found, as we
do not pass "builtin::code_interpreter" in AgentConfig's `toolgroups`.

This is a general issue where model always tries to output
`code_interpreter` in `ToolCall` even when we do not have
`code_interpreter` available for execution.

**Reproduce Deeper Problem in chat-completion**
- Use script:
https://gist.github.com/yanxi0830/163a9ad7b5db10556043fbfc7ecd7603

1. We currently always populate `code_interpreter` in `ToolCall` in
ChatCompletionResponse if the model's response begins with
`<|python_tag|>`. See
c5f5958498/models/llama3/api/chat_format.py (L200-L213)

<img width="913" alt="image"
src="https://github.com/user-attachments/assets/328d313d-0a0b-495c-8715-61cca9ccc4a6"
/>

2. This happens even if we do not pass the `code_interpreter` as a
`tools` in ChatCompletionRequest.

**This PR**

Explicitly make sure that the tools returned in
`ChatCompletionResponse.tool_calls` is always a tool requested by
`ChatCompletionRequest.tools`.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

**Before**
<img width="913" alt="image"
src="https://github.com/user-attachments/assets/328d313d-0a0b-495c-8715-61cca9ccc4a6"
/>
<img width="997" alt="image"
src="https://github.com/user-attachments/assets/d3e82b62-b142-4939-954c-62843bec7110"
/>


**After**
<img width="856" alt="image"
src="https://github.com/user-attachments/assets/2c70ce55-c8d0-45ea-b10f-f70adc50d3d9"
/>
<img width="1000" alt="image"
src="https://github.com/user-attachments/assets/b5e81826-c35b-4052-bf81-7afff93ce2ef"
/>



**Unit Test**
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request --inference-model "meta-llama/Llama-3.3-70B-Instruct"
```

```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/
```
<img width="1002" alt="image"
src="https://github.com/user-attachments/assets/04808517-eded-4122-97f5-7e5142de9779"
/>



**Streaming**
- Chat Completion
<img width="902" alt="image"
src="https://github.com/user-attachments/assets/f477bc86-bd38-4729-b49e-a0a6ed3f835a"
/>

- Agent
<img width="916" alt="image"
src="https://github.com/user-attachments/assets/f4cc3417-23cd-46b1-953d-3a2271e79bbb"
/>


[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)
2025-02-11 18:31:35 -08:00
ehhuang
3922999118
sys_prompt support in Agent (#938)
# What does this PR do?

The current default system prompt for llama3.2 tends to overindex on
tool calling and doesn't work well when the prompt does not require tool
calling.

This PR adds an option to override the default system prompt, and
organizes tool-related configs into a new config object.

- [ ] Addresses issue (#issue)


## Test Plan


LLAMA_STACK_CONFIG=together pytest
\-\-inference\-model=meta\-llama/Llama\-3\.3\-70B\-Instruct -s -v
tests/client-sdk/agents/test_agents.py::test_override_system_message_behavior


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-02-05 21:11:32 -08:00
Yuan Tang
34ab7a3b6c
Fix precommit check after moving to ruff (#927)
Lint check in main branch is failing. This fixes the lint check after we
moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We
need to move to a `ruff.toml` file as well as fixing and ignoring some
additional checks.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-02 06:46:45 -08:00
Hardik Shah
97eb3eecea
Fix Agents to support code and rag simultaneously (#908)
# What does this PR do?

Fixes a bug where agents were not working when both rag and
code-interpreter were added as tools.


## Test Plan

Added a new client_sdk test which tests for this scenario 
```
LLAMA_STACK_CONFIG=together pytest -s -v  tests/client-sdk -k 'test_rag_and_code_agent'
```

---------

Co-authored-by: Hardik Shah <hjshah@fb.com>
2025-01-30 17:09:34 -08:00
Zhonglin Han
229f0d5f7c
Agent response format (#660)
# What does this PR do?

Add response format for agents structured output.

- [ ] Using structured output for agents (interior_design app as an
example) (#issue)
https://github.com/meta-llama/llama-stack-apps/issues/122


## Test Plan
E2E test plan with llama-stack-apps interior_design

Please describe:
 Test ran: 

 - provide instructions so it can be reproduced.
 Start your distro:
llama stack run llama_stack/templates/fireworks/run.yaml --env
FIREWORKS_API_KEY=<API_KEY>
 
Run api test:
```PYTHONPATH=. python examples/interior_design_assistant/api.py localhost 5000 examples/interior_design_assistant/resources/documents/ examples/interior_design_assistant/resources/images/fireplaces```


## Sources
Results: 
https://github.com/meta-llama/llama-stack-client-python/pull/72

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-01-28 05:05:38 -08:00
Ashwin Bharambe
f3d8864c36 Rename builtin::memory -> builtin::rag 2025-01-22 20:22:51 -08:00
Ashwin Bharambe
08dcb9e31e Accept "query_config" params for the RAG tool 2025-01-22 16:42:36 -08:00
Ashwin Bharambe
07b87365ab
[inference api] modify content types so they follow a more standard structure (#841)
Some small updates to the inference types to make them more standard

Specifically:
- image data is now located in a "image" subkey
- similarly tool call data is located in a "tool_call" subkey

The pattern followed is `dict(type="foo", foo=<...>)`
2025-01-22 12:16:18 -08:00
Ashwin Bharambe
a63a43c646
[memory refactor][6/n] Update naming and routes (#839)
Making a few small naming changes as per feedback:

- RAGToolRuntime methods are called `insert` and `query` to keep them
more general
- The tool names are changed to non-namespaced forms
`insert_into_memory` and `query_from_memory`
- The REST endpoints are more REST-ful
2025-01-22 10:39:13 -08:00
Ashwin Bharambe
c9e5578151
[memory refactor][5/n] Migrate all vector_io providers (#835)
See https://github.com/meta-llama/llama-stack/issues/827 for the broader
design.

This PR finishes off all the stragglers and migrates everything to the
new naming.
2025-01-22 10:17:59 -08:00
Ashwin Bharambe
1a7490470a
[memory refactor][3/n] Introduce RAGToolRuntime as a specialized sub-protocol (#832)
See https://github.com/meta-llama/llama-stack/issues/827 for the broader
design.

Third part:
- we need to make `tool_runtime.rag_tool.query_context()` and
`tool_runtime.rag_tool.insert_documents()` methods work smoothly with
complete type safety. To that end, we introduce a sub-resource path
`tool-runtime/rag-tool/` and make changes to the resolver to make things
work.
- the PR updates the agents implementation to directly call these typed
APIs for memory accesses rather than going through the complex, untyped
"invoke_tool" API. the code looks much nicer and simpler (expectedly.)
- there are a number of hacks in the server resolver implementation
still, we will live with some and fix some

Note that we must make sure the client SDKs are able to handle this
subresource complexity also. Stainless has support for subresources, so
this should be possible but beware.

## Test Plan

Our RAG test is sad (doesn't actually test for actual RAG output) but I
verified that the implementation works. I will work on fixing the RAG
test afterwards.

```bash
pytest -s -v tests/agents/test_agents.py -k "rag and together" --safety-shield=meta-llama/Llama-Guard-3-8B
```
2025-01-22 10:04:16 -08:00
Dinesh Yeduguru
7fb2c1c48d
More idiomatic REST API (#765)
# What does this PR do?

This PR changes our API to follow more idiomatic REST API approaches of
having paths being resources and methods indicating the action being
performed.

Changes made to generator:
1) removed the prefix check of "get" as its not required and is actually
needed for other method types too
2) removed _ check on path since variables can have "_"



## Test Plan

LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v
tests/client-sdk/agents/test_agents.py
2025-01-15 13:20:09 -08:00
Ashwin Bharambe
d9d34433fc Update spec 2025-01-13 23:16:53 -08:00
Ashwin Bharambe
9a5803a429 move all implementations to use updated type 2025-01-13 23:16:53 -08:00
Dinesh Yeduguru
a5c57cd381
agents to use tools api (#673)
# What does this PR do?

PR #639 introduced the notion of Tools API and ability to invoke tools
through API just as any resource. This PR changes the Agents to start
using the Tools API to invoke tools. Major changes include:
1) Ability to specify tool groups with AgentConfig
2) Agent gets the corresponding tool definitions for the specified tools
and pass along to the model
3) Attachements are now named as Documents and their behavior is mostly
unchanged from user perspective
4) You can specify args that can be injected to a tool call through
Agent config. This is especially useful in case of memory tool, where
you want the tool to operate on a specific memory bank.
5) You can also register tool groups with args, which lets the agent
inject these as well into the tool call.
6) All tests have been migrated to use new tools API and fixtures
including client SDK tests
7) Telemetry just works with tools API because of our trace protocol
decorator


## Test Plan
```
pytest -s -v -k fireworks llama_stack/providers/tests/agents/test_agents.py  \
   --safety-shield=meta-llama/Llama-Guard-3-8B \
   --inference-model=meta-llama/Llama-3.1-8B-Instruct

pytest -s -v -k together  llama_stack/providers/tests/tools/test_tools.py \
   --safety-shield=meta-llama/Llama-Guard-3-8B \
   --inference-model=meta-llama/Llama-3.1-8B-Instruct

LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml" pytest -v tests/client-sdk/agents/test_agents.py
```
run.yaml:
https://gist.github.com/dineshyv/0365845ad325e1c2cab755788ccc5994

Notebook:
https://colab.research.google.com/drive/1ck7hXQxRl6UvT-ijNRZ-gMZxH1G3cN2d?usp=sharing
2025-01-08 19:01:00 -08:00
Aidan Do
49ad168336
[#407] Agents: Avoid calling tools that haven't been explicitly enabled (#637)
# What does this PR do?

Contributes to issue (#407)

tl;dr - @subramen was getting a 500 error because llama-stack called
code_interpreter when it never was defined as a tool.

Prevents failures like:

<img width="544" alt="image"
src="https://github.com/user-attachments/assets/392683d2-4670-414c-aaba-07ebc006d748"
/>

```
# Server side
Traceback (most recent call last):
  File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 206, in sse_generator
    async for item in await event_gen:
  File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agents.py", line 138, in _create_agent_turn_streaming
    async for event in agent.create_and_execute_turn(request):
  File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agent_instance.py", line 179, in create_and_execute_turn
    async for chunk in self.run(
  File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agent_instance.py", line 252, in run
    async for res in self._run(
  File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agent_instance.py", line 560, in _run
    result_messages = await execute_tool_call_maybe(
  File "/opt/conda/envs/llamastack-vllm-stack/lib/python3.10/site-packages/llama_stack/providers/impls/meta_reference/agents/agent_instance.py", line 824, in execute_tool_call_maybe
    assert name in tools_dict, f"Tool {name} not found"
AssertionError: Tool code_interpreter not found
```

Instead, if the model hallucinates, we just let it hallucinate and let
the client know.

<img width="544" alt="image"
src="https://github.com/user-attachments/assets/d2418583-d45a-48db-b476-45a584f2986f"
/>

## Test Plan

<details>
<summary>pytest llama_stack/providers/tests/agents/test_agents.py -k
ollama</summary>

```
llama stack build --template ollama --image-type conda 
conda activate llamastack-ollama
```

```
llama_stack/providers/tests/agents/test_agents.py ..Fss                                                                                          [100%]

======================================================================= FAILURES =======================================================================
_________________________________________ TestAgents.test_rag_agent_as_attachments[--ollama][ollama] __________________________________________
llama_stack/providers/tests/agents/test_agents.py:261: in test_rag_agent_as_attachments
    turn_response = [
llama_stack/providers/tests/agents/test_agents.py:261: in <listcomp>
    turn_response = [
llama_stack/providers/inline/agents/meta_reference/agents.py:153: in _create_agent_turn_streaming
    async for event in agent.create_and_execute_turn(request):
llama_stack/providers/inline/agents/meta_reference/agent_instance.py:179: in create_and_execute_turn
    async for chunk in self.run(
llama_stack/providers/inline/agents/meta_reference/agent_instance.py:250: in run
    async for res in self._run(
llama_stack/providers/inline/agents/meta_reference/agent_instance.py:363: in _run
    rag_context, bank_ids = await self._retrieve_context(
llama_stack/providers/inline/agents/meta_reference/agent_instance.py:698: in _retrieve_context
    bank_id = await self._ensure_memory_bank(session_id)
llama_stack/providers/inline/agents/meta_reference/agent_instance.py:653: in _ensure_memory_bank
    await self.memory_banks_api.register_memory_bank(
llama_stack/providers/utils/telemetry/trace_protocol.py:101: in async_wrapper
    result = await method(self, *args, **kwargs)
llama_stack/distribution/routers/routing_tables.py:312: in register_memory_bank
    raise ValueError(
E   ValueError: Embeddings are now served via Inference providers. Please upgrade your run.yaml to include inline::sentence-transformer as an additional inference provider. See https://github.com/meta-llama/llama-stack/blob/main/llama_stack/templates/together/run.yaml for an example.
=============================================================== short test summary info ================================================================
FAILED llama_stack/providers/tests/agents/test_agents.py::TestAgents::test_rag_agent_as_attachments[--ollama] - ValueError: Embeddings are now served via Inference providers. Please upgrade your run.yaml to include inline::sentence-transformer as an additiona...
========================================== 1 failed, 2 passed, 2 skipped, 20 deselected, 5 warnings in 14.24s ==========================================
```

Unrelated test is failing (also failing on main)
</details>

<details>
<summary>Manual</summary>

Using this client code:
7ebc257b27/client.py

<img width="544" alt="Screenshot 2024-12-16 at 17 41 31"
src="https://github.com/user-attachments/assets/7425deaf-c94a-4dda-a635-922728e373f1"
/>

</details>

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-01-02 09:21:35 -08:00
Xi Yan
3c72c034e6
[remove import *] clean up import *'s (#689)
# What does this PR do?

- as title, cleaning up `import *`'s
- upgrade tests to make them more robust to bad model outputs
- remove import *'s in llama_stack/apis/* (skip __init__ modules)
<img width="465" alt="image"
src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2"
/>

- run `sh run_openapi_generator.sh`, no types gets affected

## Test Plan

### Providers Tests

**agents**
```
pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8
```

**inference**
```bash
# meta-reference
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

# together
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py

pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py 
```

**safety**
```
pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B
```

**memory**
```
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384
```

**scoring**
```
pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
```


**datasetio**
```
pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py
pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py
```


**eval**
```
pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py
```

### Client-SDK Tests
```
LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk
```

### llama-stack-apps
```
PORT=5000
LOCALHOST=localhost

python -m examples.agents.hello $LOCALHOST $PORT
python -m examples.agents.inflation $LOCALHOST $PORT
python -m examples.agents.podcast_transcript $LOCALHOST $PORT
python -m examples.agents.rag_as_attachments $LOCALHOST $PORT
python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT
python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT
python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT

# Vision model
python -m examples.interior_design_assistant.app
python -m examples.agent_store.app $LOCALHOST $PORT
```

### CLI
```
which llama
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
llama model list
llama stack list-apis
llama stack list-providers inference

llama stack build --template ollama --image-type conda
```

### Distributions Tests
**ollama**
```
llama stack build --template ollama --image-type conda
ollama run llama3.2:1b-instruct-fp16
llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
```

**fireworks**
```
llama stack build --template fireworks --image-type conda
llama stack run ./llama_stack/templates/fireworks/run.yaml
```

**together**
```
llama stack build --template together --image-type conda
llama stack run ./llama_stack/templates/together/run.yaml
```

**tgi**
```
llama stack run ./llama_stack/templates/tgi/run.yaml --env TGI_URL=http://0.0.0.0:5009 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
```

## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-12-27 15:45:44 -08:00