Commit graph

1332 commits

Author SHA1 Message Date
Botao Chen
357141f6de refine 2025-03-09 18:14:26 -07:00
Botao Chen
41f86b5ce6 refine 2025-03-09 17:30:23 -07:00
Botao Chen
51282456b9 refine 2025-03-09 17:17:18 -07:00
Botao Chen
0b92fef3ba refine 2025-03-03 23:53:27 -08:00
Botao Chen
840fd353f7 init commit 2025-03-03 23:41:11 -08:00
Dinesh Yeduguru
7f9b767277
fix: check conda env name using basepath in exec.py (#1301)
# What does this PR do?
check conda env name using basepath in exec.py
The current logic for finding conda prefix does a `endswith` check with
just the conda env name, but this will cause us to match incorrect if
there is a different conda env which ends with same suffix. In my case,
i had stack and llama-stack as the two conda envs.

## Test Plan
llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
2025-02-27 23:07:23 -08:00
Hardik Shah
8efa53daf1
fix: Agent telemetry inputs/outputs should be structured (#1302)
Original telemetry outputs for agent turns look like this. 
Note: how output was a `str(message)` making it difficult to read them
back for downstream tasks ( eg. building eval datasets )
```
{
│   │   'input': [
│   │   │   '{"role":"system","content":"You are a helpful assistant. Use search tool to answer the questions. "}',
│   │   │   '{"role":"user","content":"Which teams played in the NBA western conference finals of 2024","context":null}'
│   │   ],
│   │   'output': "content:  tool_calls: [ToolCall(call_id='8b7294ec-a83f-4798-ad8f-6bed662f08b6', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'NBA Western Conference Finals 2024 teams'})]"
│   },
``` 

Updated the outputs to be structured .

## Test 

```python
import uuid

from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant who will use the web search tools to help with answering questions.\nOnly provide final answer in short without writing full sentences. Use web search",
    toolgroups=["builtin::websearch"],
    enable_session_persistence=True,
)

agent = Agent(client, agent_config)

session_id = agent.create_session(uuid.uuid4().hex)
response = agent.create_turn(
    messages=[
        {
            "role": "user",
            "content": "latest news about llama stack",
        }
    ],
    session_id=session_id,
    stream=False,
)

pprint(response)
```
Output: 
```
Turn(
│   input_messages=[UserMessage(content='latest news about llama stack', role='user', context=None)],
│   output_message=CompletionMessage(
│   │   content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│   │   role='assistant',
│   │   stop_reason='end_of_turn',
│   │   tool_calls=[]
│   ),
│   session_id='77379546-4598-485a-b4f4-84e5da28c513',
│   started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915243, tzinfo=TzInfo(-08:00)),
│   steps=[
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[
│   │   │   │   │   ToolCall(
│   │   │   │   │   │   arguments={'query': 'latest news llama stack'},
│   │   │   │   │   │   call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│   │   │   │   │   │   tool_name='brave_search'
│   │   │   │   │   )
│   │   │   │   ]
│   │   │   ),
│   │   │   step_id='81c16bd3-eb00-4721-8edc-f386e07391a3',
│   │   │   step_type='inference',
│   │   │   turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 637149, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915831, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ToolExecutionStep(
│   │   │   step_id='4782d609-a62e-45f5-8d2a-25a43db46288',
│   │   │   step_type='tool_execution',
│   │   │   tool_calls=[
│   │   │   │   ToolCall(
│   │   │   │   │   arguments={'query': 'latest news llama stack'},
│   │   │   │   │   call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│   │   │   │   │   tool_name='brave_search'
│   │   │   │   )
│   │   │   ],
│   │   │   tool_responses=[
│   │   │   │   ToolResponse(
│   │   │   │   │   call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│   │   │   │   │   content='{"query": "latest news llama stack", "top_k": [{"title": "Llama 3.2: Revol. .......  Hacker News.", "score": 0.6186197, "raw_content": null}]}',
│   │   │   │   │   tool_name='brave_search',
│   │   │   │   │   metadata=None
│   │   │   │   )
│   │   │   ],
│   │   │   turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 272176, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 640743, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[]
│   │   │   ),
│   │   │   step_id='37994419-5da3-4e84-a010-8d9b85366262',
│   │   │   step_type='inference',
│   │   │   turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 961275, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 273168, tzinfo=TzInfo(-08:00))
│   │   )
│   ],
│   turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│   completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 962318, tzinfo=TzInfo(-08:00)),
│   output_attachments=[]
)

```

## Check for Telemetry 
```python 

agent_logs = []
for span in client.telemetry.query_spans(
    attribute_filters=[
      {"key": "session_id", "op": "eq", "value": session_id},
    ],
    attributes_to_return=['input', 'output'],
):
    agent_logs.append(span.attributes)

pprint(json.loads(agent_logs[-1]['output']))
```
```
{
│   'content': "The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│   'tool_calls': []
}
```
2025-02-27 23:06:37 -08:00
ehhuang
caffafd101
feat: update the default system prompt for 3.2/3.3 models (#1310)
# Summary:
The current prompt doesn't work well and tend to overindex on tool
calling. This PR is not perfect, but should be an improvement over the
current prompt. We can keep iterating.

# Test Plan:

Ran on a (small) eval with 20 HotpotQA examples.

With current prompt:
https://gist.github.com/ehhuang/9f967e62751907165eb13781ea968f5c
{
│ 'basic::equality': {'accuracy': {'accuracy': 0.2, 'num_correct': 4.0,
'num_total': 20}},
│   'F1ScoringFn': {
│   │   'f1_average': 0.25333333333333335,
│   │   'precision_average': 0.23301767676767676,
│   │   'recall_average': 0.375
│   }
}


num_tool_calls=[5, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 2, 2, 1, 1, 2, 1, 2,
2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0


#########################################################
With new prompt:
https://gist.github.com/ehhuang/6e4a8ecf54db68922c2be8700056f962

{
│ 'basic::equality': {'accuracy': {'accuracy': 0.25, 'num_correct': 5.0,
'num_total': 20}},
│   'F1ScoringFn': {
│   │   'f1_average': 0.35579260478321006,
│   │   'precision_average': 0.32030238933180105,
│   │   'recall_average': 0.6091666666666666
│   }
}


num_tool_calls=[2, 1, 1, 5, 5, 5, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3,
2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0


The answers have higher recall, and make fewer tool calls. Note that
these were run with max_infer_iter=5, so the current prompt hits this
limit more often, and without the limit, someitmes goes into infinite
tool calling loop.

The data here is with 3.3-70B. Results are equally poor with either
prompt with 3.2-3B ~30 recall.
2025-02-27 23:05:42 -08:00
Ashwin Bharambe
ece354eedd test: dont hardcode faiss as provider in the tests please 2025-02-27 22:54:34 -08:00
Ashwin Bharambe
4c8a0fa8dc fix: ensure ollama embedding model is registered properly in the template 2025-02-27 22:49:06 -08:00
Hardik Shah
999195fe5b
fix: [Litellm]Do not swallow first token (#1316)
`ChatCompletionResponseEventType: start` is ignored and not yielded in
the agent_instance as we expect that to not have any content.

However, litellm sends first event as `ChatCompletionResponseEventType:
start` with content ( which was the first token that we were skipping )

```
LLAMA_STACK_CONFIG=dev pytest -s -v tests/client-sdk/agents/test_agents.py --inference-model "openai/gpt-4o-mini" -k test_agent_simple
``` 
This was failing before ( since the word hello was not in the final
response )
2025-02-27 20:53:47 -08:00
Xi Yan
7780fc92d5
fix: update getting_started notebook to pass nbeval (#1318)
# What does this PR do?

- See
3796667776
- Together's structured decoding API is flaky, add skip to cell
- Enable cell 21 to pass cell 21-23

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan


<img width="652" alt="image"
src="https://github.com/user-attachments/assets/a1e4b94b-c1ce-4869-ba0d-0860bfe33460"
/>


[//]: # (## Documentation)
2025-02-27 23:13:00 -05:00
Yuan Tang
6824d23dc9
test: Only run embedding tests for remote::nvidia (#1317)
This fixes release build failure
3796497240:

```
=================================== FAILURES ===================================
______ test_embedding_truncation_error[txt=8B:emb=MiniLM-long-text-None] _______
llama-stack/tests/client-sdk/inference/test_embedding.py:166: in test_embedding_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
______ test_embedding_truncation_error[txt=8B:emb=MiniLM-long-text-none] _______
llama-stack/tests/client-sdk/inference/test_embedding.py:166: in test_embedding_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
_______ test_embedding_truncation_error[txt=8B:emb=MiniLM-long-str-None] _______
llama-stack/tests/client-sdk/inference/test_embedding.py:166: in test_embedding_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
_______ test_embedding_truncation_error[txt=8B:emb=MiniLM-long-str-none] _______
llama-stack/tests/client-sdk/inference/test_embedding.py:166: in test_embedding_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
_________ test_embedding_text_truncation_error[txt=8B:emb=MiniLM-NONE] _________
llama-stack/tests/client-sdk/inference/test_embedding.py:223: in test_embedding_text_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
_________ test_embedding_text_truncation_error[txt=8B:emb=MiniLM-END] __________
llama-stack/tests/client-sdk/inference/test_embedding.py:223: in test_embedding_text_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
________ test_embedding_text_truncation_error[txt=8B:emb=MiniLM-START] _________
llama-stack/tests/client-sdk/inference/test_embedding.py:223: in test_embedding_text_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
_________ test_embedding_text_truncation_error[txt=8B:emb=MiniLM-left] _________
llama-stack/tests/client-sdk/inference/test_embedding.py:223: in test_embedding_text_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
________ test_embedding_text_truncation_error[txt=8B:emb=MiniLM-right] _________
llama-stack/tests/client-sdk/inference/test_embedding.py:223: in test_embedding_text_truncation_error
    with pytest.raises(BadRequestError) as excinfo:
E   Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
=========================== short test summary info ============================
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_truncation_error[txt=8B:emb=MiniLM-long-text-None] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_truncation_error[txt=8B:emb=MiniLM-long-text-none] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_truncation_error[txt=8B:emb=MiniLM-long-str-None] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_truncation_error[txt=8B:emb=MiniLM-long-str-none] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_text_truncation_error[txt=8B:emb=MiniLM-NONE] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_text_truncation_error[txt=8B:emb=MiniLM-END] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_text_truncation_error[txt=8B:emb=MiniLM-START] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_text_truncation_error[txt=8B:emb=MiniLM-left] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
FAILED llama-stack/tests/client-sdk/inference/test_embedding.py::test_embedding_text_truncation_error[txt=8B:emb=MiniLM-right] - Failed: DID NOT RAISE <class 'llama_stack_client.BadRequestError'>
= 9 failed, 48 passed, 2 skipped, 3 deselected, 3 xfailed, 1 xpassed, 121 warnings in 90.16s (0:01:30) =
Error: Process completed with exit code 1.
```

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-27 22:35:52 -05:00
Yuan Tang
a9f5c5bfca
fix: Incorrect import path for print_subcommand_description() (#1315)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-27 18:50:41 -08:00
Yuan Tang
f4df3a76d9
fix: Incorrect import path for print_subcommand_description() (#1314)
# What does this PR do?

Missed this one additional import in
https://github.com/meta-llama/llama-stack/pull/1313

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-27 18:35:49 -08:00
Yuan Tang
3567274183
fix: Incorrect import path for print_subcommand_description() (#1313)
# What does this PR do?

This fixes release build failure:
3796356500

```
+ llama model prompt-format -m Llama3.2-11B-Vision-Instruct
Traceback (most recent call last):
  File "/tmp/tmp.PXMDlmD0x5/.venv/bin/llama", line 4, in <module>
    from llama_stack.cli.llama import main
  File "/tmp/tmp.PXMDlmD0x5/.venv/lib/python3.10/site-packages/llama_stack/cli/llama.py", line 10, in <module>
    from .model import ModelParser
  File "/tmp/tmp.PXMDlmD0x5/.venv/lib/python3.10/site-packages/llama_stack/cli/model/__init__.py", line 7, in <module>
    from .model import ModelParser  # noqa
  File "/tmp/tmp.PXMDlmD0x5/.venv/lib/python3.10/site-packages/llama_stack/cli/model/model.py", line 16, in <module>
    from llama_stack.cli.utils import print_subcommand_description
ModuleNotFoundError: No module named 'llama_stack.cli.utils'
```

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-27 21:24:01 -05:00
Xi Yan
076d2f349d
fix: litellm tool call parsing event type to in_progress (#1312)
# What does this PR do?

- Test with script:
https://gist.github.com/yanxi0830/64699f3604766ac2319421b750c5bf9c

- Agent with tool calls does not get correctly parsed with LiteLLM
provider b/c we skip processing
`ChatCompletionResponseEventType.complete`.
- However, LiteLLM spits out event_type="complete" with ToolCallDelta


2f7683bc5f/llama_stack/providers/inline/agents/meta_reference/agent_instance.py (L570-L577)


- Llama Model
```
ChatCompletionResponseStreamChunk(
│   event=Event(
│   │   delta=ToolCallDelta(
│   │   │   parse_status='succeeded',
│   │   │   tool_call=ToolCall(
│   │   │   │   arguments={'kind': 'pod', 'namespace': 'openshift-lightspeed'},
│   │   │   │   call_id='call_tIjWTUdsQXhQ2XHC5ke4EQY5',
│   │   │   │   tool_name='get_object_namespace_list'
│   │   │   ),
│   │   │   type='tool_call'
│   │   ),
│   │   event_type='progress',
│   │   logprobs=None,
│   │   stop_reason='end_of_turn'
│   ),
│   metrics=None
)
ChatCompletionResponseStreamChunk(
│   event=Event(
│   │   delta=TextDelta(text='', type='text'),
│   │   event_type='complete',
│   │   logprobs=None,
│   │   stop_reason='end_of_turn'
│   ),
│   metrics=None
)
```

- LiteLLM model
```
ChatCompletionResponseStreamChunk(
│   event=Event(
│   │   delta=ToolCallDelta(
│   │   │   parse_status='succeeded',
│   │   │   tool_call=ToolCall(
│   │   │   │   arguments={'kind': 'pod', 'namespace': 'openshift-lightspeed'},
│   │   │   │   call_id='call_tIjWTUdsQXhQ2XHC5ke4EQY5',
│   │   │   │   tool_name='get_object_namespace_list'
│   │   │   ),
│   │   │   type='tool_call'
│   │   ),
│   │   event_type='complete',
│   │   logprobs=None,
│   │   stop_reason='end_of_turn'
│   ),
│   metrics=None
)
ChatCompletionResponseStreamChunk(
│   event=Event(
│   │   delta=TextDelta(text='', type='text'),
│   │   event_type='complete',
│   │   logprobs=None,
│   │   stop_reason='end_of_turn'
│   ),
│   metrics=None
)
```


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

- Test with script:
https://gist.github.com/yanxi0830/64699f3604766ac2319421b750c5bf9c


[//]: # (## Documentation)
2025-02-27 18:00:27 -08:00
Hardik Shah
2f7683bc5f
fix: Structured outputs for recursive models (#1311)
Handle recursive nature in the structured response_formats. 

Update test to include 1 nested model.

```
 LLAMA_STACK_CONFIG=dev pytest -s -v tests/client-sdk/inference/test_text_inference.py --inference-model "openai/gpt-4o-mini" -k test_text_chat_completion_structured_output
```

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-02-27 17:31:53 -08:00
Reid
94e2186bb8
chore: add subcommands description in help (#1219)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

```
before:
$ llama
usage: llama [-h] {model,stack,download,verify-download} ...

Welcome to the Llama CLI

options:
  -h, --help            show this help message and exit

subcommands:
  {model,stack,download,verify-download}

$ llama model --help
usage: llama model [-h] {download,list,prompt-format,describe,verify-download,remove} ...

Work with llama models

options:
  -h, --help            show this help message and exit

model_subcommands:
  {download,list,prompt-format,describe,verify-download,remove}

$ llama stack --help
usage: llama stack [-h] [--version] {build,list-apis,list-providers,run} ...

Operations for the Llama Stack / Distributions

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

stack_subcommands:
  {build,list-apis,list-providers,run}

===================
after:
$ llama
usage: llama [-h] {model,stack,download,verify-download} ...

Welcome to the Llama CLI

options:
  -h, --help            show this help message and exit

subcommands:
  {model,stack,download,verify-download}

  model                 Work with llama models
  stack                 Operations for the Llama Stack / Distributions
  download              Download a model from llama.meta.com or Hugging Face Hub
  verify-download       Verify integrity of downloaded model files

$ llama model --help
usage: llama model [-h] {download,list,prompt-format,describe,verify-download,remove} ...

Work with llama models

options:
  -h, --help            show this help message and exit

model_subcommands:
  {download,list,prompt-format,describe,verify-download,remove}

  download              Download a model from llama.meta.com or Hugging Face Hub
  list                  Show available llama models
  prompt-format         Show llama model message formats
  describe              Show details about a llama model
  verify-download       Verify the downloaded checkpoints' checksums for models downloaded from Meta
  remove                Remove the downloaded llama model

$ llama stack --help
usage: llama stack [-h] [--version] {build,list-apis,list-providers,run} ...

Operations for the Llama Stack / Distributions

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

stack_subcommands:
  {build,list-apis,list-providers,run}

  build                 Build a Llama stack container
  list-apis             List APIs part of the Llama Stack implementation
  list-providers        Show available Llama Stack Providers for an API
  run                   Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-27 17:00:27 -08:00
Matthew Farrellee
e28cedd833
feat: add nvidia embedding implementation for new signature, task_type, output_dimention, text_truncation (#1213)
# What does this PR do?

updates nvidia inference provider's embedding implementation to use new
signature

add support for task_type, output_dimensions, text_truncation parameters

## Test Plan

`LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v
tests/client-sdk/inference/test_embedding.py --embedding-model
baai/bge-m3`
2025-02-27 16:58:11 -08:00
Luis Tomas Bolivar
73c6f6126f
fix: Avoid unexpected keyword argument for sentence_transformers (#1269)
Now that remote-vllm include inline::sentence_transformers there is an
issue building the image:
Error building stack:
SentenceTransformersInferenceConfig.sample_run_config() got an
unexpected keyword argument '__distro_dir__'

To avoid that issue this fix extends the sample_run_config to accept
extra kwargs
2025-02-27 16:47:26 -08:00
Reid
c2d2a80b0a
docs: update the output of llama-stack-client models list (#1271)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-27 16:46:38 -08:00
Yuan Tang
264c2c46db
build: Add dotenv file for running tests with uv (#1251)
This will be useful for testing instead of having to manually pass them
every time.

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-27 16:42:55 -08:00
Ashwin Bharambe
04de2f84e9
fix: register provider model name and HF alias in run.yaml (#1304)
Each model known to the system has two identifiers: 

- the `provider_resource_id` (what the provider calls it) -- e.g.,
`accounts/fireworks/models/llama-v3p1-8b-instruct`
- the `identifier` (`model_id`) under which it is registered and gets
routed to the appropriate provider.

We have so far used the HuggingFace repo alias as the standardized
identifier you can use to refer to the model. So in the above example,
we'd use `meta-llama/Llama-3.1-8B-Instruct` as the name under which it
gets registered. This makes it convenient for users to refer to these
models across providers.

However, we forgot to register the _actual_ provider model ID also. You
should be able to route via `provider_resource_id` also, of course.

This change fixes this (somewhat grave) omission.

*Note*: this change is additive -- more aliases work now compared to
before.

## Test Plan

Run the following for distro=(ollama fireworks together)
```
LLAMA_STACK_CONFIG=$distro \
   pytest -s -v tests/client-sdk/inference/test_text_inference.py \
   --inference-model=meta-llama/Llama-3.1-8B-Instruct --vision-inference-model=""
```
2025-02-27 16:39:23 -08:00
Ashwin Bharambe
c54164556a
fix: update notebooks to avoid using the nutsy --image-name __system__ thing (#1308)
The `--image-name __system__` thing was a hack and a bad one at that.
The actual intent was to somehow automatically detect the notebook
environment so we could avoid unnecessarily confusing things in the
llama stack build cmd-line. But I failed which led us to use the backup
`__system__` thing.

Let's just do the simple thing.

Note that `build_venv.sh` I haven't changed for now (so it still honors
the __system__ special name just that no new user should use it.)

## Test Plan

Open the notebooks from this branch in Colab (see example url below) and
ensure the builds work.


https://colab.research.google.com/github/meta-llama/llama-stack/blob/foo/docs/getting_started.ipynb

In the notebook, install llama-stack from this branch directly using:

```
!pip install -U https://github.com/meta-llama/llama-stack/archive/refs/heads/foo.zip
```

Verify that `!UV_SYSTEM_PYTHON=1 llama stack build --template together
--image-type venv` afterwards succeeds and the library client
initialization also works.
2025-02-27 16:39:04 -08:00
ehhuang
a34f3aafcf
fix: don't include tool args not in the function definition (#1307)
# Summary:
Right now we would include toolgroup args when we encode messages with
tool_calls, which is confusing the model since they not in the function
description (see test plan for example).

# Test Plan:
Add a print statement before raw prompt is sent to providers (no good
way to test this currently)

Before:
```
cated in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood", vector_db_ids=["829a68735d744dc3830409dcc782964a"])]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found 5 chunks:\nBEGIN of
```
Note the extra `vector_db_ids`

After
```
>user<|end_header_id|>\n\nAre the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood")]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found
```
2025-02-27 16:25:30 -08:00
Xi Yan
663c6b0537
fix: duplicate ToolResponseMessage in Turn message history (#1305)
# What does this PR do?

- Reproduce with:
https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py

- **Root cause**: when we have ToolResponseMessage as part of Turn, we
will create duplicate ToolResponseMessage in the conversation history
when getting messages from a Turn.
- Fix: avoid adding duplicate ToolResponseMessage from a turn's
input_messages.
- If it is part of a Turn's steps, only add it when processing the
steps.
   - If it is not part of a Turn's steps, add it. 

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py --inference-model meta-llama/Llama-3.1-8B-Instruct
```


```
python -m examples.agents.e2e_loop_with_client_tools localhost 8321 
```

```python
Turn(
│   input_messages=[
│   │   UserMessage(
│   │   │   content='What was the closing price of Google stock (ticker symbol GOOG) for 2023 ?',
│   │   │   role='user',
│   │   │   context=None
│   │   ),
│   │   ToolResponseMessage(
│   │   │   call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94',
│   │   │   content='"[{\\"(\'Year\', \'\')\\":2023,\\"(\'Close\', \'GOOG\')\\":140.4254302979}]"',
│   │   │   role='tool',
│   │   │   tool_name='get_ticker_data'
│   │   )
│   ],
│   output_message=CompletionMessage(
│   │   content='Note: The actual closing price for 2023 may not be available or may be different from the result obtained above. The result is based on a hypothetical call to the get_ticker_data function.',
│   │   role='assistant',
│   │   stop_reason='end_of_turn',
│   │   tool_calls=[]
│   ),
│   session_id='4c791107-f0d8-456e-a27f-aa2fdc72b871',
│   started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 412928, tzinfo=TzInfo(-08:00)),
│   steps=[
│   │   ShieldCallStep(
│   │   │   step_id='e0514587-b7d6-4bba-8609-8e05a3a46d8a',
│   │   │   step_type='shield_call',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 858382, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 425204, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[
│   │   │   │   │   ToolCall(
│   │   │   │   │   │   arguments={
│   │   │   │   │   │   │   'ticker_symbol': 'GOOG',
│   │   │   │   │   │   │   'start': '2023-01-01',
│   │   │   │   │   │   │   'end': '2023-12-31'
│   │   │   │   │   │   },
│   │   │   │   │   │   call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94',
│   │   │   │   │   │   tool_name='get_ticker_data'
│   │   │   │   │   )
│   │   │   │   ]
│   │   │   ),
│   │   │   step_id='a3ceec6a-f149-49d5-a1c2-db461e3f6e9f',
│   │   │   step_type='inference',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 910179, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 871130, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='f9339865-96ca-4425-af42-a87bab343e24',
│   │   │   step_type='shield_call',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 383013, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 944012, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   ToolExecutionStep(
│   │   │   step_id='e317b74a-c4f3-4845-99a3-7d93aa6ea6c8',
│   │   │   step_type='tool_execution',
│   │   │   tool_calls=[
│   │   │   │   ToolCall(
│   │   │   │   │   arguments={'ticker_symbol': 'GOOG', 'start': '2023-01-01', 'end': '2023-12-31'},
│   │   │   │   │   call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94',
│   │   │   │   │   tool_name='get_ticker_data'
│   │   │   │   )
│   │   │   ],
│   │   │   tool_responses=[
│   │   │   │   ToolResponse(
│   │   │   │   │   call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94',
│   │   │   │   │   content='"[{\\"(\'Year\', \'\')\\":2023,\\"(\'Close\', \'GOOG\')\\":140.4254302979}]"',
│   │   │   │   │   tool_name='get_ticker_data',
│   │   │   │   │   metadata=None
│   │   │   │   )
│   │   │   ],
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 718810, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 943792, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='c4236616-db89-4c04-ad04-f51cfb726385',
│   │   │   step_type='shield_call',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 958946, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 732680, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='Note: The actual closing price for 2023 may not be available or may be different from the result obtained above. The result is based on a hypothetical call to the get_ticker_data function.',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[]
│   │   │   ),
│   │   │   step_id='3386f896-2026-41e4-a60f-f6f3c3981cf6',
│   │   │   step_type='inference',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 74750, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 970724, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='bc57ac8c-f94e-4758-bf1a-0dd734eca1cf',
│   │   │   step_type='shield_call',
│   │   │   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 443016, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 86726, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   )
│   ],
│   turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc',
│   completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 459456, tzinfo=TzInfo(-08:00)),
│   output_attachments=[]
)
```

```python
Turn(
│   input_messages=[
│   │   UserMessage(content='What is 40+30?', role='user', context=None),
│   │   ToolResponseMessage(
│   │   │   call_id='8e54aca9-244d-44ca-ada0-0365090e8622',
│   │   │   content='{"success": true, "result": 70.0}',
│   │   │   role='tool',
│   │   │   tool_name='calculator'
│   │   )
│   ],
│   output_message=CompletionMessage(
│   │   content='The result of the calculation is 70.',
│   │   role='assistant',
│   │   stop_reason='end_of_turn',
│   │   tool_calls=[]
│   ),
│   session_id='4c791107-f0d8-456e-a27f-aa2fdc72b871',
│   started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 156903, tzinfo=TzInfo(-08:00)),
│   steps=[
│   │   ShieldCallStep(
│   │   │   step_id='17b6b645-31cc-4be9-a758-a4f3b741ced9',
│   │   │   step_type='shield_call',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 780564, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 174515, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[
│   │   │   │   │   ToolCall(
│   │   │   │   │   │   arguments={'x': 40.0, 'y': 30.0, 'operation': 'add'},
│   │   │   │   │   │   call_id='8e54aca9-244d-44ca-ada0-0365090e8622',
│   │   │   │   │   │   tool_name='calculator'
│   │   │   │   │   )
│   │   │   │   ]
│   │   │   ),
│   │   │   step_id='f59e951a-2b75-497d-a075-ec9aad9aad12',
│   │   │   step_type='inference',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 141869, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 792047, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='efafa0cf-23b9-4a90-8350-3a186d80925d',
│   │   │   step_type='shield_call',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 766293, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 177473, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   ToolExecutionStep(
│   │   │   step_id='877cfbe7-57a8-4056-9c29-49aa38dd337c',
│   │   │   step_type='tool_execution',
│   │   │   tool_calls=[
│   │   │   │   ToolCall(
│   │   │   │   │   arguments={'x': 40.0, 'y': 30.0, 'operation': 'add'},
│   │   │   │   │   call_id='8e54aca9-244d-44ca-ada0-0365090e8622',
│   │   │   │   │   tool_name='calculator'
│   │   │   │   )
│   │   │   ],
│   │   │   tool_responses=[
│   │   │   │   ToolResponse(
│   │   │   │   │   call_id='8e54aca9-244d-44ca-ada0-0365090e8622',
│   │   │   │   │   content='{"success": true, "result": 70.0}',
│   │   │   │   │   tool_name='calculator',
│   │   │   │   │   metadata=None
│   │   │   │   )
│   │   │   ],
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 930899, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 177202, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='d47c6160-45d9-47c1-8e39-2faae65ee468',
│   │   │   step_type='shield_call',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 3, 510402, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 949433, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   ),
│   │   InferenceStep(
│   │   │   api_model_response=CompletionMessage(
│   │   │   │   content='The result of the calculation is 70.',
│   │   │   │   role='assistant',
│   │   │   │   stop_reason='end_of_turn',
│   │   │   │   tool_calls=[]
│   │   │   ),
│   │   │   step_id='660ba1cc-770e-471c-bf6e-11e103d74443',
│   │   │   step_type='inference',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 4, 814944, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 3, 521309, tzinfo=TzInfo(-08:00))
│   │   ),
│   │   ShieldCallStep(
│   │   │   step_id='4dab8bb0-7d38-4465-ae1a-10069de2b3d1',
│   │   │   step_type='shield_call',
│   │   │   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   │   │   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 5, 428561, tzinfo=TzInfo(-08:00)),
│   │   │   started_at=datetime.datetime(2025, 2, 27, 14, 0, 4, 825970, tzinfo=TzInfo(-08:00)),
│   │   │   violation=None
│   │   )
│   ],
│   turn_id='4daff286-f703-417e-a5dc-0e158582bbec',
│   completed_at=datetime.datetime(2025, 2, 27, 14, 0, 5, 462823, tzinfo=TzInfo(-08:00)),
│   output_attachments=[]
)
```


[//]: # (## Documentation)
2025-02-27 15:06:47 -08:00
Ashwin Bharambe
6e8dfa727d fix: precommits ugh why wont they run correctly because they dont have the right dependencies 2025-02-27 15:02:04 -08:00
Ashwin Bharambe
4780223544 fix: groq now depends on litellm 2025-02-27 14:07:12 -08:00
Ashwin Bharambe
928a39d17b
feat(providers): Groq now uses LiteLLM openai-compat (#1303)
Groq has never supported raw completions anyhow. So this makes it easier
to switch it to LiteLLM. All our test suite passes.

I also updated all the openai-compat providers so they work with api
keys passed from headers. `provider_data`

## Test Plan

```bash
LLAMA_STACK_CONFIG=groq \
   pytest -s -v tests/client-sdk/inference/test_text_inference.py \
   --inference-model=groq/llama-3.3-70b-versatile --vision-inference-model=""
```

Also tested (openai, anthropic, gemini) providers. No regressions.
2025-02-27 13:16:50 -08:00
Xi Yan
564f0e5f93
fix: Revert "chore: remove vector_db_id from AgentSessionInfo" (#1299)
Reverts meta-llama/llama-stack#1296

This change breaks test: `session_info.vector_db_id` is actually used
```
pytest -v tests/client-sdk/agents/test_agents.py::test_rag_and_code_agent --inference-model meta-llama/Llama-3.1-8B-Instruct
```
2025-02-27 10:37:15 -08:00
Xi Yan
200ef29233
chore: remove vector_db_id from AgentSessionInfo (#1296)
# What does this PR do?

- It is not being used anywhere and doesn't make sense to have 1 single
vector_db_id in an agent session. No top level API change.
- See
https://github.com/meta-llama/llama-stack/pull/1286#discussion_r1972569881

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

- See
https://github.com/meta-llama/llama-stack/pull/1286#discussion_r1972569881

[//]: # (## Documentation)
2025-02-27 10:13:10 -08:00
Ashwin Bharambe
981fc3c93c
fix(test): no need to specify tool prompt format explicitly in tests (#1295)
# What does this PR do?

No need to have complex tool prompt format related machinery in the
tests.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```bash
LLAMA_STACK_CONFIG=ollama pytest -s -v tests/client-sdk/inference/test_text_inference.py --inference-model=meta-llama/Llama-3.2-3B-Instruct --vision-inference-model=""
```

[//]: # (## Documentation)
2025-02-27 10:09:57 -08:00
Xi Yan
fc5aff3ccf
feat: ability to retrieve agents session, turn, step by ids (#1286)
# What does this PR do?

- Fix up rotten implementation for retrieving agent's Session, Turn,
Step with actual working implementation.

- Update `getting_started` notebook with retrieving by agent session_id.
https://github.com/meta-llama/llama-stack/blob/export_agent_dataset/docs/getting_started.ipynb

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Test with script:
https://gist.github.com/yanxi0830/657cecee8f1f0e39d322963d9c0f598e

<img width="503" alt="image"
src="https://github.com/user-attachments/assets/5ea9bc33-83d1-40bc-98e1-b68393158387"
/>


[//]: # (## Documentation)
2025-02-27 09:45:14 -08:00
ehhuang
0762c61402
feat: don't silently ignore incorrect toolgroup (#1285) 2025-02-27 08:11:09 -05:00
Matthew Farrellee
99b6925ad8
feat: add nemo retriever text embedding models to nvidia inference provider (#1218)
# What does this PR do?

add the NeMo Retriever Embedding models from
https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/support-matrix.html
2025-02-26 21:18:34 -08:00
Ashwin Bharambe
23b65b6cee
fix(test): update client-sdk tests to handle tool format parametrization better (#1287)
# What does this PR do?

Tool format depends on the model. @ehhuang introduced a
`get_default_tool_prompt_format` function for this purpose. We should
use that instead of hacky model ID matching we had before.

Secondly, non llama models don't have this concept so testing with those
models should work as is.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```bash
for distro in fireworks ollama; do
  LLAMA_STACK_CONFIG=$distro \
    pytest -s -v tests/client-sdk/inference/test_text_inference.py \
       --inference-model=meta-llama/Llama-3.2-3B-Instruct \
       --vision-inference-model=""
done

LLAMA_STACK_CONFIG=dev \
   pytest -s -v tests/client-sdk/inference/test_text_inference.py \
       --inference-model=openai/gpt-4o \
       --vision-inference-model=""

```

[//]: # (## Documentation)
2025-02-26 21:16:00 -08:00
Shrey
30ef1c3680
feat: Add model context protocol tools with ollama provider (#1283)
# What does this PR do?
Model context protocol (MCP) allows for remote tools to be connected
with Agents. The current Ollama provider does not support it. This PR
adds necessary code changes to ensure that the integration between
Ollama backend and MCP works.

This PR is an extension of #816 for Ollama. 

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

1. Run llama-stack server with the command:
```
llama stack build --template ollama --image-type conda
llama stack run ./templates/ollama/run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://localhost:11434
```

2. Run the sample client agent with MCP tool:
```
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.types.shared_params.url import URL
from llama_stack_client import LlamaStackClient
from termcolor import cprint

## Start the local MCP server
# git clone https://github.com/modelcontextprotocol/python-sdk
# Follow instructions to get the env ready
# cd examples/servers/simple-tool
# uv run mcp-simple-tool --transport sse --port 8000

# Connect to the llama stack server
base_url="http://localhost:8321"
model_id="meta-llama/Llama-3.2-3B-Instruct"
client = LlamaStackClient(base_url=base_url)


# Register MCP tools
client.toolgroups.register(
    toolgroup_id="mcp::filesystem",
    provider_id="model-context-protocol",
    mcp_endpoint=URL(uri="http://localhost:8000/sse"))

# Define an agent with MCP toolgroup 
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    toolgroups=["mcp::filesystem"],
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Fetch content from https://www.google.com and print the response"
]

# Run a session with the agent
session_id = agent.create_session("test-session")
for prompt in user_prompts:
    cprint(f"User> {prompt}", "green")
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()
```
# Documentation
The file docs/source/distributions/self_hosted_distro/ollama.md is
updated to indicate the MCP tool runtime availability.

Signed-off-by: Shreyanand <shanand@redhat.com>
2025-02-26 15:38:18 -08:00
Ihar Hrachyshka
2250ab7274
fix: don't attempt to clean gpu memory up when device is cpu (#1191)
This is a follow up to:
https://github.com/meta-llama/llama-stack/pull/1140

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>

# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

Avoid unnecessary GPU memory clean attempt when the GPU is not used for
training.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

With CPU:

```
INFO 2025-02-26 16:43:56,267 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
INFO 2025-02-26 16:43:56,274 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
model_file_path /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0
```

With CUDA:

```
INFO 2025-02-26 21:39:24,314 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
INFO 2025-02-26 21:39:24,333 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
model_file_path /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0
```

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-02-26 15:12:11 -08:00
Ashwin Bharambe
21c547aa21
chore: upgrade uv pre-commit version, uv-sync -> uv-lock (#1284)
See
https://github.com/astral-sh/uv-pre-commit/blob/main/.pre-commit-hooks.yaml#L31-L40

`uv-sync` is supposed to be used on "post-checkout, post-rebase" etc.
The intention here is decidedly not that. The desire is if you changed
pyproject.toml, you should update `uv.lock` (and then also export it to
requirements.txt via uv-export).
2025-02-26 14:57:48 -08:00
ehhuang
270d64007a
fix: sqlite conn (#1282)
# Summary:
Our tests sometimes error out with
```
========================== 11 passed, 342 warnings in 58.86s ==========================
Error exporting span to SQLite: Cannot operate on a closed database.
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x000000012af04280)

Current thread 0x00000001fa29c240 (most recent call first):
  <no Python frame>
```
Usually able to repro this by running 10 times.

The proposed fix is to use threadsafe var for creating sqlite connection
to ensure connection is only used by one thread. Not 100% if this is the
fix, but am not able to repro with this.

# Test Plan:
Run 10 times and saw no more errors
```
for i in {1..10}; do
  echo "=== Starting Run $i ==="
  LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B
  if [[ $? -ne 0 ]]; then
    echo "=== Run $i FAILED with exit code $? ==="
    break
  else
    echo "=== Run $i PASSED ==="
  fi
  echo
done
```
2025-02-26 14:44:31 -08:00
ehhuang
c8a20b8ed0
feat: allow specifying specific tool within toolgroup (#1239)
Summary:

E.g. `builtin::rag::knowledge_search`

Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/ --safety-shield meta-llama/Llama-Guard-3-8B
```
2025-02-26 14:07:05 -08:00
Ashwin Bharambe
657efc67bc fix: bump up registry key version to clear off stale entries in dbs 2025-02-26 13:58:18 -08:00
Ashwin Bharambe
3f0b8c25aa fix: run uv-sync manually. locally pre-commit is not triggering 2025-02-26 13:54:08 -08:00
ehhuang
fca84db5b0
fix: time logging format (#1281)
Summary:
missed in last PR

Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py::test_create_turn_response --safety-shield meta-llama/Llama-Guard-3-8B
```
2025-02-26 13:51:33 -08:00
Ashwin Bharambe
6b075e5075 feat: automatically update documentation version based on pyproject.toml source of truth 2025-02-26 13:42:12 -08:00
Botao Chen
9a3db9a290
feat: update the post training notebook (#1280)
##  What does this PR do?
- add 'open in colab' icon that links to the notebook
- update the pip install llama-stack pkg part

## test
preview
 
<img width="938" alt="Screenshot 2025-02-26 at 1 25 34 PM"
src="https://github.com/user-attachments/assets/951b7f0f-a15e-4618-ad02-07c77c65a5ad"
/>

<img width="934" alt="Screenshot 2025-02-26 at 1 25 38 PM"
src="https://github.com/user-attachments/assets/de872530-84b9-4f8b-ae93-06aa7d2e5bd8"
/>
2025-02-26 13:39:16 -08:00
ehhuang
bb2690f176
feat: remove special handling of builtin::rag tool (#1015)
Summary:

Lets the model decide which tool it needs to call to respond to a query.

Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/ --safety-shield meta-llama/Llama-Guard-3-8B
```

Also evaluated on a small benchmark with 20 questions from HotpotQA.
With this PR and some prompting, the performance is 77% recall compared
to 50% currently.

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1015).
* #1268
* #1239
* __->__ #1015
2025-02-26 13:04:52 -08:00
Ben Browning
c64f0d5888
fix: Get builtin tool calling working in remote-vllm (#1236)
# What does this PR do?

This PR makes a couple of changes required to get the test
`tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search`
passing on the remote-vllm provider.

First, we adjust agent_instance to also pass in the description and
parameters of builtin tools. We need these parameters so we can pass the
tool's expected parameters into vLLM. The meta-reference implementations
may not have needed these for builtin tools, as they are able to take
advantage of the Llama-model specific support for certain builtin tools.
However, with vLLM, our server-side chat templates for tool calling
treat all tools the same and don't separate out Llama builtin vs custom
tools. So, we need to pass the full set of parameter definitions and
list of required parameters for builtin tools as well.

Next, we adjust the vllm streaming chat completion code to fix up some
edge cases where it was returning an extra ChatCompletionResponseEvent
with an empty ToolCall with empty string call_id, tool_name, and
arguments properties. This is a bug discovered after the above fix,
where after a successful tool invocation we were sending extra chunks
back to the client with these empty ToolCalls.

## Test Plan

With these changes, the following test that previously failed now
passes:

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v \
tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search \
--inference-model "meta-llama/Llama-3.2-3B-Instruct"
```

Additionally, I ran the remote-vllm client-sdk and provider inference
tests as below to ensure they all still passed with this change:

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v \
tests/client-sdk/inference/test_text_inference.py \
--inference-model "meta-llama/Llama-3.2-3B-Instruct"
```

```
VLLM_URL="http://localhost:8000/v1" \
python -m pytest -s -v \
llama_stack/providers/tests/inference/test_text_inference.py \
--providers "inference=vllm_remote"
```


[//]: # (## Documentation)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-02-26 15:25:47 -05:00
Yuan Tang
2ed2c0bd26
fix(cli): Missing default for --image-type in stack run command (#1274)
# What does this PR do?

I think this got accidentally removed as part of
https://github.com/meta-llama/llama-stack/pull/1250. cc @leseb

## Test Plan

After the change, this arg is no longer required.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-26 12:23:44 -08:00