# What does this PR do?
The agent API allows querying multiple vector DBs via the `vector_db_ids`
argument of the `rag` tool:
```py
toolgroups=[
{
"name": "builtin::rag",
"args": {"vector_db_ids": [vector_db_id]},
}
],
```
This means that multiple DBs can be used to compose an aggregated
context by executing the query on each of them.
When documents are passed to the next agent turn, there is no explicit
way to configure the vector DB where the embeddings will be ingested. In
such cases, we assume the following (see the sketch after this list):
- if `vector_db_ids` is given, we use the first one (it probably makes
sense to assume it is the only one in the list; otherwise we would have
to loop over all the given DBs to keep ingestion consistent)
- if no `vector_db_ids` is given, we keep the current logic and generate
a default DB using the default provider; if multiple providers are
defined, the API fails as expected and the user has to specify where to
ingest the documents
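A minimal sketch of the selection logic described above, with an illustrative helper name (not the actual agent code):
```python
def select_vector_db_for_ingestion(vector_db_ids: list[str] | None) -> str | None:
    """Pick the vector DB where attached documents get ingested."""
    if vector_db_ids:
        # Use the first configured DB (assumed to be the only one).
        return vector_db_ids[0]
    # Fall back to the existing behavior: create a default DB with the
    # default provider, which fails if multiple providers are defined.
    return None
```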
(Closes #1270)
## Test Plan
The issue description details how to replicate the problem.
[//]: # (## Documentation)
---------
Signed-off-by: Daniele Martinoli <dmartino@redhat.com>
All of the tests from `llama_stack/providers/tests/` are now moved to
`tests/integration`.
I converted the `tools`, `scoring` and `datasetio` tests to use the API.
However, `eval` and `post_training` proved to be a bit challenging to
convert, so I am leaving those for now. I think `post_training` should be
relatively straightforward as well.
As part of this, I noticed that the `wolfram_alpha` tool wasn't added to
some of our commonly used distros, so I added it. I am going to remove a
lot of code duplication from distros next, so while this looks like a
one-off right now, it will go away and be there uniformly for all
distros.
Summary:
Test Plan:
added new test:
`LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/api/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B`
# What does this PR do?
- This was missed from previous deprecation:
https://github.com/meta-llama/llama-stack/pull/1186
- Part of https://github.com/meta-llama/llama-stack/issues/1396
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
pytest -v -s --nbval-lax ./llama-stack/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
[//]: # (## Documentation)
# What does this PR do?
- Deprecate the `allow_turn_resume` flag, which was only used to stay
backward compatible.
- Closes https://github.com/meta-llama/llama-stack/issues/1363
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/api/agents/test_agents.py --inference-model "meta-llama/Llama-3.3-70B-Instruct" --record-responses
```
<img width="1054" alt="image"
src="https://github.com/user-attachments/assets/d31de2d4-0953-41e1-a71a-7e1579fa351a"
/>
[//]: # (## Documentation)
Continues the refactor of tests.
Tests from `providers/tests` should be considered deprecated. For this
PR, I deleted most of the tests in
- inference
- safety
- agents
since much more comprehensive tests exist in
`tests/integration/{inference,safety,agents}` already.
I moved `test_persistence.py` from agents, but disabled all the tests
since that test needs to be properly migrated.
## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents --vision-inference-model=''
/Users/ashwin/homebrew/Caskroom/miniconda/base/envs/toolchain/lib/python3.10/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
======================================================================================================= test session starts ========================================================================================================
platform darwin -- Python 3.10.16, pytest-8.3.3, pluggy-1.5.0 -- /Users/ashwin/homebrew/Caskroom/miniconda/base/envs/toolchain/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-15.3.1-arm64-arm-64bit', 'Packages': {'pytest': '8.3.3', 'pluggy': '1.5.0'}, 'Plugins': {'asyncio': '0.24.0', 'html': '4.1.1', 'metadata': '3.1.1', 'anyio': '4.8.0', 'nbval': '0.11.0'}}
rootdir: /Users/ashwin/local/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.24.0, html-4.1.1, metadata-3.1.1, anyio-4.8.0, nbval-0.11.0
asyncio: mode=strict, default_loop_scope=None
collected 15 items
agents/test_agents.py::test_agent_simple[txt=8B] PASSED
agents/test_agents.py::test_tool_config[txt=8B] PASSED
agents/test_agents.py::test_builtin_tool_web_search[txt=8B] PASSED
agents/test_agents.py::test_builtin_tool_code_execution[txt=8B] PASSED
agents/test_agents.py::test_code_interpreter_for_attachments[txt=8B] PASSED
agents/test_agents.py::test_custom_tool[txt=8B] PASSED
agents/test_agents.py::test_custom_tool_infinite_loop[txt=8B] PASSED
agents/test_agents.py::test_tool_choice[txt=8B] PASSED
agents/test_agents.py::test_rag_agent[txt=8B-builtin::rag/knowledge_search] PASSED
agents/test_agents.py::test_rag_agent[txt=8B-builtin::rag] PASSED
agents/test_agents.py::test_rag_agent_with_attachments[txt=8B] PASSED
agents/test_agents.py::test_rag_and_code_agent[txt=8B] PASSED
agents/test_agents.py::test_create_turn_response[txt=8B] PASSED
agents/test_persistence.py::test_delete_agents_and_sessions SKIPPED (This test needs to be migrated to api / client-sdk world)
agents/test_persistence.py::test_get_agent_turns_and_steps SKIPPED (This test needs to be migrated to api / client-sdk world)
```
# What does this PR do?
Fix SQL syntax errors caused by hyphens in Vector DB IDs by sanitizing
table names.
(Closes #1332)
## Test Plan
Test confirms table names with hyphens are properly converted to
underscores
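A minimal sketch of the kind of sanitization being tested, assuming a simple regex-based helper (the name and exact rule in the provider may differ):
```python
import re

def sanitize_table_name(vector_db_id: str) -> str:
    # "my-vector-db" -> "my_vector_db": hyphens (and any other characters
    # that are not valid in a SQL identifier) become underscores.
    return re.sub(r"[^a-zA-Z0-9_]", "_", vector_db_id)
```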
Summary:
1. The `tools` parameter we construct to pass to the inference API is
non-deterministic. As a result, our recordable mocks are flaky because the
ordering sometimes changes. This PR makes the `tools` ordering
deterministic and aligned with the order the user specified (see the
sketch after this list).
2. In recordable mock key generation, the client tool's parameter type
used to be 'str' and is now 'string' for some reason. I didn't dig into
exactly why, but just regenerated the fixtures.
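A minimal sketch of what deterministic, user-aligned ordering means in practice (names are illustrative, not the actual implementation):
```python
def order_tools(requested: list[str], available: dict[str, dict]) -> list[dict]:
    # Build the tools list in the exact order the user specified instead of
    # iterating over an unordered collection, so recorded mock keys stay stable.
    return [available[name] for name in requested if name in available]
```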
Test Plan:
Regenerate mocks:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --record-responses
```
Rerun tests without --record-responses:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B
```
Move unittests to tests/unittests. Gradually nuking tests from
providers/tests/ and unifying them into tests/api (which are e2e tests
using SDK types)
## Test Plan
`pytest -s -v tests/unittests/`
# What does this PR do?
We currently use `max_infer_iters` in 2 different ways:
1/ Server: track the number of times we perform inference
2/ Client side: track the number of times we send a `resume_turn` request
This PR gets rid of the need for (2) and makes the server track the total
number of times we perform inference within a Turn (a rough sketch follows
the note below).
**NOTE**
The PR assumes StopReason is set to
- end_of_message: the turn is not finished; we could be waiting for client
tool call responses
- end_of_turn: the entire turn is finished and there is nothing more to
be done.
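A rough sketch of the server-side loop this implies, with hypothetical helpers (`run_inference`, `handle_tool_calls`) standing in for the real agent code:
```python
def run_turn(request, max_infer_iters: int) -> None:
    n_iter = 0
    while n_iter < max_infer_iters:
        message = run_inference(request)   # hypothetical helper
        n_iter += 1                        # counted per inference call within the Turn
        if message.stop_reason == "end_of_turn":
            break                          # nothing more to be done
        # end_of_message: the turn is not finished; execute server-side tool
        # calls (or hand them back to the client) and loop for more inference.
        handle_tool_calls(request, message)  # hypothetical helper
```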
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py::test_custom_tool_infinite_loop --inference-model "meta-llama/Llama-3.3-70B-Instruct"
```
[//]: # (## Documentation)
A self-respecting server needs good observability, which starts with
configurable logging. Llama Stack had very little of it until now. This PR
adds a `logcat` facility towards that end. Call sites look like:
```python
logcat.debug("inference", f"params to ollama: {params}")
```
- the first parameter is a category; there is a static list of
categories in `llama_stack/logcat.py`
- each category can be associated with a log level, which can be
configured via the `LLAMA_STACK_LOGGING` env var
- a value like `LLAMA_STACK_LOGGING="inference=debug;server=info"` does
the obvious thing; there is a special key called `all` which is an alias
for all categories (see the parsing sketch below)
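A minimal sketch of how such an env var could be parsed; the category list and function name are illustrative, not the actual contents of `llama_stack/logcat.py`:
```python
import os

CATEGORIES = ["core", "server", "inference", "agents", "tools"]  # illustrative subset

def parse_logging_env() -> dict[str, str]:
    levels: dict[str, str] = {}
    for entry in filter(None, os.environ.get("LLAMA_STACK_LOGGING", "").split(";")):
        category, _, level = entry.partition("=")
        if category == "all":        # "all" is an alias for every category
            levels.update(dict.fromkeys(CATEGORIES, level))
        elif category in CATEGORIES:
            levels[category] = level
    return levels

# LLAMA_STACK_LOGGING="inference=debug;server=info"
# -> {"inference": "debug", "server": "info"}
```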
## Test Plan
Ran with `LLAMA_STACK_LOGGING="all=debug" llama stack run fireworks` and
saw the following:
[screenshot of the categorized debug log output]
Hit it with a client-sdk test case and saw this:
[screenshot of the resulting request logs]
# What does this PR do?
We want to bundle a bunch of (typically remote) providers in a distro
template and be able to configure them "on the fly" via environment
variables. So far, we have been able to do this with simple env var
replacements. However, sometimes you want to enable providers only
conditionally (because the relevant remote services may not be running,
or may not be relevant). This was not possible until now.
To aid this, we add a simple (bash-like) env var replacement
enhancement: `${env.FOO+bar}` evaluates to `bar` if the variable is SET
and evaluates to empty string if it is not. On top of that, we update
our main resolver to ignore any provider whose ID is null.
This allows using the distro like this:
```bash
llama stack run dev --env CHROMADB_URL=http://localhost:6001 --env ENABLE_CHROMADB=1
```
when only Chroma is UP. This disables the other `pgvector` provider in
the run configuration.
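A minimal sketch of the substitution rule, assuming a regex-based replacement (the actual resolver code may differ):
```python
import os
import re

def replace_env_vars(value: str) -> str:
    # ${env.FOO+bar} -> "bar" when FOO is set in the environment, "" otherwise
    def repl(match: re.Match) -> str:
        name, replacement = match.group(1), match.group(2)
        return replacement if name in os.environ else ""
    return re.sub(r"\$\{env\.(\w+)\+([^}]*)\}", repl, value)

# With ENABLE_CHROMADB unset, a provider id written as
# "${env.ENABLE_CHROMADB+chromadb}" resolves to "", and the resolver
# then drops that provider from the run configuration.
```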
## Test Plan
Hard code `chromadb` as the vector io provider inside
`test_vector_io.py` and run:
```bash
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -s -v tests/client-sdk/vector_io/ --embedding-model all-MiniLM-L6-v2
```
# What does this PR do?
- using `eval` is a security risk
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
- see https://github.com/meta-llama/llama-stack/pull/1327
cc @SLR722 we will need to update the corresponding dataset via
```python
import json

import datasets

def update_to_json_str():
    dataset = datasets.load_dataset(...)
    processed_dataset = dataset[split].map(
        lambda x: {
            # re-serialize the Python-literal string column as proper JSON
            "column": json.dumps(eval(x["column"]))
        }
    )
    processed_dataset.push_to_hub(...)
```
[//]: # (## Documentation)
# What does this PR do?
- Using `eval` on server is a security risk
- Replace `eval` with `json.loads`
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
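An illustrative before/after of the change (not the actual call site):
```python
import json

row = {"column": '["a", "b", "c"]'}

# before: evaluates arbitrary Python expressions coming from stored data
value = eval(row["column"])       # security risk

# after: only parses JSON, which is why the dataset columns are
# re-serialized with json.dumps in the companion PR above
value = json.loads(row["column"])
```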
## Test Plan
```
pytest -v -s --nbval-lax ./llama-stack/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
<img width="747" alt="image"
src="https://github.com/user-attachments/assets/7aff3d95-0b12-4394-b9d0-aeff791eee38"
/>
[//]: # (## Documentation)
# What does this PR do?
This is to be consistent with the OpenAI API and to support vLLM <= v0.6.3.
References:
* https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice
* https://github.com/vllm-project/vllm/pull/10000
This fixes the error when running older versions of vLLM:
```
00:50:19.834 [START] /v1/inference/chat-completion
INFO 2025-02-28 00:50:20,203 httpx:1025: HTTP Request: POST https://api-xeai-granite-3-1-8b-instruct.apps.int.stc.ai.preprod.us-east-1.aws.paas.redhat.com/v1/chat/completions "HTTP/1.1 400 Bad Request"
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 235, in endpoint
return await maybe_await(value)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 201, in maybe_await
return await value
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 89, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 193, in chat_completion
return await provider.chat_completion(**params)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 89, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 286, in chat_completion
return await self._nonstream_chat_completion(request, self.client)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 292, in _nonstream_chat_completion
r = client.chat.completions.create(**params)
File "/usr/local/lib/python3.10/site-packages/openai/_utils/_utils.py", line 279, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 879, in create
return self._post(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1290, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 967, in request
return self._request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1071, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "[{'type': 'value_error', 'loc': ('body',), 'msg': 'Value error, When using `tool_choice`, `tools` must be set.', 'input': {'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'What model are you?'}]}], 'model': 'granite-3-1-8b-instruct', 'max_tokens': 4096, 'stream': False, 'temperature': 0.0, 'tools': None, 'tool_choice': 'auto'}, 'ctx': {'error': ValueError('When using `tool_choice`, `tools` must be set.')}}]", 'type': 'BadRequestError', 'param': None, 'code': 400}
INFO: 2600:1700:9d20:ac0::49:59736 - "POST /v1/inference/chat-completion HTTP/1.1" 500 Internal Server Error
00:50:20.266 [END] /v1/inference/chat-completion [StatusCode.OK] (431.99ms)
```
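The gist of the fix, sketched with an illustrative helper (the real change lives in the vLLM provider's request conversion):
```python
def build_chat_params(request) -> dict:
    params = {
        "model": request.model,
        "messages": request.messages,
    }
    # Older vLLM (like the OpenAI API) rejects tool_choice when tools is unset,
    # so only include the tool-related fields when tools are actually provided.
    if request.tools:
        params["tools"] = request.tools
        params["tool_choice"] = request.tool_choice or "auto"
    return params
```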
## Test Plan
All existing tests pass.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Original telemetry outputs for agent turns look like this.
Note how the output was a `str(message)`, making it difficult to read back
for downstream tasks (e.g., building eval datasets):
```
{
│ │ 'input': [
│ │ │ '{"role":"system","content":"You are a helpful assistant. Use search tool to answer the questions. "}',
│ │ │ '{"role":"user","content":"Which teams played in the NBA western conference finals of 2024","context":null}'
│ │ ],
│ │ 'output': "content: tool_calls: [ToolCall(call_id='8b7294ec-a83f-4798-ad8f-6bed662f08b6', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'NBA Western Conference Finals 2024 teams'})]"
│ },
```
Updated the outputs to be structured.
## Test
```python
import uuid
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
model_id = "meta-llama/Llama-3.1-8B-Instruct"
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant who will use the web search tools to help with answering questions.\nOnly provide final answer in short without writing full sentences. Use web search",
    toolgroups=["builtin::websearch"],
    enable_session_persistence=True,
)
agent = Agent(client, agent_config)
session_id = agent.create_session(uuid.uuid4().hex)
response = agent.create_turn(
    messages=[
        {
            "role": "user",
            "content": "latest news about llama stack",
        }
    ],
    session_id=session_id,
    stream=False,
)
pprint(response)
```
Output:
```
Turn(
│ input_messages=[UserMessage(content='latest news about llama stack', role='user', context=None)],
│ output_message=CompletionMessage(
│ │ content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│ │ role='assistant',
│ │ stop_reason='end_of_turn',
│ │ tool_calls=[]
│ ),
│ session_id='77379546-4598-485a-b4f4-84e5da28c513',
│ started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915243, tzinfo=TzInfo(-08:00)),
│ steps=[
│ │ InferenceStep(
│ │ │ api_model_response=CompletionMessage(
│ │ │ │ content='',
│ │ │ │ role='assistant',
│ │ │ │ stop_reason='end_of_turn',
│ │ │ │ tool_calls=[
│ │ │ │ │ ToolCall(
│ │ │ │ │ │ arguments={'query': 'latest news llama stack'},
│ │ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│ │ │ │ │ │ tool_name='brave_search'
│ │ │ │ │ )
│ │ │ │ ]
│ │ │ ),
│ │ │ step_id='81c16bd3-eb00-4721-8edc-f386e07391a3',
│ │ │ step_type='inference',
│ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 637149, tzinfo=TzInfo(-08:00)),
│ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915831, tzinfo=TzInfo(-08:00))
│ │ ),
│ │ ToolExecutionStep(
│ │ │ step_id='4782d609-a62e-45f5-8d2a-25a43db46288',
│ │ │ step_type='tool_execution',
│ │ │ tool_calls=[
│ │ │ │ ToolCall(
│ │ │ │ │ arguments={'query': 'latest news llama stack'},
│ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│ │ │ │ │ tool_name='brave_search'
│ │ │ │ )
│ │ │ ],
│ │ │ tool_responses=[
│ │ │ │ ToolResponse(
│ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│ │ │ │ │ content='{"query": "latest news llama stack", "top_k": [{"title": "Llama 3.2: Revol. ....... Hacker News.", "score": 0.6186197, "raw_content": null}]}',
│ │ │ │ │ tool_name='brave_search',
│ │ │ │ │ metadata=None
│ │ │ │ )
│ │ │ ],
│ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 272176, tzinfo=TzInfo(-08:00)),
│ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 640743, tzinfo=TzInfo(-08:00))
│ │ ),
│ │ InferenceStep(
│ │ │ api_model_response=CompletionMessage(
│ │ │ │ content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│ │ │ │ role='assistant',
│ │ │ │ stop_reason='end_of_turn',
│ │ │ │ tool_calls=[]
│ │ │ ),
│ │ │ step_id='37994419-5da3-4e84-a010-8d9b85366262',
│ │ │ step_type='inference',
│ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 961275, tzinfo=TzInfo(-08:00)),
│ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 273168, tzinfo=TzInfo(-08:00))
│ │ )
│ ],
│ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 962318, tzinfo=TzInfo(-08:00)),
│ output_attachments=[]
)
```
## Check for Telemetry
```python
agent_logs = []
for span in client.telemetry.query_spans(
    attribute_filters=[
        {"key": "session_id", "op": "eq", "value": session_id},
    ],
    attributes_to_return=['input', 'output'],
):
    agent_logs.append(span.attributes)
pprint(json.loads(agent_logs[-1]['output']))
```
```
{
│ 'content': "The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│ 'tool_calls': []
}
```
`ChatCompletionResponseEventType: start` is ignored and not yielded in
the agent_instance, as we expect it to not have any content. However,
litellm sends the first event as `ChatCompletionResponseEventType: start`
with content (which was the first token, and we were skipping it).
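A minimal sketch of the handling this implies; the event/field names follow the description above and are otherwise illustrative:
```python
def iter_content_deltas(events):
    for event in events:
        # Previously the "start" event was dropped outright; with litellm the
        # first token can arrive on it, so only skip it when it has no content.
        if event.event_type == "start" and not event.delta:
            continue
        yield event.delta
```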
```
LLAMA_STACK_CONFIG=dev pytest -s -v tests/client-sdk/agents/test_agents.py --inference-model "openai/gpt-4o-mini" -k test_agent_simple
```
This was failing before (since the word "hello" was not in the final
response).
# What does this PR do?
Updates the NVIDIA inference provider's embedding implementation to use
the new signature and adds support for the `task_type`,
`output_dimensions`, and `text_truncation` parameters.
## Test Plan
`LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/inference/test_embedding.py --embedding-model baai/bge-m3`
Now that remote-vllm includes inline::sentence_transformers, there is an
issue building the image:
`Error building stack: SentenceTransformersInferenceConfig.sample_run_config() got an unexpected keyword argument '__distro_dir__'`
To avoid that issue, this fix extends `sample_run_config` to accept extra
kwargs (sketched below).
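A minimal sketch of the extension, assuming the existing classmethod shape (return value and details are illustrative):
```python
from pydantic import BaseModel

class SentenceTransformersInferenceConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, **kwargs) -> dict:
        # Accept (and ignore) extra build-time kwargs such as __distro_dir__
        # so distro templates that pass them no longer break the image build.
        return {}
```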
Each model known to the system has two identifiers:
- the `provider_resource_id` (what the provider calls it) -- e.g.,
`accounts/fireworks/models/llama-v3p1-8b-instruct`
- the `identifier` (`model_id`) under which it is registered and gets
routed to the appropriate provider.
We have so far used the HuggingFace repo alias as the standardized
identifier you can use to refer to the model. So in the above example,
we'd use `meta-llama/Llama-3.1-8B-Instruct` as the name under which it
gets registered. This makes it convenient for users to refer to these
models across providers.
However, we forgot to also register the _actual_ provider model ID; you
should, of course, be able to route via `provider_resource_id` as well.
This change fixes this (somewhat grave) omission.
*Note*: this change is additive -- more aliases work now compared to
before.
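A sketch of what the additive alias registration amounts to (entry structure and values are illustrative):
```python
# Both names now resolve to the same provider model:
MODEL_ALIASES = [
    {
        "identifier": "meta-llama/Llama-3.1-8B-Instruct",  # HuggingFace repo alias
        "provider_resource_id": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    },
    {
        # the provider's own model ID is now registered as an alias too
        "identifier": "accounts/fireworks/models/llama-v3p1-8b-instruct",
        "provider_resource_id": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    },
]
```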
## Test Plan
Run the following for distro=(ollama fireworks together)
```
LLAMA_STACK_CONFIG=$distro \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=meta-llama/Llama-3.1-8B-Instruct --vision-inference-model=""
```
# Summary:
Right now we include toolgroup args when we encode messages with
tool_calls, which confuses the model since they are not in the function
description (see the test plan for an example; a sketch of the filtering
follows).
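A minimal sketch of the filtering this implies when re-encoding past tool calls (names are illustrative):
```python
def strip_toolgroup_args(tool_call_args: dict, function_params: set[str]) -> dict:
    # Keep only the arguments that appear in the function description the model
    # actually saw (i.e. drop injected toolgroup args such as vector_db_ids).
    return {k: v for k, v in tool_call_args.items() if k in function_params}
```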
# Test Plan:
Added a print statement before the raw prompt is sent to providers (there
is no good way to test this currently).
Before:
```
cated in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood", vector_db_ids=["829a68735d744dc3830409dcc782964a"])]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found 5 chunks:\nBEGIN of
```
Note the extra `vector_db_ids`
After
```
>user<|end_header_id|>\n\nAre the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood")]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found
```
Groq has never supported raw completions anyhow, so this makes it easier
to switch it to LiteLLM. Our whole test suite passes.
I also updated all the openai-compat providers so they work with API keys
passed from headers (`provider_data`).
## Test Plan
```bash
LLAMA_STACK_CONFIG=groq \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=groq/llama-3.3-70b-versatile --vision-inference-model=""
```
Also tested (openai, anthropic, gemini) providers. No regressions.
# What does this PR do?
Tool format depends on the model. @ehhuang introduced a
`get_default_tool_prompt_format` function for this purpose. We should use
that instead of the hacky model ID matching we had before (see the sketch
below).
Secondly, non-Llama models don't have this concept, so testing with those
models should work as-is.
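A minimal sketch of the intended resolution order, assuming `get_default_tool_prompt_format` is in scope (everything else is illustrative):
```python
def resolve_tool_prompt_format(model_id: str, requested: str | None) -> str | None:
    # Honor an explicit request; otherwise derive the format from the model
    # via the helper mentioned above rather than matching on the model ID.
    if requested is not None:
        return requested
    return get_default_tool_prompt_format(model_id)
```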
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```bash
for distro in fireworks ollama; do
LLAMA_STACK_CONFIG=$distro \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=meta-llama/Llama-3.2-3B-Instruct \
--vision-inference-model=""
done
LLAMA_STACK_CONFIG=dev \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=openai/gpt-4o \
--vision-inference-model=""
```
[//]: # (## Documentation)
This is a follow up to:
https://github.com/meta-llama/llama-stack/pull/1140
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Avoid an unnecessary GPU memory cleanup attempt when the GPU is not used
for training.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
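A minimal sketch of the guard this adds, assuming the torchtune-based provider uses torch (function name is illustrative):
```python
import torch

def cleanup_after_training(model, device: torch.device) -> None:
    del model
    # Only attempt to clear CUDA caches when the GPU was actually used;
    # on CPU-only runs (as in the log below) this step is skipped entirely.
    if device.type == "cuda" and torch.cuda.is_available():
        torch.cuda.empty_cache()
```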
## Test Plan
With CPU:
```
INFO 2025-02-26 16:43:56,267 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
INFO 2025-02-26 16:43:56,274 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
model_file_path /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0
```
With CUDA:
```
INFO 2025-02-26 21:39:24,314 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
INFO 2025-02-26 21:39:24,333 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
model_file_path /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0
```
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# Summary:
Our tests sometimes error out with
```
========================== 11 passed, 342 warnings in 58.86s ==========================
Error exporting span to SQLite: Cannot operate on a closed database.
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x000000012af04280)
Current thread 0x00000001fa29c240 (most recent call first):
<no Python frame>
```
Usually able to repro this by running the tests 10 times.
The proposed fix is to use a thread-local variable for creating the sqlite
connection, to ensure a connection is only used by one thread. Not 100%
sure this is the fix, but I am not able to repro the error with it.
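One way to sketch the thread-local connection handling described above (the actual telemetry exporter code may differ):
```python
import sqlite3
import threading

_local = threading.local()

def get_connection(db_path: str) -> sqlite3.Connection:
    # Each thread lazily opens its own connection, so a connection is never
    # shared with (or closed underneath) another thread at shutdown.
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(db_path)
    return _local.conn
```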
# Test Plan:
Run 10 times and saw no more errors
```
for i in {1..10}; do
  echo "=== Starting Run $i ==="
  LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B
  status=$?
  if [[ $status -ne 0 ]]; then
    echo "=== Run $i FAILED with exit code $status ==="
    break
  else
    echo "=== Run $i PASSED ==="
  fi
  echo
done
```
Summary:
Lets the model decide which tool it needs to call to respond to a query.
Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/ --safety-shield meta-llama/Llama-Guard-3-8B
```
Also evaluated on a small benchmark with 20 questions from HotpotQA.
With this PR and some prompting, the performance is 77% recall compared
to 50% currently.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1015).
* #1268
* #1239
* __->__ #1015