# What does this PR do?
The agent API allows querying multiple vector DBs via the `vector_db_ids`
argument of the `rag` tool:
```py
toolgroups=[
{
"name": "builtin::rag",
"args": {"vector_db_ids": [vector_db_id]},
}
],
```
This means that multiple DBs can be used to compose an aggregated
context by executing the query on each of them.
When documents are passed to the next agent turn, there is no explicit
way to configure the vector DB where the embeddings will be ingested. In
such cases, we assume the following (see the sketch after this list):
- if `vector_db_ids` is given, we use the first one (it probably makes
sense to assume it is the only one in the list; otherwise we would have
to loop over all the given DBs to keep ingestion consistent)
- if no `vector_db_ids` is given, we keep the current logic and generate
a default DB using the default provider; if multiple providers are
defined, the API fails as expected and the user has to specify where to
ingest the documents
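A minimal sketch of the selection logic described above, with an illustrative helper name (not the actual agent code):
```python
def select_vector_db_for_ingestion(vector_db_ids: list[str] | None) -> str | None:
    """Pick the vector DB where attached documents get ingested."""
    if vector_db_ids:
        # Use the first configured DB (assumed to be the only one).
        return vector_db_ids[0]
    # Fall back to the existing behavior: create a default DB with the
    # default provider, which fails if multiple providers are defined.
    return None
```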
(Closes #1270)
## Test Plan
The issue description details how to replicate the problem.
[//]: # (## Documentation)
---------
Signed-off-by: Daniele Martinoli <dmartino@redhat.com>
All of the tests from `llama_stack/providers/tests/` are now moved to
`tests/integration`.
I converted the `tools`, `scoring` and `datasetio` tests to use the API.
However, `eval` and `post_training` proved to be a bit challenging to
convert, so I am leaving those for now. I think `post_training` should be
relatively straightforward as well.
As part of this, I noticed that the `wolfram_alpha` tool wasn't added to
some of our commonly used distros, so I added it. I am going to remove a
lot of code duplication from distros next, so while this looks like a
one-off right now, it will go away and be there uniformly for all
distros.
Summary:
Test Plan:
added new test:
`LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/api/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B`
# What does this PR do?
- This was missed from previous deprecation:
https://github.com/meta-llama/llama-stack/pull/1186
- Part of https://github.com/meta-llama/llama-stack/issues/1396
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
pytest -v -s --nbval-lax ./llama-stack/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
[//]: # (## Documentation)
# What does this PR do?
- Deprecate the `allow_turn_resume` flag, which was only used to stay
backward compatible.
- Closes https://github.com/meta-llama/llama-stack/issues/1363
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/api/agents/test_agents.py --inference-model "meta-llama/Llama-3.3-70B-Instruct" --record-responses
```
<img width="1054" alt="image"
src="https://github.com/user-attachments/assets/d31de2d4-0953-41e1-a71a-7e1579fa351a"
/>
[//]: # (## Documentation)
Continues the refactor of tests.
Tests from `providers/tests` should be considered deprecated. For this
PR, I deleted most of the tests in
- inference
- safety
- agents
since much more comprehensive tests exist in
`tests/integration/{inference,safety,agents}` already.
I moved `test_persistence.py` from agents, but disabled all the tests
since that test needs to be properly migrated.
## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents --vision-inference-model=''
/Users/ashwin/homebrew/Caskroom/miniconda/base/envs/toolchain/lib/python3.10/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
======================================================================================================= test session starts ========================================================================================================
platform darwin -- Python 3.10.16, pytest-8.3.3, pluggy-1.5.0 -- /Users/ashwin/homebrew/Caskroom/miniconda/base/envs/toolchain/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-15.3.1-arm64-arm-64bit', 'Packages': {'pytest': '8.3.3', 'pluggy': '1.5.0'}, 'Plugins': {'asyncio': '0.24.0', 'html': '4.1.1', 'metadata': '3.1.1', 'anyio': '4.8.0', 'nbval': '0.11.0'}}
rootdir: /Users/ashwin/local/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.24.0, html-4.1.1, metadata-3.1.1, anyio-4.8.0, nbval-0.11.0
asyncio: mode=strict, default_loop_scope=None
collected 15 items
agents/test_agents.py::test_agent_simple[txt=8B] PASSED
agents/test_agents.py::test_tool_config[txt=8B] PASSED
agents/test_agents.py::test_builtin_tool_web_search[txt=8B] PASSED
agents/test_agents.py::test_builtin_tool_code_execution[txt=8B] PASSED
agents/test_agents.py::test_code_interpreter_for_attachments[txt=8B] PASSED
agents/test_agents.py::test_custom_tool[txt=8B] PASSED
agents/test_agents.py::test_custom_tool_infinite_loop[txt=8B] PASSED
agents/test_agents.py::test_tool_choice[txt=8B] PASSED
agents/test_agents.py::test_rag_agent[txt=8B-builtin::rag/knowledge_search] PASSED
agents/test_agents.py::test_rag_agent[txt=8B-builtin::rag] PASSED
agents/test_agents.py::test_rag_agent_with_attachments[txt=8B] PASSED
agents/test_agents.py::test_rag_and_code_agent[txt=8B] PASSED
agents/test_agents.py::test_create_turn_response[txt=8B] PASSED
agents/test_persistence.py::test_delete_agents_and_sessions SKIPPED (This test needs to be migrated to api / client-sdk world)
agents/test_persistence.py::test_get_agent_turns_and_steps SKIPPED (This test needs to be migrated to api / client-sdk world)
```
# What does this PR do?
Fix SQL syntax errors caused by hyphens in Vector DB IDs by sanitizing
table names.
(Closes #1332)
## Test Plan
Test confirms table names with hyphens are properly converted to
underscores
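A minimal sketch of the kind of sanitization being tested, assuming a simple regex-based helper (the name and exact rule in the provider may differ):
```python
import re

def sanitize_table_name(vector_db_id: str) -> str:
    # "my-vector-db" -> "my_vector_db": hyphens (and any other characters
    # that are not valid in a SQL identifier) become underscores.
    return re.sub(r"[^a-zA-Z0-9_]", "_", vector_db_id)
```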
Summary:
1. The `tools` parameter we construct to pass to the inference API is
non-deterministic. As a result, our recordable mocks are flaky because the
ordering sometimes changes. This PR makes the `tools` ordering
deterministic and aligned with the order the user specified (see the
sketch after this list).
2. In recordable mock key generation, the client tool's parameter type
used to be 'str' and is now 'string' for some reason. I didn't dig into
exactly why, but just regenerated the fixtures.
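A minimal sketch of what deterministic, user-aligned ordering means in practice (names are illustrative, not the actual implementation):
```python
def order_tools(requested: list[str], available: dict[str, dict]) -> list[dict]:
    # Build the tools list in the exact order the user specified instead of
    # iterating over an unordered collection, so recorded mock keys stay stable.
    return [available[name] for name in requested if name in available]
```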
Test Plan:
Regenerate mocks:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --record-responses
```
Rerun tests without --record-responses:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B
```
Move unittests to tests/unittests. Gradually nuking tests from
providers/tests/ and unifying them into tests/api (which are e2e tests
using SDK types)
## Test Plan
`pytest -s -v tests/unittests/`
# What does this PR do?
We currently use `max_infer_iters` in 2 different ways:
1/ Server: track the number of times we perform inference
2/ Client side: track the number of times we send a `resume_turn` request
This PR gets rid of the need for (2) and makes the server track the total
number of times we perform inference within a Turn (a rough sketch follows
the note below).
**NOTE**
The PR assumes StopReason is set to
- end_of_message: the turn is not finished; we could be waiting for client
tool call responses
- end_of_turn: the entire turn is finished and there is nothing more to
be done.
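A rough sketch of the server-side loop this implies, with hypothetical helpers (`run_inference`, `handle_tool_calls`) standing in for the real agent code:
```python
def run_turn(request, max_infer_iters: int) -> None:
    n_iter = 0
    while n_iter < max_infer_iters:
        message = run_inference(request)   # hypothetical helper
        n_iter += 1                        # counted per inference call within the Turn
        if message.stop_reason == "end_of_turn":
            break                          # nothing more to be done
        # end_of_message: the turn is not finished; execute server-side tool
        # calls (or hand them back to the client) and loop for more inference.
        handle_tool_calls(request, message)  # hypothetical helper
```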
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py::test_custom_tool_infinite_loop --inference-model "meta-llama/Llama-3.3-70B-Instruct"
```
[//]: # (## Documentation)
A self-respecting server needs good observability, which starts with
configurable logging. Llama Stack had very little of it until now. This PR
adds a `logcat` facility towards that end. Call sites look like:
```python
logcat.debug("inference", f"params to ollama: {params}")
```
- the first parameter is a category; there is a static list of
categories in `llama_stack/logcat.py`
- each category can be associated with a log level, which can be
configured via the `LLAMA_STACK_LOGGING` env var
- a value like `LLAMA_STACK_LOGGING="inference=debug;server=info"` does
the obvious thing; there is a special key called `all` which is an alias
for all categories (see the parsing sketch below)
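A minimal sketch of how such an env var could be parsed; the category list and function name are illustrative, not the actual contents of `llama_stack/logcat.py`:
```python
import os

CATEGORIES = ["core", "server", "inference", "agents", "tools"]  # illustrative subset

def parse_logging_env() -> dict[str, str]:
    levels: dict[str, str] = {}
    for entry in filter(None, os.environ.get("LLAMA_STACK_LOGGING", "").split(";")):
        category, _, level = entry.partition("=")
        if category == "all":        # "all" is an alias for every category
            levels.update(dict.fromkeys(CATEGORIES, level))
        elif category in CATEGORIES:
            levels[category] = level
    return levels

# LLAMA_STACK_LOGGING="inference=debug;server=info"
# -> {"inference": "debug", "server": "info"}
```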
## Test Plan
Ran with `LLAMA_STACK_LOGGING="all=debug" llama stack run fireworks` and
saw the following:
[screenshot of the categorized debug log output]
Hit it with a client-sdk test case and saw this:
[screenshot of the resulting request logs]
# What does this PR do?
We want to bundle a bunch of (typically remote) providers in a distro
template and be able to configure them "on the fly" via environment
variables. So far, we have been able to do this with simple env var
replacements. However, sometimes you want to enable providers only
conditionally (because the relevant remote services may not be running,
or may not be relevant). This was not possible until now.
To aid this, we add a simple (bash-like) env var replacement
enhancement: `${env.FOO+bar}` evaluates to `bar` if the variable is SET
and evaluates to empty string if it is not. On top of that, we update
our main resolver to ignore any provider whose ID is null.
This allows using the distro like this:
```bash
llama stack run dev --env CHROMADB_URL=http://localhost:6001 --env ENABLE_CHROMADB=1
```
when only Chroma is UP. This disables the other `pgvector` provider in
the run configuration.
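A minimal sketch of the substitution rule, assuming a regex-based replacement (the actual resolver code may differ):
```python
import os
import re

def replace_env_vars(value: str) -> str:
    # ${env.FOO+bar} -> "bar" when FOO is set in the environment, "" otherwise
    def repl(match: re.Match) -> str:
        name, replacement = match.group(1), match.group(2)
        return replacement if name in os.environ else ""
    return re.sub(r"\$\{env\.(\w+)\+([^}]*)\}", repl, value)

# With ENABLE_CHROMADB unset, a provider id written as
# "${env.ENABLE_CHROMADB+chromadb}" resolves to "", and the resolver
# then drops that provider from the run configuration.
```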
## Test Plan
Hard code `chromadb` as the vector io provider inside
`test_vector_io.py` and run:
```bash
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -s -v tests/client-sdk/vector_io/ --embedding-model all-MiniLM-L6-v2
```
# What does this PR do?
- using `eval` is a security risk
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
- see https://github.com/meta-llama/llama-stack/pull/1327
cc @SLR722 we will need to update the corresponding dataset via
```python
import json

import datasets

def update_to_json_str():
    dataset = datasets.load_dataset(...)
    processed_dataset = dataset[split].map(
        lambda x: {
            # re-serialize the Python-literal string column as proper JSON
            "column": json.dumps(eval(x["column"]))
        }
    )
    processed_dataset.push_to_hub(...)
```
[//]: # (## Documentation)
# What does this PR do?
- Using `eval` on server is a security risk
- Replace `eval` with `json.loads`
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
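An illustrative before/after of the change (not the actual call site):
```python
import json

row = {"column": '["a", "b", "c"]'}

# before: evaluates arbitrary Python expressions coming from stored data
value = eval(row["column"])       # security risk

# after: only parses JSON, which is why the dataset columns are
# re-serialized with json.dumps in the companion PR above
value = json.loads(row["column"])
```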
## Test Plan
```
pytest -v -s --nbval-lax ./llama-stack/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
<img width="747" alt="image"
src="https://github.com/user-attachments/assets/7aff3d95-0b12-4394-b9d0-aeff791eee38"
/>
[//]: # (## Documentation)
# What does this PR do?
This is to be consistent with the OpenAI API and to support vLLM <= v0.6.3.
References:
* https://platform.openai.com/docs/api-reference/chat/create#chat-create-tool_choice
* https://github.com/vllm-project/vllm/pull/10000
This fixes the error when running older versions of vLLM:
```
00:50:19.834 [START] /v1/inference/chat-completion
INFO 2025-02-28 00:50:20,203 httpx:1025: HTTP Request: POST https://api-xeai-granite-3-1-8b-instruct.apps.int.stc.ai.preprod.us-east-1.aws.paas.redhat.com/v1/chat/completions "HTTP/1.1 400 Bad Request"
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 235, in endpoint
return await maybe_await(value)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/server/server.py", line 201, in maybe_await
return await value
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 89, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/distribution/routers/routers.py", line 193, in chat_completion
return await provider.chat_completion(**params)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/utils/telemetry/trace_protocol.py", line 89, in async_wrapper
result = await method(self, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 286, in chat_completion
return await self._nonstream_chat_completion(request, self.client)
File "/usr/local/lib/python3.10/site-packages/llama_stack/providers/remote/inference/vllm/vllm.py", line 292, in _nonstream_chat_completion
r = client.chat.completions.create(**params)
File "/usr/local/lib/python3.10/site-packages/openai/_utils/_utils.py", line 279, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/openai/resources/chat/completions/completions.py", line 879, in create
return self._post(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1290, in post
return cast(ResponseT, self.request(cast_to, opts, stream=stream, stream_cls=stream_cls))
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 967, in request
return self._request(
File "/usr/local/lib/python3.10/site-packages/openai/_base_client.py", line 1071, in _request
raise self._make_status_error_from_response(err.response) from None
openai.BadRequestError: Error code: 400 - {'object': 'error', 'message': "[{'type': 'value_error', 'loc': ('body',), 'msg': 'Value error, When using `tool_choice`, `tools` must be set.', 'input': {'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'What model are you?'}]}], 'model': 'granite-3-1-8b-instruct', 'max_tokens': 4096, 'stream': False, 'temperature': 0.0, 'tools': None, 'tool_choice': 'auto'}, 'ctx': {'error': ValueError('When using `tool_choice`, `tools` must be set.')}}]", 'type': 'BadRequestError', 'param': None, 'code': 400}
INFO: 2600:1700:9d20:ac0::49:59736 - "POST /v1/inference/chat-completion HTTP/1.1" 500 Internal Server Error
00:50:20.266 [END] /v1/inference/chat-completion [StatusCode.OK] (431.99ms)
```
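The gist of the fix, sketched with an illustrative helper (the real change lives in the vLLM provider's request conversion):
```python
def build_chat_params(request) -> dict:
    params = {
        "model": request.model,
        "messages": request.messages,
    }
    # Older vLLM (like the OpenAI API) rejects tool_choice when tools is unset,
    # so only include the tool-related fields when tools are actually provided.
    if request.tools:
        params["tools"] = request.tools
        params["tool_choice"] = request.tool_choice or "auto"
    return params
```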
## Test Plan
All existing tests pass.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Original telemetry outputs for agent turns look like this.
Note how the output was a `str(message)`, making it difficult to read back
for downstream tasks (e.g., building eval datasets):
```
{
│ │ 'input': [
│ │ │ '{"role":"system","content":"You are a helpful assistant. Use search tool to answer the questions. "}',
│ │ │ '{"role":"user","content":"Which teams played in the NBA western conference finals of 2024","context":null}'
│ │ ],
│ │ 'output': "content: tool_calls: [ToolCall(call_id='8b7294ec-a83f-4798-ad8f-6bed662f08b6', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'NBA Western Conference Finals 2024 teams'})]"
│ },
```
Updated the outputs to be structured.
## Test
```python
import uuid
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
model_id = "meta-llama/Llama-3.1-8B-Instruct"
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant who will use the web search tools to help with answering questions.\nOnly provide final answer in short without writing full sentences. Use web search",
    toolgroups=["builtin::websearch"],
    enable_session_persistence=True,
)
agent = Agent(client, agent_config)
session_id = agent.create_session(uuid.uuid4().hex)
response = agent.create_turn(
    messages=[
        {
            "role": "user",
            "content": "latest news about llama stack",
        }
    ],
    session_id=session_id,
    stream=False,
)
pprint(response)
```
Output:
```
Turn(
│ input_messages=[UserMessage(content='latest news about llama stack', role='user', context=None)],
│ output_message=CompletionMessage(
│ │ content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│ │ role='assistant',
│ │ stop_reason='end_of_turn',
│ │ tool_calls=[]
│ ),
│ session_id='77379546-4598-485a-b4f4-84e5da28c513',
│ started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915243, tzinfo=TzInfo(-08:00)),
│ steps=[
│ │ InferenceStep(
│ │ │ api_model_response=CompletionMessage(
│ │ │ │ content='',
│ │ │ │ role='assistant',
│ │ │ │ stop_reason='end_of_turn',
│ │ │ │ tool_calls=[
│ │ │ │ │ ToolCall(
│ │ │ │ │ │ arguments={'query': 'latest news llama stack'},
│ │ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│ │ │ │ │ │ tool_name='brave_search'
│ │ │ │ │ )
│ │ │ │ ]
│ │ │ ),
│ │ │ step_id='81c16bd3-eb00-4721-8edc-f386e07391a3',
│ │ │ step_type='inference',
│ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 637149, tzinfo=TzInfo(-08:00)),
│ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915831, tzinfo=TzInfo(-08:00))
│ │ ),
│ │ ToolExecutionStep(
│ │ │ step_id='4782d609-a62e-45f5-8d2a-25a43db46288',
│ │ │ step_type='tool_execution',
│ │ │ tool_calls=[
│ │ │ │ ToolCall(
│ │ │ │ │ arguments={'query': 'latest news llama stack'},
│ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│ │ │ │ │ tool_name='brave_search'
│ │ │ │ )
│ │ │ ],
│ │ │ tool_responses=[
│ │ │ │ ToolResponse(
│ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b',
│ │ │ │ │ content='{"query": "latest news llama stack", "top_k": [{"title": "Llama 3.2: Revol. ....... Hacker News.", "score": 0.6186197, "raw_content": null}]}',
│ │ │ │ │ tool_name='brave_search',
│ │ │ │ │ metadata=None
│ │ │ │ )
│ │ │ ],
│ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 272176, tzinfo=TzInfo(-08:00)),
│ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 640743, tzinfo=TzInfo(-08:00))
│ │ ),
│ │ InferenceStep(
│ │ │ api_model_response=CompletionMessage(
│ │ │ │ content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│ │ │ │ role='assistant',
│ │ │ │ stop_reason='end_of_turn',
│ │ │ │ tool_calls=[]
│ │ │ ),
│ │ │ step_id='37994419-5da3-4e84-a010-8d9b85366262',
│ │ │ step_type='inference',
│ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 961275, tzinfo=TzInfo(-08:00)),
│ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 273168, tzinfo=TzInfo(-08:00))
│ │ )
│ ],
│ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45',
│ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 962318, tzinfo=TzInfo(-08:00)),
│ output_attachments=[]
)
```
## Check for Telemetry
```python
agent_logs = []
for span in client.telemetry.query_spans(
    attribute_filters=[
        {"key": "session_id", "op": "eq", "value": session_id},
    ],
    attributes_to_return=['input', 'output'],
):
    agent_logs.append(span.attributes)
pprint(json.loads(agent_logs[-1]['output']))
```
```
{
│ 'content': "The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.",
│ 'tool_calls': []
}
```
`ChatCompletionResponseEventType: start` is ignored and not yielded in
the agent_instance, as we expect it to not have any content. However,
litellm sends the first event as `ChatCompletionResponseEventType: start`
with content (which was the first token, and we were skipping it).
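A minimal sketch of the handling this implies; the event/field names follow the description above and are otherwise illustrative:
```python
def iter_content_deltas(events):
    for event in events:
        # Previously the "start" event was dropped outright; with litellm the
        # first token can arrive on it, so only skip it when it has no content.
        if event.event_type == "start" and not event.delta:
            continue
        yield event.delta
```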
```
LLAMA_STACK_CONFIG=dev pytest -s -v tests/client-sdk/agents/test_agents.py --inference-model "openai/gpt-4o-mini" -k test_agent_simple
```
This was failing before (since the word "hello" was not in the final
response).
# What does this PR do?
Updates the NVIDIA inference provider's embedding implementation to use
the new signature and adds support for the `task_type`,
`output_dimensions`, and `text_truncation` parameters.
## Test Plan
`LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/inference/test_embedding.py --embedding-model baai/bge-m3`
Now that remote-vllm includes inline::sentence_transformers, there is an
issue building the image:
`Error building stack: SentenceTransformersInferenceConfig.sample_run_config() got an unexpected keyword argument '__distro_dir__'`
To avoid that issue, this fix extends `sample_run_config` to accept extra
kwargs (sketched below).
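A minimal sketch of the extension, assuming the existing classmethod shape (return value and details are illustrative):
```python
from pydantic import BaseModel

class SentenceTransformersInferenceConfig(BaseModel):
    @classmethod
    def sample_run_config(cls, **kwargs) -> dict:
        # Accept (and ignore) extra build-time kwargs such as __distro_dir__
        # so distro templates that pass them no longer break the image build.
        return {}
```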
Each model known to the system has two identifiers:
- the `provider_resource_id` (what the provider calls it) -- e.g.,
`accounts/fireworks/models/llama-v3p1-8b-instruct`
- the `identifier` (`model_id`) under which it is registered and gets
routed to the appropriate provider.
We have so far used the HuggingFace repo alias as the standardized
identifier you can use to refer to the model. So in the above example,
we'd use `meta-llama/Llama-3.1-8B-Instruct` as the name under which it
gets registered. This makes it convenient for users to refer to these
models across providers.
However, we forgot to also register the _actual_ provider model ID; you
should, of course, be able to route via `provider_resource_id` as well.
This change fixes this (somewhat grave) omission.
*Note*: this change is additive -- more aliases work now compared to
before.
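A sketch of what the additive alias registration amounts to (entry structure and values are illustrative):
```python
# Both names now resolve to the same provider model:
MODEL_ALIASES = [
    {
        "identifier": "meta-llama/Llama-3.1-8B-Instruct",  # HuggingFace repo alias
        "provider_resource_id": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    },
    {
        # the provider's own model ID is now registered as an alias too
        "identifier": "accounts/fireworks/models/llama-v3p1-8b-instruct",
        "provider_resource_id": "accounts/fireworks/models/llama-v3p1-8b-instruct",
    },
]
```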
## Test Plan
Run the following for distro=(ollama fireworks together)
```
LLAMA_STACK_CONFIG=$distro \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=meta-llama/Llama-3.1-8B-Instruct --vision-inference-model=""
```
# Summary:
Right now we include toolgroup args when we encode messages with
tool_calls, which confuses the model since they are not in the function
description (see the test plan for an example; a sketch of the filtering
follows).
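A minimal sketch of the filtering this implies when re-encoding past tool calls (names are illustrative):
```python
def strip_toolgroup_args(tool_call_args: dict, function_params: set[str]) -> dict:
    # Keep only the arguments that appear in the function description the model
    # actually saw (i.e. drop injected toolgroup args such as vector_db_ids).
    return {k: v for k, v in tool_call_args.items() if k in function_params}
```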
# Test Plan:
Added a print statement before the raw prompt is sent to providers (there
is no good way to test this currently).
Before:
```
cated in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood", vector_db_ids=["829a68735d744dc3830409dcc782964a"])]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found 5 chunks:\nBEGIN of
```
Note the extra `vector_db_ids`
After
```
>user<|end_header_id|>\n\nAre the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood")]<|eot_id|><|start_header_id|>ipython<|end_header_id|>\n\nknowledge_search tool found
```
Groq has never supported raw completions anyhow, so this makes it easier
to switch it to LiteLLM. Our whole test suite passes.
I also updated all the openai-compat providers so they work with API keys
passed from headers (`provider_data`).
## Test Plan
```bash
LLAMA_STACK_CONFIG=groq \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=groq/llama-3.3-70b-versatile --vision-inference-model=""
```
Also tested (openai, anthropic, gemini) providers. No regressions.
# What does this PR do?
Tool format depends on the model. @ehhuang introduced a
`get_default_tool_prompt_format` function for this purpose. We should use
that instead of the hacky model ID matching we had before (see the sketch
below).
Secondly, non-Llama models don't have this concept, so testing with those
models should work as-is.
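A minimal sketch of the intended resolution order, assuming `get_default_tool_prompt_format` is in scope (everything else is illustrative):
```python
def resolve_tool_prompt_format(model_id: str, requested: str | None) -> str | None:
    # Honor an explicit request; otherwise derive the format from the model
    # via the helper mentioned above rather than matching on the model ID.
    if requested is not None:
        return requested
    return get_default_tool_prompt_format(model_id)
```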
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```bash
for distro in fireworks ollama; do
LLAMA_STACK_CONFIG=$distro \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=meta-llama/Llama-3.2-3B-Instruct \
--vision-inference-model=""
done
LLAMA_STACK_CONFIG=dev \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=openai/gpt-4o \
--vision-inference-model=""
```
[//]: # (## Documentation)
This is a follow up to:
https://github.com/meta-llama/llama-stack/pull/1140
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Avoid an unnecessary GPU memory cleanup attempt when the GPU is not used
for training.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
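A minimal sketch of the guard this adds, assuming the torchtune-based provider uses torch (function name is illustrative):
```python
import torch

def cleanup_after_training(model, device: torch.device) -> None:
    del model
    # Only attempt to clear CUDA caches when the GPU was actually used;
    # on CPU-only runs (as in the log below) this step is skipped entirely.
    if device.type == "cuda" and torch.cuda.is_available():
        torch.cuda.empty_cache()
```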
## Test Plan
With CPU:
```
INFO 2025-02-26 16:43:56,267 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
INFO 2025-02-26 16:43:56,274 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
model_file_path /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0
```
With CUDA:
```
INFO 2025-02-26 21:39:24,314 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
INFO 2025-02-26 21:39:24,333 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
model_file_path /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0
```
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# Summary:
Our tests sometimes error out with
```
========================== 11 passed, 342 warnings in 58.86s ==========================
Error exporting span to SQLite: Cannot operate on a closed database.
Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0x000000012af04280)
Current thread 0x00000001fa29c240 (most recent call first):
<no Python frame>
```
Usually able to repro this by running the tests 10 times.
The proposed fix is to use a thread-local variable for creating the sqlite
connection, to ensure a connection is only used by one thread. Not 100%
sure this is the fix, but I am not able to repro the error with it.
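One way to sketch the thread-local connection handling described above (the actual telemetry exporter code may differ):
```python
import sqlite3
import threading

_local = threading.local()

def get_connection(db_path: str) -> sqlite3.Connection:
    # Each thread lazily opens its own connection, so a connection is never
    # shared with (or closed underneath) another thread at shutdown.
    if not hasattr(_local, "conn"):
        _local.conn = sqlite3.connect(db_path)
    return _local.conn
```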
# Test Plan:
Run 10 times and saw no more errors
```
for i in {1..10}; do
  echo "=== Starting Run $i ==="
  LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B
  status=$?
  if [[ $status -ne 0 ]]; then
    echo "=== Run $i FAILED with exit code $status ==="
    break
  else
    echo "=== Run $i PASSED ==="
  fi
  echo
done
```
Summary:
Lets the model decide which tool it needs to call to respond to a query.
Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/ --safety-shield meta-llama/Llama-Guard-3-8B
```
Also evaluated on a small benchmark with 20 questions from HotpotQA.
With this PR and some prompting, the performance is 77% recall compared
to 50% currently.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1015).
* #1268
* #1239
* __->__ #1015