llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-05 02:17:31 +00:00

Author	SHA1	Message	Date
Xi Yan	43044f29e2	fix: fix llama stack run with missing agent impl (#1559 ) # What does this PR do? - recent merge https://github.com/meta-llama/llama-stack/pull/1410 introduce error ``` ValueError: Provider meta-reference (Api.agents) does not implement the following methods: [('list_agent_sessions', 'not_actually_implemented'), ('list_agents', 'not_actually_implemented')] ``` [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` llama stack run ``` ``` LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.3-70B-Instruct ``` `1379530386` [//]: # (## Documentation)	2025-03-11 11:22:22 -07:00
Ihar Hrachyshka	c3d7d17bc4	chore: fix typing hints for get_provider_impl deps arguments (#1544 ) # What does this PR do? It's a dict that may contain different types, as per resolver:instantiate_provider implementation. (AFAIU it also never contains ProviderSpecs, but instances of provider implementations.) [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan mypy passing if enabled checks for these modules. (See #1543) [//]: # (## Documentation) Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-03-11 10:07:28 -07:00
Ihar Hrachyshka	0e73186a11	fix: Add missing shutdown handler for TorchtunePostTrainingImpl (#1535 ) # What does this PR do? Added missing shutdown handler. (Currently empty.) Without it, when server shuts down, it posts the following warning: ``` __main__:129 server: No shutdown method for TorchtunePostTrainingImpl ``` Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan (The test plan assumes shutdown logic is fixed, see #1495) Without the patch: ``` INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit) INFO: Shutting down INFO: Waiting for application shutdown. INFO 2025-03-10 20:56:43,961 __main__:140 server: Shutting down INFO 2025-03-10 20:56:43,962 __main__:124 server: Shutting down DatasetsRoutingTable INFO 2025-03-10 20:56:43,964 __main__:124 server: Shutting down DatasetIORouter INFO 2025-03-10 20:56:43,965 __main__:124 server: Shutting down ScoringFunctionsRoutingTable INFO 2025-03-10 20:56:43,966 __main__:124 server: Shutting down ScoringRouter INFO 2025-03-10 20:56:43,967 __main__:124 server: Shutting down ModelsRoutingTable INFO 2025-03-10 20:56:43,968 __main__:124 server: Shutting down InferenceRouter INFO 2025-03-10 20:56:43,969 __main__:124 server: Shutting down ShieldsRoutingTable INFO 2025-03-10 20:56:43,971 __main__:124 server: Shutting down SafetyRouter INFO 2025-03-10 20:56:43,972 __main__:124 server: Shutting down VectorDBsRoutingTable INFO 2025-03-10 20:56:43,973 __main__:124 server: Shutting down VectorIORouter INFO 2025-03-10 20:56:43,974 __main__:124 server: Shutting down ToolGroupsRoutingTable INFO 2025-03-10 20:56:43,975 __main__:124 server: Shutting down ToolRuntimeRouter INFO 2025-03-10 20:56:43,976 __main__:124 server: Shutting down MetaReferenceAgentsImpl INFO 2025-03-10 20:56:43,977 __main__:124 server: Shutting down TelemetryAdapter INFO 2025-03-10 20:56:43,978 __main__:124 server: Shutting down TorchtunePostTrainingImpl WARNING 2025-03-10 20:56:43,979 __main__:129 server: No shutdown method for TorchtunePostTrainingImpl INFO 2025-03-10 20:56:43,979 __main__:124 server: Shutting down BenchmarksRoutingTable INFO 2025-03-10 20:56:43,980 __main__:124 server: Shutting down EvalRouter INFO 2025-03-10 20:56:43,981 __main__:124 server: Shutting down DistributionInspectImpl INFO: Application shutdown complete. INFO: Finished server process [33862] ``` Run with the patch and observe no warning: ``` $ kill -INT $(ps ax \| grep llama_stack.distribution.server.server \| grep -v nvim \| awk -e '{print $1}' \| sort \| head -n 1) ``` ``` INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit) INFO: Shutting down INFO: Waiting for application shutdown. INFO 2025-03-11 00:32:56,863 __main__:140 server: Shutting down INFO 2025-03-11 00:32:56,864 __main__:124 server: Shutting down DatasetsRoutingTable INFO 2025-03-11 00:32:56,866 __main__:124 server: Shutting down DatasetIORouter INFO 2025-03-11 00:32:56,867 __main__:124 server: Shutting down ScoringFunctionsRoutingTable INFO 2025-03-11 00:32:56,868 __main__:124 server: Shutting down ScoringRouter INFO 2025-03-11 00:32:56,869 __main__:124 server: Shutting down ModelsRoutingTable INFO 2025-03-11 00:32:56,870 __main__:124 server: Shutting down InferenceRouter INFO 2025-03-11 00:32:56,871 __main__:124 server: Shutting down ShieldsRoutingTable INFO 2025-03-11 00:32:56,872 __main__:124 server: Shutting down SafetyRouter INFO 2025-03-11 00:32:56,873 __main__:124 server: Shutting down VectorDBsRoutingTable INFO 2025-03-11 00:32:56,874 __main__:124 server: Shutting down VectorIORouter INFO 2025-03-11 00:32:56,875 __main__:124 server: Shutting down ToolGroupsRoutingTable INFO 2025-03-11 00:32:56,876 __main__:124 server: Shutting down ToolRuntimeRouter INFO 2025-03-11 00:32:56,877 __main__:124 server: Shutting down MetaReferenceAgentsImpl INFO 2025-03-11 00:32:56,878 __main__:124 server: Shutting down TelemetryAdapter INFO 2025-03-11 00:32:56,879 __main__:124 server: Shutting down TorchtunePostTrainingImpl INFO 2025-03-11 00:32:56,880 __main__:124 server: Shutting down BenchmarksRoutingTable INFO 2025-03-11 00:32:56,881 __main__:124 server: Shutting down EvalRouter INFO 2025-03-11 00:32:56,882 __main__:124 server: Shutting down DistributionInspectImpl ``` [//]: # (## Documentation) Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-03-11 10:01:09 -07:00
Dinesh Yeduguru	ead9397e22	fix: tracing fixes for trace context propogation across coroutines (#1522 ) # What does this PR do? This PR has two fixes needed for correct trace context propagation across asycnio boundary Fix 1: Start using context vars to store the global trace context. This is needed since we cannot use the same trace context across coroutines since the state is shared. each coroutine should have its own trace context so that each of it can start storing its state correctly. Fix 2: Start a new span for each new coroutines started for running shields to keep the span tree clean ## Test Plan ### Integration tests with server LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct server logs: https://gist.github.com/dineshyv/51ac5d9864ed031d0d89ce77352821fe test logs: https://gist.github.com/dineshyv/e66acc1c4648a42f1854600609c467f3 ### Integration tests with library client LLAMA_STACK_CONFIG=fireworks pytest -s --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct logs: https://gist.github.com/dineshyv/ca160696a0b167223378673fb1dcefb8 ### Apps test with server: ``` LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml python -m examples.agents.e2e_loop_with_client_tools localhost 8321 ``` server logs: https://gist.github.com/dineshyv/1717a572d8f7c14279c36123b79c5797 app logs: https://gist.github.com/dineshyv/44167e9f57806a0ba3b710c32aec02f8	2025-03-11 07:12:48 -07:00
Botao Chen	e3edca7739	feat: [new open benchmark] Math 500 (#1538 ) ## What does this PR do? Created a new math_500 open-benchmark based on OpenAI's [Let's Verify Step by Step](https://arxiv.org/abs/2305.20050) paper and hugging face's [HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500) dataset. The challenge part of this benchmark is to parse the generated and expected answer and verify if they are same. For the parsing part, we refer to [Minerva: Solving Quantitative Reasoning Problems with Language Models](https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/). To simply the parse logic, as the next step, we plan to also refer to what [simple-eval](https://github.com/openai/simple-evals) is doing, using llm as judge to check if the generated answer matches the expected answer or not ## Test Plan on sever side, spin up a server with open-benchmark template `llama stack run llama_stack/templates/open-benchamrk/run.yaml` on client side, issue an open benchmark eval request `llama-stack-client --endpoint xxx eval run-benchmark "meta-reference-math-500" --model-id "meta-llama/Llama-3.3-70B-Instruct" --output-dir "/home/markchen1015/" --num-examples 20` and get ther aggregated eval results <img width="238" alt="Screenshot 2025-03-10 at 7 57 04 PM" src="https://github.com/user-attachments/assets/2c9da042-3b70-470e-a7c4-69f4cc24d1fb" /> check the generated answer and the related scoring and they make sense	2025-03-10 20:38:28 -07:00
Sarthak Deshpande	a9c5d3cd3d	chore: made inbuilt tools blocking calls into async non blocking calls (#1509 ) # What does this PR do? This PR converts blocking calls for in built tools like wolfram, brave, tavily and bing into non blocking async calls [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] pytest -s -v tool_runtime/test_builtin_tools.py --stack-config=together --text-model=meta-llama/Llama-3.1-8B-Instruct Used the command above to get the below results <img width="1710" alt="image" src="https://github.com/user-attachments/assets/76b0ca06-f6e4-45fa-a114-0449bef2325b" /> <img width="1389" alt="image" src="https://github.com/user-attachments/assets/5220ccbb-7882-4240-b17e-f362ad46d25b" /> <img width="1432" alt="image" src="https://github.com/user-attachments/assets/bb93a41e-e82a-4c98-a22d-6b0e320aa974" /> [//]: # (## Documentation) --------- Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>	2025-03-09 16:59:24 -07:00
ehhuang	23e39cc3c4	fix: handle log errors (#1499 ) Summary: \| File "/Users/erichuang/projects/llama-stack/llama_stack/distribution/server/server.py", line 213, in sse_generator \| logger.exception(f"Error in sse_generator: {e}") \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1864, in exception \| self.log(ERROR, msg, args, exc_info=exc_info, kwargs) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1879, in log \| self.logger.log(level, msg, args, kwargs) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1547, in log \| self._log(level, msg, args, kwargs) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1624, in _log \| self.handle(record) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1634, in handle \| self.callHandlers(record) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1696, in callHandlers \| hdlr.handle(record) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 968, in handle \| self.emit(record) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py", line 167, in emit \| message_renderable = self.render_message(record, message) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py", line 193, in render_message \| message_text = Text.from_markup(message) if use_markup else Text(message) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py", line 287, in from_markup \| rendered_text = render(text, style, emoji=emoji, emoji_variant=emoji_variant) \| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py", line 167, in render \| raise MarkupError( \| rich.errors.MarkupError: closing tag '[/INST]' at position 105 doesn't match any open tag Test Plan: reran failing rag_with_vector_db example	2025-03-07 15:58:26 -08:00
Fred Reiss	a8d0cdaf37	feat: updated inline vllm inference provider (#880 ) # What does this PR do? This PR updates the inline vLLM inference provider in several significant ways: * Models are now attached at run time to instances of the provider via the `.../models` API instead of hard-coding the model's full name into the provider's YAML configuration. * The provider supports models that are not Meta Llama models. Any model that vLLM supports can be loaded by passing Huggingface coordinates in the "provider_model_id" field. Custom fine-tuned versions of Meta Llama models can be loaded by specifying a path on local disk in the "provider_model_id". * To implement full chat completions support, including tool calling and constrained decoding, the provider now routes the `chat_completions` API to a captive (i.e. called directly in-process, not via HTTPS) instance of vLLM's OpenAI-compatible server . * The `logprobs` parameter and completions API are also working. ## Test Plan Existing tests in `llama_stack/providers/tests/inference/test_text_inference.py` have good coverage of the new functionality. These tests can be invoked as follows: ``` cd llama-stack && pytest \ -vvv \ llama_stack/providers/tests/inference/test_text_inference.py \ --providers inference=vllm \ --inference-model meta-llama/Llama-3.2-3B-Instruct ====================================== test session starts ====================================== platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12 cachedir: .pytest_cache metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'} rootdir: /mnt/datadisk1/freiss/llama/llama-stack configfile: pyproject.toml plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2 asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None collected 9 items llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%] llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%] =========================== 9 passed, 13 warnings in 97.18s (0:01:37) =========================== ``` ## Sources ## Before submitting - [X] Ran pre-commit to handle lint / formatting issues. - [X] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests. --------- Co-authored-by: Sébastien Han <seb@redhat.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2025-03-07 13:38:23 -08:00
ehhuang	acbae66b9d	chore: escape tool output for logging (#1490 ) Summary: error: llama_stack/providers/inline/agents/meta_reference/agent_instance.py:1032: in execute_tool_call_maybe logger.info(f"tool call {name} completed with result: {result}") /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1841: in info self.log(INFO, msg, args, kwargs) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1879: in log self.logger.log(level, msg, args, kwargs) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1547: in log self._log(level, msg, args, kwargs) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1624: in _log self.handle(record) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1634: in handle self.callHandlers(record) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1696: in callHandlers hdlr.handle(record) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:968: in handle self.emit(record) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:167: in emit message_renderable = self.render_message(record, message) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:193: in render_message message_text = Text.from_markup(message) if use_markup else Text(message) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py:287: in from_markup rendered_text = render(text, style, emoji=emoji, emoji_variant=emoji_variant) /opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py:167: in render raise MarkupError( E rich.errors.MarkupError: closing tag '[/INST]' at position 3274 doesn't match any open tag Test Plan:	2025-03-07 13:33:45 -08:00
Sébastien Han	7cf1e24c4e	feat(logging): implement category-based logging (#1362 ) # What does this PR do? This commit introduces a new logging system that allows loggers to be assigned a category while retaining the logger name based on the file name. The log format includes both the logger name and the category, producing output like: ``` INFO 2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by tavily-search ``` Key features include: - Category-based logging: Loggers can be assigned a category (e.g., "core", "server") when programming. The logger can be loaded like this: `logger = get_logger(name=__name__, category="server")` - Environment variable control: Log levels can be configured per-category using the `LLAMA_STACK_LOGGING` environment variable. For example: `LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for the "server" and "core" categories. - `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all categories and third-party libraries. This provides fine-grained control over logging levels while maintaining a clean and informative log format. The formatter uses the rich library which provides nice colors better stack traces like so: ``` ERROR 2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146> exception=UnboundLocalError("local variable 'loop' referenced before assignment")> ╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮ │ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown │ │ │ │ 175 │ │ except asyncio.CancelledError: │ │ 176 │ │ │ pass │ │ 177 │ │ finally: │ │ ❱ 178 │ │ │ loop.stop() │ │ 179 │ │ │ 180 │ loop = asyncio.get_running_loop() │ │ 181 │ loop.create_task(shutdown()) │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ UnboundLocalError: local variable 'loop' referenced before assignment ``` Co-authored-by: Ashwin Bharambe <@ashwinb> Signed-off-by: Sébastien Han <seb@redhat.com> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml INFO 2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml INFO 2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration: INFO 2025-03-03 21:55:35,928 __main__:380 [server]: apis: - agents ``` [//]: # (## Documentation) --------- Signed-off-by: Sébastien Han <seb@redhat.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2025-03-07 11:34:30 -08:00
Dinesh Yeduguru	60e7f3d705	fix: Revert "feat: record token usage for inference API (#1300 )" (#1476 ) This reverts commit `b8535417e0`. Test plan: LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml python -m examples.agents.e2e_loop_with_client_tools localhost 8321	2025-03-07 10:16:47 -08:00
Ashwin Bharambe	330cc9d09d	feat: add Milvus vectorDB (#1467 ) # What does this PR do? See https://github.com/meta-llama/llama-stack/pull/1171 which is the original PR. Author: @zc277584121 feat: add [Milvus](https://milvus.io/) vectorDB note: I use the MilvusClient to implement it instead of AsyncMilvusClient, because when I tested AsyncMilvusClient, it would raise issues about evenloop, which I think AsyncMilvusClient SDK is not robust enough to be compatible with llama_stack framework. ## Test Plan have passed the unit test and ene2end test Here is my end2end test logs, including the client code, client log, server logs from inline and remote settings [test_end2end_logs.zip](https://github.com/user-attachments/files/18964391/test_end2end_logs.zip) --------- Signed-off-by: ChengZi <chen.zhang@zilliz.com> Co-authored-by: Cheney Zhang <chen.zhang@zilliz.com>	2025-03-06 20:59:31 -08:00
Sébastien Han	803bf0e029	fix: solve ruff B008 warnings (#1444 ) # What does this PR do? The commit addresses the Ruff warning B008 by refactoring the code to avoid calling SamplingParams() directly in function argument defaults. Instead, it either uses Field(default_factory=SamplingParams) for Pydantic models or sets the default to None and instantiates SamplingParams inside the function body when the argument is None. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-03-06 16:48:35 -08:00
Xi Yan	bcb13c492f	test: revamp eval related integration tests (#1433 ) # What does this PR do? - revamp and clean up datasets/scoring/eval integration tests - closes https://github.com/meta-llama/llama-stack/issues/1396 [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan dataset ``` LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/integration/datasetio/ ``` <img width="842" alt="image" src="https://github.com/user-attachments/assets/88fc2b6a-b496-47bf-bc0c-8fea48ba36ff" /> scoring ``` LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring --text-model meta-llama/Llama-3.1-8B-Instruct --judge-model meta-llama/Llama-3.1-8B-Instruct ``` <img width="851" alt="image" src="https://github.com/user-attachments/assets/50f46415-b44c-4c37-a6c3-076f2767adb3" /> eval ``` LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/eval --text-model meta-llama/Llama-3.1-8B-Instruct --judge-model meta-llama/Llama-3.1-8B-Instruct ``` <img width="841" alt="image" src="https://github.com/user-attachments/assets/8eb1c65c-3b39-4d66-8ff4-f471ca783e49" /> [//]: # (## Documentation)	2025-03-06 10:51:35 -08:00
Ashwin Bharambe	2fe976ed0a	refactor(test): introduce --stack-config and simplify options (#1404 ) You now run the integration tests with these options: ```bash Custom options: --stack-config=STACK_CONFIG a 'pointer' to the stack. this can be either be: (a) a template name like `fireworks`, or (b) a path to a run.yaml file, or (c) an adhoc config spec, e.g. `inference=fireworks,safety=llama-guard,agents=meta- reference` --env=ENV Set environment variables, e.g. --env KEY=value --text-model=TEXT_MODEL comma-separated list of text models. Fixture name: text_model_id --vision-model=VISION_MODEL comma-separated list of vision models. Fixture name: vision_model_id --embedding-model=EMBEDDING_MODEL comma-separated list of embedding models. Fixture name: embedding_model_id --safety-shield=SAFETY_SHIELD comma-separated list of safety shields. Fixture name: shield_id --judge-model=JUDGE_MODEL comma-separated list of judge models. Fixture name: judge_model_id --embedding-dimension=EMBEDDING_DIMENSION Output dimensionality of the embedding model to use for testing. Default: 384 --record-responses Record new API responses instead of using cached ones. --report=REPORT Path where the test report should be written, e.g. --report=/path/to/report.md ``` Importantly, if you don't specify any of the models (text-model, vision-model, etc.) the relevant tests will get skipped! This will make running tests somewhat more annoying since all options will need to be specified. We will make this easier by adding some easy wrapper yaml configs. ## Test Plan Example: ```bash ashwin@ashwin-mbp ~/local/llama-stack/tests/integration (unify_tests) $ LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/test_text_inference.py \ --text-model meta-llama/Llama-3.2-3B-Instruct ```	2025-03-05 17:02:02 -08:00
ehhuang	6cf79437b3	feat: support ClientTool output metadata (#1426 ) # Summary: Client side change in https://github.com/meta-llama/llama-stack-client-python/pull/180 Changes the resume_turn API to accept `ToolResponse` instead of `ToolResponseMessage`: 1. `ToolResponse` contains `metadata` 2. `ToolResponseMessage` is a concept for model inputs. Here we are just submitting the outputs of tool execution. # Test Plan: Ran integration tests with newly added test using client tool with metadata LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --record-responses	2025-03-05 14:30:27 -08:00
Dinesh Yeduguru	b8535417e0	feat: record token usage for inference API (#1300 ) # What does this PR do? Inference router computes the token usage related metrics for all providers and returns the metrics as part of response and also logs to telemetry. ## Test Plan LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml ``` curl --request POST \ --url http://localhost:8321/v1/inference/chat-completion \ --header 'content-type: application/json' \ --data '{ "model_id": "meta-llama/Llama-3.1-70B-Instruct", "messages": [ { "role": "user", "content": { "type": "text", "text": "where do humans live" } } ], "stream": false }' \| jq . { "metrics": [ { "trace_id": "yjv1tf0jS1evOyPm", "span_id": "WqYKvg0_", "timestamp": "2025-02-27T18:55:10.770903Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "prompt_tokens", "value": 10, "unit": "tokens" }, { "trace_id": "yjv1tf0jS1evOyPm", "span_id": "WqYKvg0_", "timestamp": "2025-02-27T18:55:10.770916Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "completion_tokens", "value": 411, "unit": "tokens" }, { "trace_id": "yjv1tf0jS1evOyPm", "span_id": "WqYKvg0_", "timestamp": "2025-02-27T18:55:10.770919Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "total_tokens", "value": 421, "unit": "tokens" } ], "completion_message": { "role": "assistant", "content": "Humans live in various parts of the world, inhabiting almost every continent, country, and region. Here's a breakdown of where humans live:\n\n1. Continents: Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica (research stations only)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. Countries: There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. Regions: Humans live in diverse regions, including:\n\t* Deserts (e.g., Sahara, Mojave)\n\t* Forests (e.g., Amazon, Congo)\n\t* Grasslands (e.g., Prairies, Steppes)\n\t* Mountains (e.g., Himalayas, Andes)\n\t* Oceans (e.g., coastal areas, islands)\n\t* Tundras (e.g., Arctic, sub-Arctic)\n4. Cities and towns: Many humans live in urban areas, such as cities and towns, which are often located near:\n\t* Coastlines\n\t* Rivers\n\t* Lakes\n\t* Mountains\n5. Rural areas: Some humans live in rural areas, such as:\n\t* Villages\n\t* Farms\n\t* Countryside\n6. Islands: Humans inhabit many islands, including:\n\t* Tropical islands (e.g., Hawaii, Maldives)\n\t* Arctic islands (e.g., Greenland, Iceland)\n\t* Continental islands (e.g., Great Britain, Ireland)\n7. Extreme environments: Humans also live in extreme environments, such as:\n\t* High-altitude areas (e.g., Tibet, Andes)\n\t* Low-altitude areas (e.g., Death Valley, Dead Sea)\n\t* Areas with extreme temperatures (e.g., Arctic, Sahara)\n\nOverall, humans have adapted to live in a wide range of environments and ecosystems around the world.", "stop_reason": "end_of_turn", "tool_calls": [] }, "logprobs": null } ``` ``` LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/inference ======================================================================== short test summary info ========================================================================= FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B:vis=11B-inference:chat_completion:tool_calling_tools_absent-True] - ValueError: Unsupported tool prompt format: ToolPromptFormat.json FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B:vis=11B-inference:chat_completion:tool_calling_tools_absent-False] - ValueError: Unsupported tool prompt format: ToolPromptFormat.json FAILED tests/integration/inference/test_vision_inference.py::test_image_chat_completion_non_streaming[txt=8B:vis=11B] - fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'Failed to decode image cannot identify image f... FAILED tests/integration/inference/test_vision_inference.py::test_image_chat_completion_streaming[txt=8B:vis=11B] - fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'Failed to decode image cannot identify image f... ========================================================= 4 failed, 16 passed, 23 xfailed, 17 warnings in 44.36s ========================================================= ```	2025-03-05 12:41:45 -08:00
yyymeta	1c6fbd95a5	fix: regex parser to support more answer formats (#1425 ) # What does this PR do? add better-performance prompt: existing prompts expect a generated response that ends in "Answer :". But during test, we found that for GPQA, the prompt used by meta internal genEval "The best answer is [ABCD]" achieves higher accuracy . ## Test Plan ``` (myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$llama-stack-client eval run-benchmark "meta-reference-gpqa-cot" --model-id meta-llama/Llama-4-17B-Llama-API --output-dir /tmp/gpqa --num-examples 20 .... Sending HTTP Request: GET http://localhost:5001/v1/scoring-functions/basic::regex_parser_multiple_choice_answer 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20/20 [ 0:04:46 < 0:00:00 , 0 it/s ] ✓ Results saved to: /tmp/gpqa/meta-reference-gpqa-cot_results.json! (myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ (myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ (myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ (myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ tail /tmp/gpqa/meta-reference-gpqa-cot_results.json { "score": 0.0 }, { "accuracy": 0.5, "num_correct": 10.0, "num_total": 20 } ] }(myenv) [yyy@devgpu018.nha2 ~/internal-llama-stack (yyy)]$ ``` [//]: # (## Documentation)	2025-03-05 11:52:07 -08:00
Xi Yan	d3508c4c76	feat(1/n): scoring function registration for llm-as-judge (#1405 ) # What does this PR do? - add ability to register a llm-as-judge scoring function with custom judge prompts / params. - Closes https://github.com/meta-llama/llama-stack/issues/1395 [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Via CLI ``` llama-stack-client scoring_functions register \ --scoring-fn-id "llm-as-judge::my-prompt" \ --description "my custom judge" \ --return-type '{"type": "string"}' \ --provider-id "llm-as-judge" \ --provider-scoring-fn-id "my-prompt" \ --params '{"type": "llm_as_judge", "judge_model": "meta-llama/Llama-3.2-3B-Instruct", "prompt_template": "always output 1.0"}' ``` <img width="1373" alt="image" src="https://github.com/user-attachments/assets/7c6fc0ae-64fe-4581-8927-a9d8d746bd72" /> - Unit test will be addressed with https://github.com/meta-llama/llama-stack/issues/1396 [//]: # (## Documentation)	2025-03-05 10:00:34 -08:00
Daniele Martinoli	fb998683e0	fix: Agent uses the first configured vector_db_id when documents are provided (#1276 ) # What does this PR do? The agent API allows to query multiple DBs using the `vector_db_ids` argument of the `rag` tool: ```py toolgroups=[ { "name": "builtin::rag", "args": {"vector_db_ids": [vector_db_id]}, } ], ``` This means that multiple DBs can be used to compose an aggregated context by executing the query on each of them. When documents are passed to the next agent turn, there is no explicit way to configure the vector DB where the embeddings will be ingested. In such cases, we can assume that: - if any `vector_db_ids` is given, we use the first one (it probably makes sense to assume that it's the only one in the list, otherwise we should loop on all the given DBs to have a consistent ingestion) - if no `vector_db_ids` is given, we can use the current logic to generate a default DB using the default provider. If multiple providers are defined, the API will fail as expected: the user has to provide details on where to ingest the documents. (Closes #1270) ## Test Plan The issue description details how to replicate the problem. [//]: # (## Documentation) --------- Signed-off-by: Daniele Martinoli <dmartino@redhat.com>	2025-03-04 21:44:13 -08:00
Xi Yan	78962be996	chore: refactor create_and_execute_turn and resume_turn (#1399 ) # What does this PR do? - Closes https://github.com/meta-llama/llama-stack/issues/1212 [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/integration/agents/test_agents.py --inference-model "meta-llama/Llama-3.3-70B-Instruct" ``` <img width="1203" alt="image" src="https://github.com/user-attachments/assets/35b60017-b3f2-4e98-87f2-2868730261bd" /> ``` LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/agents/test_agents.py::test_rag_and_code_agent --inference-model "meta-llama/Llama-3.3-70B-Instruct" ``` [//]: # (## Documentation)	2025-03-04 16:07:30 -08:00
Ashwin Bharambe	dd0db8038b	refactor(test): unify vector_io tests and make them configurable (#1398 ) ## Test Plan `LLAMA_STACK_CONFIG=inference=sentence-transformers,vector_io=sqlite-vec pytest -s -v test_vector_io.py --embedding-model all-miniLM-L6-V2 --inference-model='' --vision-inference-model=''` ``` test_vector_io.py::test_vector_db_retrieve[txt=:vis=:emb=all-miniLM-L6-V2] PASSED test_vector_io.py::test_vector_db_register[txt=:vis=:emb=all-miniLM-L6-V2] PASSED test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case0] PASSED test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case1] PASSED test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case2] PASSED test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case3] PASSED test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case4] PASSED ``` Same thing with: - LLAMA_STACK_CONFIG=inference=sentence-transformers,vector_io=faiss - LLAMA_STACK_CONFIG=fireworks (Note that ergonomics will soon be improved re: cmd-line options and env variables)	2025-03-04 13:37:45 -08:00
ehhuang	fd8c991393	fix: rag as attachment bug (#1392 ) Summary: Test Plan: added new test LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/api/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B	2025-03-04 13:08:16 -08:00
Xi Yan	e9a37bad63	chore: rename task_config to benchmark_config (#1397 ) # What does this PR do? - This was missed from previous deprecation: https://github.com/meta-llama/llama-stack/pull/1186 - Part of https://github.com/meta-llama/llama-stack/issues/1396 [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` pytest -v -s --nbval-lax ./llama-stack/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb ``` [//]: # (## Documentation)	2025-03-04 12:44:04 -08:00
Xi Yan	158b6dc404	chore: deprecate allow_turn_resume (#1377 ) # What does this PR do? - Deprecate allow_turn_resume flag as this is used for staying backward compat. - Closes https://github.com/meta-llama/llama-stack/issues/1363 [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` LLAMA_STACK_CONFIG=fireworks pytest -v tests/api/agents/test_agents.py --inference-model "meta-llama/Llama-3.3-70B-Instruct" --record-responses ``` <img width="1054" alt="image" src="https://github.com/user-attachments/assets/d31de2d4-0953-41e1-a71a-7e1579fa351a" /> [//]: # (## Documentation)	2025-03-04 12:22:11 -08:00
ehhuang	07a992ef90	feat: deterministic tools ordering (#1380 ) Summary: 1. The `tools` parameter we construct to pass the inference API is non-deterministic. As a result, our recordable mocks is flaky as the ordering change sometimes. This PR makes it so that `tools` ordering is deterministic and aligned with the order user specified. 2. In recordable mock key generation, client tool's parameter type was 'str' and now is 'string' for some reason. I didn't dig into exactly why, but just regenerated the fixtures. Test Plan: Regenerate mocks: ``` LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --record-responses ``` Rerun tests without --record-responses: ``` LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B ```	2025-03-03 20:38:07 -08:00
Ashwin Bharambe	0a76ece249	feat: add more logs to agent_instance.py	2025-03-03 16:15:47 -08:00
Ashwin Bharambe	816fdf289a	refactor: move generation.py to llama3	2025-03-03 13:50:19 -08:00
Ashwin Bharambe	02066591b8	refactor: move generation.py to llama3	2025-03-03 13:46:50 -08:00
Ashwin Bharambe	725423c95c	refactor: move llama3 impl to meta_reference provider (#1364 ) Just moving bits to a better place ## Test Plan ```bash torchrun $CONDA_PREFIX/bin/pytest -s -v test_text_inference.py ```	2025-03-03 13:22:57 -08:00
Xi Yan	7d111c7510	feat: unify max_infer_iters in client/server agent loop (#1309 ) # What does this PR do? We currently use `max_infer_iters` in 2 different ways 1/ Server: track number of times 2/ Client side: track number of times we send `resume_turn` request This PR gets rid of the need of (2) and makes server track total number of times we perform inference within a Turn NOTE The PR will assume StopReason is set to - end_of_message: turn is not finished, we could be waiting for client tool call responses - end_of_turn: if the entire turn is finished and there's no more things to be done. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py::test_custom_tool_infinite_loop --inference-model "meta-llama/Llama-3.3-70B-Instruct" ``` [//]: # (## Documentation)	2025-03-03 10:08:36 -08:00
Ashwin Bharambe	46b0a404e8	chore: remove straggler references to llama-models (#1345 ) Straggler references cleanup	2025-03-01 14:26:03 -08:00
Ashwin Bharambe	8bbd52bb9f	chore: remove dependency on llama_models completely (#1344 )	2025-03-01 12:48:08 -08:00
Ashwin Bharambe	6609d4ada4	feat: allow conditionally enabling providers in run.yaml (#1321 ) # What does this PR do? We want to bundle a bunch of (typically remote) providers in a distro template and be able to configure them "on the fly" via environment variables. So far, we have been able to do this with simple env var replacements. However, sometimes you want to only conditionally enable providers (because the relevant remote services may not be alive, or relevant.) This was not possible until now. To aid this, we add a simple (bash-like) env var replacement enhancement: `${env.FOO+bar}` evaluates to `bar` if the variable is SET and evaluates to empty string if it is not. On top of that, we update our main resolver to ignore any provider whose ID is null. This allows using the distro like this: ```bash llama stack run dev --env CHROMADB_URL=http://localhost:6001 --env ENABLE_CHROMADB=1 ``` when only Chroma is UP. This disables the other `pgvector` provider in the run configuration. ## Test Plan Hard code `chromadb` as the vector io provider inside `test_vector_io.py` and run: ```bash LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -s -v tests/client-sdk/vector_io/ --embedding-model all-MiniLM-L6-v2 ```	2025-03-01 11:19:14 -08:00
ehhuang	21ec67356c	fix: RAG with documents (#1337 ) Summary: This was broken by https://github.com/meta-llama/llama-stack/pull/1015/files#r1975394190 Test Plan: added e2e test	2025-02-28 16:51:00 -08:00
ehhuang	2faee24873	chore: better raise (#1335 ) Summary: addresses https://github.com/meta-llama/llama-stack/pull/1282#discussion_r1972546802 Test Plan:	2025-02-28 16:41:20 -08:00
Xi Yan	15f69e75ff	fix: replace eval with json decoding for format_adapter (#1328 ) # What does this PR do? - using `eval` is a security risk [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan - see https://github.com/meta-llama/llama-stack/pull/1327 cc @SLR722 we will need to update the corresponding dataset via ```python def update_to_json_str(): dataset = datasets.load_dataset(...) processed_dataset = dataset[split].map( lambda x: { "column": json.dumps(eval(x["column"])) } ) processed_dataset.push_to_hub(...) ``` [//]: # (## Documentation)	2025-02-28 11:25:23 -08:00
Xi Yan	6520baebed	fix: replace eval with json decoding (#1327 ) # What does this PR do? - Using `eval` on server is a security risk - Replace `eval` with `json.loads` [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` pytest -v -s --nbval-lax ./llama-stack/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb ``` <img width="747" alt="image" src="https://github.com/user-attachments/assets/7aff3d95-0b12-4394-b9d0-aeff791eee38" /> [//]: # (## Documentation)	2025-02-28 11:10:45 -08:00
Sébastien Han	6fa257b475	chore(lint): update Ruff ignores for project conventions and maintainability (#1184 ) - Added new ignores from flake8-bugbear (`B007`, `B008`) - Ignored `C901` (high function complexity) for now, pending review - Maintained PyTorch conventions (`N812`, `N817`) - Allowed `E731` (lambda assignments) for flexibility - Consolidated existing ignores (`E402`, `E501`, `F405`, `C408`, `N812`) - Documented rationale for each ignored rule This keeps our linting aligned with project needs while tracking potential fixes. Signed-off-by: Sébastien Han <seb@redhat.com> Signed-off-by: Sébastien Han <seb@redhat.com>	2025-02-28 09:36:49 -08:00
Hardik Shah	8efa53daf1	fix: Agent telemetry inputs/outputs should be structured (#1302 ) Original telemetry outputs for agent turns look like this. Note: how output was a `str(message)` making it difficult to read them back for downstream tasks ( eg. building eval datasets ) ``` { │ │ 'input': [ │ │ │ '{"role":"system","content":"You are a helpful assistant. Use search tool to answer the questions. "}', │ │ │ '{"role":"user","content":"Which teams played in the NBA western conference finals of 2024","context":null}' │ │ ], │ │ 'output': "content: tool_calls: [ToolCall(call_id='8b7294ec-a83f-4798-ad8f-6bed662f08b6', tool_name=<BuiltinTool.brave_search: 'brave_search'>, arguments={'query': 'NBA Western Conference Finals 2024 teams'})]" │ }, ``` Updated the outputs to be structured . ## Test ```python import uuid from llama_stack_client.lib.agents.agent import Agent from llama_stack_client.lib.agents.event_logger import EventLogger from llama_stack_client.types.agent_create_params import AgentConfig model_id = "meta-llama/Llama-3.1-8B-Instruct" agent_config = AgentConfig( model=model_id, instructions="You are a helpful assistant who will use the web search tools to help with answering questions.\nOnly provide final answer in short without writing full sentences. Use web search", toolgroups=["builtin::websearch"], enable_session_persistence=True, ) agent = Agent(client, agent_config) session_id = agent.create_session(uuid.uuid4().hex) response = agent.create_turn( messages=[ { "role": "user", "content": "latest news about llama stack", } ], session_id=session_id, stream=False, ) pprint(response) ``` Output: ``` Turn( │ input_messages=[UserMessage(content='latest news about llama stack', role='user', context=None)], │ output_message=CompletionMessage( │ │ content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.", │ │ role='assistant', │ │ stop_reason='end_of_turn', │ │ tool_calls=[] │ ), │ session_id='77379546-4598-485a-b4f4-84e5da28c513', │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915243, tzinfo=TzInfo(-08:00)), │ steps=[ │ │ InferenceStep( │ │ │ api_model_response=CompletionMessage( │ │ │ │ content='', │ │ │ │ role='assistant', │ │ │ │ stop_reason='end_of_turn', │ │ │ │ tool_calls=[ │ │ │ │ │ ToolCall( │ │ │ │ │ │ arguments={'query': 'latest news llama stack'}, │ │ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b', │ │ │ │ │ │ tool_name='brave_search' │ │ │ │ │ ) │ │ │ │ ] │ │ │ ), │ │ │ step_id='81c16bd3-eb00-4721-8edc-f386e07391a3', │ │ │ step_type='inference', │ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 637149, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 43, 915831, tzinfo=TzInfo(-08:00)) │ │ ), │ │ ToolExecutionStep( │ │ │ step_id='4782d609-a62e-45f5-8d2a-25a43db46288', │ │ │ step_type='tool_execution', │ │ │ tool_calls=[ │ │ │ │ ToolCall( │ │ │ │ │ arguments={'query': 'latest news llama stack'}, │ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b', │ │ │ │ │ tool_name='brave_search' │ │ │ │ ) │ │ │ ], │ │ │ tool_responses=[ │ │ │ │ ToolResponse( │ │ │ │ │ call_id='84c0fa10-e24a-4f91-a9ff-415a9ec0bb0b', │ │ │ │ │ content='{"query": "latest news llama stack", "top_k": [{"title": "Llama 3.2: Revol. ....... Hacker News.", "score": 0.6186197, "raw_content": null}]}', │ │ │ │ │ tool_name='brave_search', │ │ │ │ │ metadata=None │ │ │ │ ) │ │ │ ], │ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 272176, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 44, 640743, tzinfo=TzInfo(-08:00)) │ │ ), │ │ InferenceStep( │ │ │ api_model_response=CompletionMessage( │ │ │ │ content="The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.", │ │ │ │ role='assistant', │ │ │ │ stop_reason='end_of_turn', │ │ │ │ tool_calls=[] │ │ │ ), │ │ │ step_id='37994419-5da3-4e84-a010-8d9b85366262', │ │ │ step_type='inference', │ │ │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 961275, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 11, 2, 46, 273168, tzinfo=TzInfo(-08:00)) │ │ ) │ ], │ turn_id='2c6b5273-4b16-404f-bed2-c0025fd63b45', │ completed_at=datetime.datetime(2025, 2, 27, 11, 2, 48, 962318, tzinfo=TzInfo(-08:00)), │ output_attachments=[] ) ``` ## Check for Telemetry ```python agent_logs = [] for span in client.telemetry.query_spans( attribute_filters=[ {"key": "session_id", "op": "eq", "value": session_id}, ], attributes_to_return=['input', 'output'], ): agent_logs.append(span.attributes) pprint(json.loads(agent_logs[-1]['output'])) ``` ``` { │ 'content': "The latest news about Llama Stack is that Meta has released Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices. Additionally, Llama Stack distributions have been released to simplify the way developers work with Llama models in different environments. However, a critical vulnerability has been discovered in Meta's Llama-Stack, which puts AI applications at risk.", │ 'tool_calls': [] } ```	2025-02-27 23:06:37 -08:00
Luis Tomas Bolivar	73c6f6126f	fix: Avoid unexpected keyword argument for sentence_transformers (#1269 ) Now that remote-vllm include inline::sentence_transformers there is an issue building the image: Error building stack: SentenceTransformersInferenceConfig.sample_run_config() got an unexpected keyword argument '__distro_dir__' To avoid that issue this fix extends the sample_run_config to accept extra kwargs	2025-02-27 16:47:26 -08:00
ehhuang	a34f3aafcf	fix: don't include tool args not in the function definition (#1307 ) # Summary: Right now we would include toolgroup args when we encode messages with tool_calls, which is confusing the model since they not in the function description (see test plan for example). # Test Plan: Add a print statement before raw prompt is sent to providers (no good way to test this currently) Before: ``` cated in the same neighborhood?<\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood", vector_db_ids=["829a68735d744dc3830409dcc782964a"])]<\|eot_id\|><\|start_header_id\|>ipython<\|end_header_id\|>\n\nknowledge_search tool found 5 chunks:\nBEGIN of ``` Note the extra `vector_db_ids` After ``` >user<\|end_header_id\|>\n\nAre the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?<\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|>\n\n[knowledge_search(query="Laleli Mosque and Esma Sultan Mansion same neighborhood")]<\|eot_id\|><\|start_header_id\|>ipython<\|end_header_id\|>\n\nknowledge_search tool found ```	2025-02-27 16:25:30 -08:00
Xi Yan	663c6b0537	fix: duplicate ToolResponseMessage in Turn message history (#1305 ) # What does this PR do? - Reproduce with: https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/e2e_loop_with_client_tools.py - Root cause: when we have ToolResponseMessage as part of Turn, we will create duplicate ToolResponseMessage in the conversation history when getting messages from a Turn. - Fix: avoid adding duplicate ToolResponseMessage from a turn's input_messages. - If it is part of a Turn's steps, only add it when processing the steps. - If it is not part of a Turn's steps, add it. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py --inference-model meta-llama/Llama-3.1-8B-Instruct ``` ``` python -m examples.agents.e2e_loop_with_client_tools localhost 8321 ``` ```python Turn( │ input_messages=[ │ │ UserMessage( │ │ │ content='What was the closing price of Google stock (ticker symbol GOOG) for 2023 ?', │ │ │ role='user', │ │ │ context=None │ │ ), │ │ ToolResponseMessage( │ │ │ call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94', │ │ │ content='"[{\\"(\'Year\', \'\')\\":2023,\\"(\'Close\', \'GOOG\')\\":140.4254302979}]"', │ │ │ role='tool', │ │ │ tool_name='get_ticker_data' │ │ ) │ ], │ output_message=CompletionMessage( │ │ content='Note: The actual closing price for 2023 may not be available or may be different from the result obtained above. The result is based on a hypothetical call to the get_ticker_data function.', │ │ role='assistant', │ │ stop_reason='end_of_turn', │ │ tool_calls=[] │ ), │ session_id='4c791107-f0d8-456e-a27f-aa2fdc72b871', │ started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 412928, tzinfo=TzInfo(-08:00)), │ steps=[ │ │ ShieldCallStep( │ │ │ step_id='e0514587-b7d6-4bba-8609-8e05a3a46d8a', │ │ │ step_type='shield_call', │ │ │ turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 858382, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 425204, tzinfo=TzInfo(-08:00)), │ │ │ violation=None │ │ ), │ │ InferenceStep( │ │ │ api_model_response=CompletionMessage( │ │ │ │ content='', │ │ │ │ role='assistant', │ │ │ │ stop_reason='end_of_turn', │ │ │ │ tool_calls=[ │ │ │ │ │ ToolCall( │ │ │ │ │ │ arguments={ │ │ │ │ │ │ │ 'ticker_symbol': 'GOOG', │ │ │ │ │ │ │ 'start': '2023-01-01', │ │ │ │ │ │ │ 'end': '2023-12-31' │ │ │ │ │ │ }, │ │ │ │ │ │ call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94', │ │ │ │ │ │ tool_name='get_ticker_data' │ │ │ │ │ ) │ │ │ │ ] │ │ │ ), │ │ │ step_id='a3ceec6a-f149-49d5-a1c2-db461e3f6e9f', │ │ │ step_type='inference', │ │ │ turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 910179, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 13, 59, 25, 871130, tzinfo=TzInfo(-08:00)) │ │ ), │ │ ShieldCallStep( │ │ │ step_id='f9339865-96ca-4425-af42-a87bab343e24', │ │ │ step_type='shield_call', │ │ │ turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 383013, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 944012, tzinfo=TzInfo(-08:00)), │ │ │ violation=None │ │ ), │ │ ToolExecutionStep( │ │ │ step_id='e317b74a-c4f3-4845-99a3-7d93aa6ea6c8', │ │ │ step_type='tool_execution', │ │ │ tool_calls=[ │ │ │ │ ToolCall( │ │ │ │ │ arguments={'ticker_symbol': 'GOOG', 'start': '2023-01-01', 'end': '2023-12-31'}, │ │ │ │ │ call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94', │ │ │ │ │ tool_name='get_ticker_data' │ │ │ │ ) │ │ │ ], │ │ │ tool_responses=[ │ │ │ │ ToolResponse( │ │ │ │ │ call_id='0d5f94fb-f070-4dc1-8eeb-63eb5918ec94', │ │ │ │ │ content='"[{\\"(\'Year\', \'\')\\":2023,\\"(\'Close\', \'GOOG\')\\":140.4254302979}]"', │ │ │ │ │ tool_name='get_ticker_data', │ │ │ │ │ metadata=None │ │ │ │ ) │ │ │ ], │ │ │ turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 718810, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 13, 59, 26, 943792, tzinfo=TzInfo(-08:00)) │ │ ), │ │ ShieldCallStep( │ │ │ step_id='c4236616-db89-4c04-ad04-f51cfb726385', │ │ │ step_type='shield_call', │ │ │ turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 958946, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 732680, tzinfo=TzInfo(-08:00)), │ │ │ violation=None │ │ ), │ │ InferenceStep( │ │ │ api_model_response=CompletionMessage( │ │ │ │ content='Note: The actual closing price for 2023 may not be available or may be different from the result obtained above. The result is based on a hypothetical call to the get_ticker_data function.', │ │ │ │ role='assistant', │ │ │ │ stop_reason='end_of_turn', │ │ │ │ tool_calls=[] │ │ │ ), │ │ │ step_id='3386f896-2026-41e4-a60f-f6f3c3981cf6', │ │ │ step_type='inference', │ │ │ turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 74750, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 13, 59, 28, 970724, tzinfo=TzInfo(-08:00)) │ │ ), │ │ ShieldCallStep( │ │ │ step_id='bc57ac8c-f94e-4758-bf1a-0dd734eca1cf', │ │ │ step_type='shield_call', │ │ │ turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 443016, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 86726, tzinfo=TzInfo(-08:00)), │ │ │ violation=None │ │ ) │ ], │ turn_id='6ed9c25a-a4fe-4b51-ae13-de248624c2fc', │ completed_at=datetime.datetime(2025, 2, 27, 13, 59, 37, 459456, tzinfo=TzInfo(-08:00)), │ output_attachments=[] ) ``` ```python Turn( │ input_messages=[ │ │ UserMessage(content='What is 40+30?', role='user', context=None), │ │ ToolResponseMessage( │ │ │ call_id='8e54aca9-244d-44ca-ada0-0365090e8622', │ │ │ content='{"success": true, "result": 70.0}', │ │ │ role='tool', │ │ │ tool_name='calculator' │ │ ) │ ], │ output_message=CompletionMessage( │ │ content='The result of the calculation is 70.', │ │ role='assistant', │ │ stop_reason='end_of_turn', │ │ tool_calls=[] │ ), │ session_id='4c791107-f0d8-456e-a27f-aa2fdc72b871', │ started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 156903, tzinfo=TzInfo(-08:00)), │ steps=[ │ │ ShieldCallStep( │ │ │ step_id='17b6b645-31cc-4be9-a758-a4f3b741ced9', │ │ │ step_type='shield_call', │ │ │ turn_id='4daff286-f703-417e-a5dc-0e158582bbec', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 780564, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 174515, tzinfo=TzInfo(-08:00)), │ │ │ violation=None │ │ ), │ │ InferenceStep( │ │ │ api_model_response=CompletionMessage( │ │ │ │ content='', │ │ │ │ role='assistant', │ │ │ │ stop_reason='end_of_turn', │ │ │ │ tool_calls=[ │ │ │ │ │ ToolCall( │ │ │ │ │ │ arguments={'x': 40.0, 'y': 30.0, 'operation': 'add'}, │ │ │ │ │ │ call_id='8e54aca9-244d-44ca-ada0-0365090e8622', │ │ │ │ │ │ tool_name='calculator' │ │ │ │ │ ) │ │ │ │ ] │ │ │ ), │ │ │ step_id='f59e951a-2b75-497d-a075-ec9aad9aad12', │ │ │ step_type='inference', │ │ │ turn_id='4daff286-f703-417e-a5dc-0e158582bbec', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 141869, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 14, 0, 0, 792047, tzinfo=TzInfo(-08:00)) │ │ ), │ │ ShieldCallStep( │ │ │ step_id='efafa0cf-23b9-4a90-8350-3a186d80925d', │ │ │ step_type='shield_call', │ │ │ turn_id='4daff286-f703-417e-a5dc-0e158582bbec', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 766293, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 177473, tzinfo=TzInfo(-08:00)), │ │ │ violation=None │ │ ), │ │ ToolExecutionStep( │ │ │ step_id='877cfbe7-57a8-4056-9c29-49aa38dd337c', │ │ │ step_type='tool_execution', │ │ │ tool_calls=[ │ │ │ │ ToolCall( │ │ │ │ │ arguments={'x': 40.0, 'y': 30.0, 'operation': 'add'}, │ │ │ │ │ call_id='8e54aca9-244d-44ca-ada0-0365090e8622', │ │ │ │ │ tool_name='calculator' │ │ │ │ ) │ │ │ ], │ │ │ tool_responses=[ │ │ │ │ ToolResponse( │ │ │ │ │ call_id='8e54aca9-244d-44ca-ada0-0365090e8622', │ │ │ │ │ content='{"success": true, "result": 70.0}', │ │ │ │ │ tool_name='calculator', │ │ │ │ │ metadata=None │ │ │ │ ) │ │ │ ], │ │ │ turn_id='4daff286-f703-417e-a5dc-0e158582bbec', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 930899, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 177202, tzinfo=TzInfo(-08:00)) │ │ ), │ │ ShieldCallStep( │ │ │ step_id='d47c6160-45d9-47c1-8e39-2faae65ee468', │ │ │ step_type='shield_call', │ │ │ turn_id='4daff286-f703-417e-a5dc-0e158582bbec', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 14, 0, 3, 510402, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 14, 0, 2, 949433, tzinfo=TzInfo(-08:00)), │ │ │ violation=None │ │ ), │ │ InferenceStep( │ │ │ api_model_response=CompletionMessage( │ │ │ │ content='The result of the calculation is 70.', │ │ │ │ role='assistant', │ │ │ │ stop_reason='end_of_turn', │ │ │ │ tool_calls=[] │ │ │ ), │ │ │ step_id='660ba1cc-770e-471c-bf6e-11e103d74443', │ │ │ step_type='inference', │ │ │ turn_id='4daff286-f703-417e-a5dc-0e158582bbec', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 14, 0, 4, 814944, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 14, 0, 3, 521309, tzinfo=TzInfo(-08:00)) │ │ ), │ │ ShieldCallStep( │ │ │ step_id='4dab8bb0-7d38-4465-ae1a-10069de2b3d1', │ │ │ step_type='shield_call', │ │ │ turn_id='4daff286-f703-417e-a5dc-0e158582bbec', │ │ │ completed_at=datetime.datetime(2025, 2, 27, 14, 0, 5, 428561, tzinfo=TzInfo(-08:00)), │ │ │ started_at=datetime.datetime(2025, 2, 27, 14, 0, 4, 825970, tzinfo=TzInfo(-08:00)), │ │ │ violation=None │ │ ) │ ], │ turn_id='4daff286-f703-417e-a5dc-0e158582bbec', │ completed_at=datetime.datetime(2025, 2, 27, 14, 0, 5, 462823, tzinfo=TzInfo(-08:00)), │ output_attachments=[] ) ``` [//]: # (## Documentation)	2025-02-27 15:06:47 -08:00
Xi Yan	564f0e5f93	fix: Revert "chore: remove vector_db_id from AgentSessionInfo" (#1299 ) Reverts meta-llama/llama-stack#1296 This change breaks test: `session_info.vector_db_id` is actually used ``` pytest -v tests/client-sdk/agents/test_agents.py::test_rag_and_code_agent --inference-model meta-llama/Llama-3.1-8B-Instruct ```	2025-02-27 10:37:15 -08:00
Xi Yan	200ef29233	chore: remove vector_db_id from AgentSessionInfo (#1296 ) # What does this PR do? - It is not being used anywhere and doesn't make sense to have 1 single vector_db_id in an agent session. No top level API change. - See https://github.com/meta-llama/llama-stack/pull/1286#discussion_r1972569881 [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan - See https://github.com/meta-llama/llama-stack/pull/1286#discussion_r1972569881 [//]: # (## Documentation)	2025-02-27 10:13:10 -08:00
Xi Yan	fc5aff3ccf	feat: ability to retrieve agents session, turn, step by ids (#1286 ) # What does this PR do? - Fix up rotten implementation for retrieving agent's Session, Turn, Step with actual working implementation. - Update `getting_started` notebook with retrieving by agent session_id. https://github.com/meta-llama/llama-stack/blob/export_agent_dataset/docs/getting_started.ipynb [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Test with script: https://gist.github.com/yanxi0830/657cecee8f1f0e39d322963d9c0f598e <img width="503" alt="image" src="https://github.com/user-attachments/assets/5ea9bc33-83d1-40bc-98e1-b68393158387" /> [//]: # (## Documentation)	2025-02-27 09:45:14 -08:00
ehhuang	0762c61402	feat: don't silently ignore incorrect toolgroup (#1285 )	2025-02-27 08:11:09 -05:00
Ihar Hrachyshka	2250ab7274	fix: don't attempt to clean gpu memory up when device is cpu (#1191 ) This is a follow up to: https://github.com/meta-llama/llama-stack/pull/1140 Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com> # What does this PR do? [Provide a short summary of what this PR does and why. Link to relevant issues if applicable.] Avoid unnecessary GPU memory clean attempt when the GPU is not used for training. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan With CPU: ``` INFO 2025-02-26 16:43:56,267 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth INFO 2025-02-26 16:43:56,274 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth model_file_path /Users/ihrachys/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0 ``` With CUDA: ``` INFO 2025-02-26 21:39:24,314 torchtune.utils._logging:121: Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth INFO 2025-02-26 21:39:24,333 torchtune.utils._logging:132: Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth model_file_path /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0 ``` [//]: # (## Documentation) Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-02-26 15:12:11 -08:00
ehhuang	270d64007a	fix: sqlite conn (#1282 ) # Summary: Our tests sometimes error out with ``` ========================== 11 passed, 342 warnings in 58.86s ========================== Error exporting span to SQLite: Cannot operate on a closed database. Fatal Python error: _enter_buffered_busy: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads Python runtime state: finalizing (tstate=0x000000012af04280) Current thread 0x00000001fa29c240 (most recent call first): <no Python frame> ``` Usually able to repro this by running 10 times. The proposed fix is to use threadsafe var for creating sqlite connection to ensure connection is only used by one thread. Not 100% if this is the fix, but am not able to repro with this. # Test Plan: Run 10 times and saw no more errors ``` for i in {1..10}; do echo "=== Starting Run $i ===" LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B if [[ $? -ne 0 ]]; then echo "=== Run $i FAILED with exit code $? ===" break else echo "=== Run $i PASSED ===" fi echo done ```	2025-02-26 14:44:31 -08:00
ehhuang	c8a20b8ed0	feat: allow specifying specific tool within toolgroup (#1239 ) Summary: E.g. `builtin::rag::knowledge_search` Test Plan: ``` LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/ --safety-shield meta-llama/Llama-Guard-3-8B ```	2025-02-26 14:07:05 -08:00

1 2 3 4 5 ...

274 commits