# What does this PR do?
This PR adds a new open eval benchmark, IfEval, based on the paper
https://arxiv.org/abs/2311.07911, to measure a model's instruction-following
capability.
## Test Plan
Spin up a Llama Stack server with the open-benchmark template.
Then run `llama-stack-client --endpoint xxx eval run-benchmark
"meta-reference-ifeval" --model-id "meta-llama/Llama-3.3-70B-Instruct"
--output-dir "/home/markchen1015/" --num-examples 20` on the client side and
check the aggregated eval results.
# What does this PR do?
DocVQA asks the model to look at a picture, then answer a question given in
text, producing a text answer based on the textual information in the picture.
These questions often require understanding the relative positions of text
within the picture.
The original dataset is defined as "Task 1" at
https://www.docvqa.org/datasets.
## Test Plan
Set up a Llama Stack server with:
```
llama stack run ./llama_stack/templates/open-benchmark/run.yaml
```
then send traffic:
```
llama-stack-client eval run-benchmark "meta-reference-docvqa" --model-id meta-llama/Llama-3.3-70B-Instruct --output-dir /tmp/gpqa --num-examples 200
```
These block on I/O reads, which in turn block the
server. Move them to their own thread.
Closes: #1697
# What does this PR do?
To avoid blocking the main event loop, this updates datasetio/localfs to load
data in a separate thread.
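Roughly, the pattern looks like this (a minimal sketch; the pandas-based CSV read is an illustrative assumption, not a statement about the actual provider code):
```python
import asyncio

import pandas as pd  # assumption: rows are loaded into a dataframe


async def load_rows(path: str) -> list[dict]:
    # Run the blocking file read in a worker thread so the event loop
    # stays free to serve other requests while the file is being read.
    df = await asyncio.to_thread(pd.read_csv, path)
    return df.to_dict(orient="records")
```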
Signed-off-by: Derek Higgins <derekh@redhat.com>
### What does this PR do?
Currently, `ToolCall.arguments` is a `Dict[str, RecursiveType]`.
However, on the client SDK side the `RecursiveType` gets deserialized into a
number (both `int` and `float` get collapsed), so when params are `int` they
get converted to `float`, which might break client-side tools that do type
checking.
Closes: https://github.com/meta-llama/llama-stack/issues/1683
### Test Plan
Stainless changes --
https://github.com/meta-llama/llama-stack-client-python/pull/204
```
pytest -s -v --stack-config=fireworks tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.1-8B-Instruct
```
# What does this PR do?
Fixes a bunch of violations.
Note: this patch touches all files except post_training.py, which will be
significantly changed by #1437, hence it is left out of the picture for now.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Testing with https://github.com/meta-llama/llama-stack/pull/1543
Also checked that GPU training works with the change:
```
INFO: ::1:53316 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 200 OK
INFO: ::1:53316 - "GET /v1/post-training/job/status?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK
INFO: ::1:53316 - "GET /v1/post-training/job/artifacts?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK
21:24:01.161 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (32526.75ms)
21:23:28.769 [DEBUG] Setting manual seed to local seed 3918872849. Local seed is seed + rank = 3918872849 + 0
21:23:28.996 [INFO] Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights.
21:23:29.933 [INFO] Memory stats after model init:
GPU peak memory allocation: 6.05 GiB
GPU peak memory reserved: 6.10 GiB
GPU peak memory active: 6.05 GiB
21:23:29.934 [INFO] Model is initialized with precision torch.bfloat16.
21:23:30.115 [INFO] Tokenizer is initialized.
21:23:30.118 [INFO] Optimizer is initialized.
21:23:30.119 [INFO] Loss is initialized.
21:23:30.896 [INFO] Dataset and Sampler are initialized.
21:23:30.898 [INFO] Learning rate scheduler is initialized.
21:23:31.618 [INFO] Memory stats after model init:
GPU peak memory allocation: 6.24 GiB
GPU peak memory reserved: 6.30 GiB
GPU peak memory active: 6.24 GiB
21:23:31.620 [INFO] Starting checkpoint save...
21:23:59.428 [INFO] Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
21:23:59.445 [INFO] Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth
```
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Made the code interpreter tool call async so that it is non-blocking.
## Test Plan
```
pytest -s -v tests/integration/agents/test_agents.py --stack-config=together --text-model=meta-llama/Llama-3.3-70B-Instruct
```
<img width="1693" alt="image"
src="https://github.com/user-attachments/assets/42520bb6-7acf-42d5-b71f-b35ca149d722"
/>
[//]: # (## Documentation)
Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
# What does this PR do?
Removed local execution option from the remote Qdrant provider and
introduced an explicit inline provider for the embedded execution.
Updated the ollama template to include this option: this part can be
reverted in case we don't want to have two default `vector_io`
providers.
(Closes #1082)
## Test Plan
Build and run an ollama distro:
```bash
llama stack build --template ollama --image-type conda
llama stack run --image-type conda ollama
```
Run one of the sample ingestion applications like
[rag_with_vector_db.py](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/rag_with_vector_db.py),
but replace this line:
```py
selected_vector_provider = vector_providers[0]
```
with the following, to use the `qdrant` provider:
```py
selected_vector_provider = vector_providers[1]
```
After running the test code, verify the timestamp of the Qdrant store:
```bash
% ls -ltr ~/.llama/distributions/ollama/qdrant.db/collection/test_vector_db_*
total 784
-rw-r--r--@ 1 dmartino staff 401408 Feb 26 10:07 storage.sqlite
```
[//]: # (## Documentation)
---------
Signed-off-by: Daniele Martinoli <dmartino@redhat.com>
Co-authored-by: Francisco Arceo <farceo@redhat.com>
# What does this PR do?
Support NVIDIA-hosted Llama 3.2 11B/90B vision models. They are not hosted on
the common https://integrate.api.nvidia.com/v1; they are hosted on their own
individual URLs.
## Test Plan
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -s -v tests/client-sdk/inference/test_vision_inference.py --inference-model=meta/llama-3.2-11b-vision-instruct -k image
```
# What does this PR do?
Add the option to not verify SSL certificates for the remote-vllm
provider. This allows the Llama Stack server to talk to remote LLMs that
have self-signed certificates.
Partially addresses #1545
# Summary:
Includes fixes to get test_agents working with an OpenAI model, e.g. tool
parsing and message conversion.
# Test Plan:
```
LLAMA_STACK_CONFIG=dev pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model openai/gpt-4o-mini
```
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1550).
* #1556
* __->__ #1550
# What does this PR do?
The current passthrough impl returns chatcompletion_message.content as a
TextItem(), not a plain string, so it's not compatible with other providers
and causes parsing errors downstream.
This changes away from the generic Pydantic conversion and explicitly parses
out content.text.
## Test Plan
Set up a Llama Stack server with the passthrough provider:
```
llama-stack-client eval run-benchmark "MMMU_Pro_standard" --model-id meta-llama/Llama-3-8B --output-dir /tmp/ --num-examples 20
```
Works without the parsing error.
# What does this PR do?
Create a new dataset, BFCL_v3, from
https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html.
Each question asks the model to perform a task described in natural language,
and a set of available functions and their schemas is given for the model to
choose from. The model is required to write the function call, including
function name and parameters, to achieve the stated purpose. The results are
validated against the provided ground truth to make sure that the generated
function call and the ground-truth function call are syntactically and
semantically equivalent, by checking their ASTs.
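For illustration only, a toy sketch of the syntactic half of such a check (not the actual BFCL or llama-stack scorer):
```python
import ast


def syntactically_equivalent(generated: str, expected: str) -> bool:
    """Toy check: parse both function-call strings and compare their ASTs."""
    return ast.dump(ast.parse(generated, mode="eval")) == ast.dump(
        ast.parse(expected, mode="eval")
    )


print(syntactically_equivalent(
    "get_weather(city='SF', unit='celsius')",
    "get_weather(city='SF', unit='celsius')",
))  # True
```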
## Test Plan
Start the server with:
```
llama stack run ./llama_stack/templates/ollama/run.yaml
```
then send traffic
```
llama-stack-client eval run-benchmark "bfcl" --model-id meta-llama/Llama-3.2-3B-Instruct --output-dir /tmp/gpqa --num-examples 2
```
[//]: # (## Documentation)
# What does this PR do?
Updated all instances of datetime.now() to use timezone.utc for
consistency in handling time across different systems. This ensures that
timestamps are always in Coordinated Universal Time (UTC), avoiding
issues with time zone discrepancies and promoting uniformity in
time-related data.
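For reference, the difference between the two calls:
```python
from datetime import datetime, timezone

naive = datetime.now()              # local wall-clock time, tzinfo is None
aware = datetime.now(timezone.utc)  # timezone-aware UTC timestamp

print(naive.tzinfo)       # None
print(aware.isoformat())  # e.g. 2025-03-11T16:47:38.549084+00:00
```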
Signed-off-by: Sébastien Han <seb@redhat.com>
Summary:
This is not used anywhere.
Closes #1421
Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct --record-responses
```
Summary:
1. Adds an option to not use bwrap for code execution.
2. Disables bwrap when running tests on Macs.
Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct
```
Verify the code_interpreter result in the logs:
```
INFO 2025-03-11 08:10:39,858 llama_stack.providers.inline.agents.meta_reference.agent_instance:1032 agents: tool call code_interpreter completed with result:
content='completed\n\n541\n' error_message=None error_code=None metadata=None
```
Summary:
Refactoring only.
Centralize the logic to preprocess toolgroups in one place.
Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/api/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B
```
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1381).
* #1384
* __->__ #1381
# What does this PR do?
This PR adds back the changes in #1300, which were reverted in #1476.
It also adds logic to preserve context variables across the asyncio boundary.
This is needed with the library client, since the async generator logic yields
control to code outside the event loop and, on resuming, does not have the
same context as before; this requires preserving the context vars.
Addresses #1477
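A minimal illustration of the mechanism with plain `contextvars` (the variable below is a stand-in, not the actual trace context used in llama-stack):
```python
import asyncio
import contextvars

# Stand-in for the trace/provider context this PR preserves.
CURRENT_TRACE = contextvars.ContextVar("current_trace", default=None)


async def start_span():
    # Runs in its own task, i.e. its own copy of the context.
    CURRENT_TRACE.set("trace-123")
    return contextvars.copy_context()


async def main():
    ctx = await asyncio.create_task(start_span())
    print(CURRENT_TRACE.get())         # None -- the task's set() is not visible here
    print(ctx.run(CURRENT_TRACE.get))  # trace-123 -- resuming inside the captured context


asyncio.run(main())
```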
## Test Plan
```
curl --request POST \
--url http://localhost:8321/v1/inference/chat-completion \
--header 'content-type: application/json' \
--data '{
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"messages": [
{
"role": "user",
"content": {
"type": "text",
"text": "where do humans live"
}
}
],
"stream": false
}' | jq .
{
"metrics": [
{
"trace_id": "kCZwO3tyQC-FuAGb",
"span_id": "bsP_5a5O",
"timestamp": "2025-03-11T16:47:38.549084Z",
"attributes": {
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"provider_id": "fireworks"
},
"type": "metric",
"metric": "prompt_tokens",
"value": 10,
"unit": "tokens"
},
{
"trace_id": "kCZwO3tyQC-FuAGb",
"span_id": "bsP_5a5O",
"timestamp": "2025-03-11T16:47:38.549449Z",
"attributes": {
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"provider_id": "fireworks"
},
"type": "metric",
"metric": "completion_tokens",
"value": 369,
"unit": "tokens"
},
{
"trace_id": "kCZwO3tyQC-FuAGb",
"span_id": "bsP_5a5O",
"timestamp": "2025-03-11T16:47:38.549457Z",
"attributes": {
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"provider_id": "fireworks"
},
"type": "metric",
"metric": "total_tokens",
"value": 379,
"unit": "tokens"
}
],
"completion_message": {
"role": "assistant",
"content": "Humans live on the planet Earth, specifically on its landmasses and in its oceans. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica ( temporary residents, mostly scientists and researchers)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near coastlines, rivers, or other bodies of water.\n4. **Rural areas:** Some humans live in rural areas, such as villages, farms, and countryside.\n5. **Islands:** Humans inhabit many islands around the world, including those in the Pacific, Indian, and Atlantic Oceans.\n6. **Mountains and highlands:** Humans live in mountainous regions, such as the Himalayas, the Andes, and the Rocky Mountains.\n7. **Deserts:** Some humans live in desert regions, such as the Sahara, the Mojave, and the Atacama.\n8. **Coastal areas:** Many humans live in coastal areas, such as beaches, ports, and coastal cities.\n9. **Underwater habitats:** A few humans live in underwater habitats, such as research stations and submarines.\n10. **Space:** A small number of humans have lived in space, including astronauts on the International Space Station and those who have visited the Moon.\n\nOverall, humans can be found living in almost every environment on Earth, from the frozen tundra to the hottest deserts, and from the highest mountains to the deepest oceans.",
"stop_reason": "end_of_turn",
"tool_calls": []
},
"logprobs": null
}
```
Original repro no longer shows any error:
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
```
client logs:
https://gist.github.com/dineshyv/047c7e87b18a5792aa660e311ea53166
server logs:
https://gist.github.com/dineshyv/97a2174099619e9916c7c490be26e559
## What does this PR do?
We noticed that the passthrough inference provider doesn't work with agents
due to a type mismatch between client and server. We manually cast the
Llama Stack client type to the Llama Stack server type to fix the issue.
## test
run `python -m examples.agents.hello localhost 8321` within
llama-stack-apps
<img width="1073" alt="Screenshot 2025-03-11 at 8 43 44 PM"
src="https://github.com/user-attachments/assets/bd1bdd31-606a-420c-a249-95f6184cc0b1"
/>
fix https://github.com/meta-llama/llama-stack/issues/1560
## What does this PR do?
As the title says, add codegen for the open-benchmark template.
## test
Checked the newly generated run.yaml file; it's identical before and after
the change.
Also added a small improvement to the together template so that a missing
TOGETHER_API_KEY won't crash the server, which is consistent with the user
experience of other remote providers.
Bug https://github.com/meta-llama/llama-stack/issues/1357
# What does this PR do?
Fix a bug of a wrong file name in the inline::localfs datasetio provider.
[//]: # (If resolving an issue, uncomment and update the line below)
Closes #1357
## Test Plan
[//]: # (## Documentation)
Signed-off-by: Josh Salomon <jsalomon@redhat.com>
# What does this PR do?
- The recent merge https://github.com/meta-llama/llama-stack/pull/1410
introduced this error:
```
ValueError: Provider meta-reference (Api.agents) does not implement the following methods:
[('list_agent_sessions', 'not_actually_implemented'), ('list_agents', 'not_actually_implemented')]
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
llama stack run
```
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.3-70B-Instruct
```
1379530386
[//]: # (## Documentation)
# What does this PR do?
Remove Llama-3.2-1B-Instruct for fireworks, as it no longer appears to
be hosted on the website.
## Test Plan
python distro_codegen.py
# What does this PR do?
It's a dict that may contain different types, as per
resolver:instantiate_provider implementation. (AFAIU it also never
contains ProviderSpecs, but *instances* of provider implementations.)
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
mypy passes if checks are enabled for these modules. (See #1543)
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Added missing shutdown handler. (Currently empty.)
Without it, when the server shuts down, it posts the following warning:
```
__main__:129 server: No shutdown method for TorchtunePostTrainingImpl
```
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
(The test plan assumes shutdown logic is fixed, see #1495)
Without the patch:
```
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO 2025-03-10 20:56:43,961 __main__:140 server: Shutting down
INFO 2025-03-10 20:56:43,962 __main__:124 server: Shutting down DatasetsRoutingTable
INFO 2025-03-10 20:56:43,964 __main__:124 server: Shutting down DatasetIORouter
INFO 2025-03-10 20:56:43,965 __main__:124 server: Shutting down ScoringFunctionsRoutingTable
INFO 2025-03-10 20:56:43,966 __main__:124 server: Shutting down ScoringRouter
INFO 2025-03-10 20:56:43,967 __main__:124 server: Shutting down ModelsRoutingTable
INFO 2025-03-10 20:56:43,968 __main__:124 server: Shutting down InferenceRouter
INFO 2025-03-10 20:56:43,969 __main__:124 server: Shutting down ShieldsRoutingTable
INFO 2025-03-10 20:56:43,971 __main__:124 server: Shutting down SafetyRouter
INFO 2025-03-10 20:56:43,972 __main__:124 server: Shutting down VectorDBsRoutingTable
INFO 2025-03-10 20:56:43,973 __main__:124 server: Shutting down VectorIORouter
INFO 2025-03-10 20:56:43,974 __main__:124 server: Shutting down ToolGroupsRoutingTable
INFO 2025-03-10 20:56:43,975 __main__:124 server: Shutting down ToolRuntimeRouter
INFO 2025-03-10 20:56:43,976 __main__:124 server: Shutting down MetaReferenceAgentsImpl
INFO 2025-03-10 20:56:43,977 __main__:124 server: Shutting down TelemetryAdapter
INFO 2025-03-10 20:56:43,978 __main__:124 server: Shutting down TorchtunePostTrainingImpl
WARNING 2025-03-10 20:56:43,979 __main__:129 server: No shutdown method for TorchtunePostTrainingImpl
INFO 2025-03-10 20:56:43,979 __main__:124 server: Shutting down BenchmarksRoutingTable
INFO 2025-03-10 20:56:43,980 __main__:124 server: Shutting down EvalRouter
INFO 2025-03-10 20:56:43,981 __main__:124 server: Shutting down DistributionInspectImpl
INFO: Application shutdown complete.
INFO: Finished server process [33862]
```
Run with the patch and observe no warning:
```
$ kill -INT $(ps ax | grep llama_stack.distribution.server.server | grep -v nvim | awk -e '{print $1}' | sort | head -n 1)
```
```
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO 2025-03-11 00:32:56,863 __main__:140 server: Shutting down
INFO 2025-03-11 00:32:56,864 __main__:124 server: Shutting down DatasetsRoutingTable
INFO 2025-03-11 00:32:56,866 __main__:124 server: Shutting down DatasetIORouter
INFO 2025-03-11 00:32:56,867 __main__:124 server: Shutting down ScoringFunctionsRoutingTable
INFO 2025-03-11 00:32:56,868 __main__:124 server: Shutting down ScoringRouter
INFO 2025-03-11 00:32:56,869 __main__:124 server: Shutting down ModelsRoutingTable
INFO 2025-03-11 00:32:56,870 __main__:124 server: Shutting down InferenceRouter
INFO 2025-03-11 00:32:56,871 __main__:124 server: Shutting down ShieldsRoutingTable
INFO 2025-03-11 00:32:56,872 __main__:124 server: Shutting down SafetyRouter
INFO 2025-03-11 00:32:56,873 __main__:124 server: Shutting down VectorDBsRoutingTable
INFO 2025-03-11 00:32:56,874 __main__:124 server: Shutting down VectorIORouter
INFO 2025-03-11 00:32:56,875 __main__:124 server: Shutting down ToolGroupsRoutingTable
INFO 2025-03-11 00:32:56,876 __main__:124 server: Shutting down ToolRuntimeRouter
INFO 2025-03-11 00:32:56,877 __main__:124 server: Shutting down MetaReferenceAgentsImpl
INFO 2025-03-11 00:32:56,878 __main__:124 server: Shutting down TelemetryAdapter
INFO 2025-03-11 00:32:56,879 __main__:124 server: Shutting down TorchtunePostTrainingImpl
INFO 2025-03-11 00:32:56,880 __main__:124 server: Shutting down BenchmarksRoutingTable
INFO 2025-03-11 00:32:56,881 __main__:124 server: Shutting down EvalRouter
INFO 2025-03-11 00:32:56,882 __main__:124 server: Shutting down DistributionInspectImpl
```
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
This PR has two fixes needed for correct trace context propagation
across the asyncio boundary.
Fix 1: Start using context vars to store the global trace context.
This is needed since we cannot use the same trace context across
coroutines, because the state is shared; each coroutine should have its own
trace context so that each can store its state correctly (see the sketch
below).
Fix 2: Start a new span for each new coroutine started for running
shields, to keep the span tree clean.
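A minimal sketch of the idea behind Fix 1, using plain `contextvars` (the actual tracing code differs):
```python
import asyncio
import contextvars

TRACE_ID = contextvars.ContextVar("trace_id", default=None)


async def handle(request_id: str) -> str:
    # Each task below gets its own copy of the context, so this set()
    # is invisible to the other concurrently running coroutines.
    TRACE_ID.set(f"trace-{request_id}")
    await asyncio.sleep(0)  # yield to the event loop; tasks interleave
    return f"{request_id} -> {TRACE_ID.get()}"


async def main():
    print(await asyncio.gather(*(handle(r) for r in ("a", "b", "c"))))


asyncio.run(main())
```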
## Test Plan
### Integration tests with server
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct
```
server logs:
https://gist.github.com/dineshyv/51ac5d9864ed031d0d89ce77352821fe
test logs:
https://gist.github.com/dineshyv/e66acc1c4648a42f1854600609c467f3
### Integration tests with library client
```
LLAMA_STACK_CONFIG=fireworks pytest -s --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct
```
logs: https://gist.github.com/dineshyv/ca160696a0b167223378673fb1dcefb8
### Apps test with server:
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
```
server logs:
https://gist.github.com/dineshyv/1717a572d8f7c14279c36123b79c5797
app logs:
https://gist.github.com/dineshyv/44167e9f57806a0ba3b710c32aec02f8
## What does this PR do?
Created a new math_500 open-benchmark based on OpenAI's [Let's Verify
Step by Step](https://arxiv.org/abs/2305.20050) paper and Hugging Face's
[HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)
dataset.
The challenging part of this benchmark is parsing the generated and expected
answers and verifying whether they are the same. For the parsing part, we
refer to [Minerva: Solving Quantitative Reasoning Problems with Language
Models](https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/).
To simplify the parsing logic, as a next step we plan to also follow what
[simple-evals](https://github.com/openai/simple-evals) does, using an LLM as
judge to check whether the generated answer matches the expected answer.
## Test Plan
On the server side, spin up a server with the open-benchmark template: `llama
stack run llama_stack/templates/open-benchmark/run.yaml`.
On the client side, issue an open benchmark eval request: `llama-stack-client
--endpoint xxx eval run-benchmark "meta-reference-math-500" --model-id
"meta-llama/Llama-3.3-70B-Instruct" --output-dir "/home/markchen1015/"
--num-examples 20` and check the aggregated eval results.
<img width="238" alt="Screenshot 2025-03-10 at 7 57 04 PM"
src="https://github.com/user-attachments/assets/2c9da042-3b70-470e-a7c4-69f4cc24d1fb"
/>
Check the generated answers and the related scoring; they make sense.
This is unfortunate because `sqlite-vec` seems promising, but its PyPI
package is not quite complete. It does not have a binary for arm64 (I
think, or maybe it even lacks 64-bit builds?), which results in the arm64
container failing with:
```
File "/usr/local/lib/python3.10/site-packages/sqlite_vec/init.py", line 17, in load
conn.load_extension(loadable_path())
sqlite3.OperationalError: /usr/local/lib/python3.10/site-packages/sqlite_vec/vec0.so: wrong ELF class: ELFCLASS32
```
To get around this, I tried to install from source via `uv pip install
sqlite-vec --no-binary=sqlite-vec`; however, it even lacks a source
distribution, which makes that impossible.
## Test Plan
Build the container locally using:
```bash
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type container
```
Run the container as:
```
podman run --privileged -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.containers.internal:11434 \
-v ~/local/llama-stack:/app/llama-stack-source \
localhost/distribution-ollama:dev --port $LLAMA_STACK_PORT
```
Verify the container starts up correctly. Without this patch, it would
encounter the ELFCLASS32 error.
# What does this PR do?
Uses together async client instead of sync client
[//]: # (If resolving an issue, uncomment and update the line below)
## Test Plan
The command to run the tests is in the image below (2 tests fail, and they
were failing for the old stable version as well, with the same errors).
<img width="1689" alt="image"
src="https://github.com/user-attachments/assets/503db720-5379-425d-9844-0225010e41a1"
/>
[//]: # (## Documentation)
---------
Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
# What does this PR do?
This PR converts blocking calls for built-in tools like Wolfram, Brave,
Tavily, and Bing into non-blocking async calls.
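Roughly, the pattern looks like this (the endpoint and parameters below are placeholders, not the actual tool implementations):
```python
import httpx


async def web_search(query: str, api_key: str) -> dict:
    # An async HTTP client keeps the event loop free while the request is in
    # flight, unlike a blocking requests.get(...) call.
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get(
            "https://api.search.example/v1/search",  # placeholder endpoint
            params={"q": query},
            headers={"Authorization": f"Bearer {api_key}"},
        )
        response.raise_for_status()
        return response.json()
```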
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
pytest -s -v tool_runtime/test_builtin_tools.py --stack-config=together --text-model=meta-llama/Llama-3.1-8B-Instruct
```
Used the command above to get the results below:
<img width="1710" alt="image"
src="https://github.com/user-attachments/assets/76b0ca06-f6e4-45fa-a114-0449bef2325b"
/>
<img width="1389" alt="image"
src="https://github.com/user-attachments/assets/5220ccbb-7882-4240-b17e-f362ad46d25b"
/>
<img width="1432" alt="image"
src="https://github.com/user-attachments/assets/bb93a41e-e82a-4c98-a22d-6b0e320aa974"
/>
[//]: # (## Documentation)
---------
Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
Concurrent requests should not trample (or reuse) each other's provider
data. Provider data should be scoped to each request.
## Test Plan
Set the uvicorn server to have a single worker process + thread by
updating the config:
```python
uvicorn_config = {
...
"workers": 1,
"loop": "asyncio",
}
```
Then perform the following steps on `origin/main` (without this change).
(1) Run the server using `llama stack run dev` without having
`FIREWORKS_API_KEY` in the environment.
(2) Run a test by specifying the FIREWORKS_API_KEY env var so it gets
stored in the thread local
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 \
--text-model accounts/fireworks/models/llama-v3p1-8b-instruct \
-k test_text_chat_completion_with_tool_calling_and_streaming \
--env FIREWORKS_API_KEY=<...>
```
Ensure you don't have any other API keys in the environment (otherwise
the bug will not reproduce due to other specifics in our testing code.)
Verify this works.
(3) Run the same command again without specifying FIREWORKS_API_KEY. See
that the request actually succeeds when it *should have failed*.
----
Now do the same tests on this branch, verify step (3) results in
failure.
Finally, run the full `test_text_inference.py` test suite with this
change, verify it succeeds.
Summary:
```
| File "/Users/erichuang/projects/llama-stack/llama_stack/distribution/server/server.py", line 213, in sse_generator
|   logger.exception(f"Error in sse_generator: {e}")
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1864, in exception
|   self.log(ERROR, msg, *args, exc_info=exc_info, **kwargs)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1879, in log
|   self.logger.log(level, msg, *args, **kwargs)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1547, in log
|   self._log(level, msg, args, **kwargs)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1624, in _log
|   self.handle(record)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1634, in handle
|   self.callHandlers(record)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
|   hdlr.handle(record)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py", line 968, in handle
|   self.emit(record)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py", line 167, in emit
|   message_renderable = self.render_message(record, message)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py", line 193, in render_message
|   message_text = Text.from_markup(message) if use_markup else Text(message)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py", line 287, in from_markup
|   rendered_text = render(text, style, emoji=emoji, emoji_variant=emoji_variant)
| File "/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py", line 167, in render
|   raise MarkupError(
| rich.errors.MarkupError: closing tag '[/INST]' at position 105 doesn't match any open tag
```
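One way to avoid this class of failure (not necessarily the exact fix applied here) is to escape interpolated values before they reach a markup-enabled Rich handler:
```python
import logging

from rich.logging import RichHandler
from rich.markup import escape

logging.basicConfig(level="INFO", handlers=[RichHandler(markup=True)])
logger = logging.getLogger(__name__)

result = "completed [INST] ... [/INST]"  # model output containing markup-like tags
# escape() prevents Rich from treating "[/INST]" as a closing markup tag.
logger.info(f"Error in sse_generator: {escape(result)}")
```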
Test Plan:
reran failing rag_with_vector_db example
# What does this PR do?
This PR updates the inline vLLM inference provider in several
significant ways:
* Models are now attached at run time to instances of the provider via
the `.../models` API instead of hard-coding the model's full name into
the provider's YAML configuration.
* The provider supports models that are not Meta Llama models. Any model
that vLLM supports can be loaded by passing Huggingface coordinates in
the "provider_model_id" field. Custom fine-tuned versions of Meta Llama
models can be loaded by specifying a path on local disk in the
"provider_model_id".
* To implement full chat completions support, including tool calling and
constrained decoding, the provider now routes the `chat_completions` API
to a captive (i.e. called directly in-process, not via HTTPS) instance
of vLLM's OpenAI-compatible server.
* The `logprobs` parameter and completions API are also working.
## Test Plan
Existing tests in
`llama_stack/providers/tests/inference/test_text_inference.py` have good
coverage of the new functionality. These tests can be invoked as
follows:
```
cd llama-stack && pytest \
-vvv \
llama_stack/providers/tests/inference/test_text_inference.py \
--providers inference=vllm \
--inference-model meta-llama/Llama-3.2-3B-Instruct
====================================== test session starts ======================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%]
=========================== 9 passed, 13 warnings in 97.18s (0:01:37) ===========================
```
## Before submitting
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Summary:
error:
```
llama_stack/providers/inline/agents/meta_reference/agent_instance.py:1032: in execute_tool_call_maybe
    logger.info(f"tool call {name} completed with result: {result}")
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1841: in info
    self.log(INFO, msg, *args, **kwargs)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1879: in log
    self.logger.log(level, msg, *args, **kwargs)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1547: in log
    self._log(level, msg, args, **kwargs)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1624: in _log
    self.handle(record)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1634: in handle
    self.callHandlers(record)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1696: in callHandlers
    hdlr.handle(record)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:968: in handle
    self.emit(record)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:167: in emit
    message_renderable = self.render_message(record, message)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:193: in render_message
    message_text = Text.from_markup(message) if use_markup else Text(message)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py:287: in from_markup
    rendered_text = render(text, style, emoji=emoji, emoji_variant=emoji_variant)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py:167: in render
    raise MarkupError(
E   rich.errors.MarkupError: closing tag '[/INST]' at position 3274 doesn't match any open tag
```
Test Plan:
# What does this PR do?
This switches from an OpenAI client to the AsyncOpenAI client in the
remote vllm provider. The main benefit of this is that instead of each
client call being a blocking operation that was blocking our server
event loop, the client calls are now async operations that do not block
the event loop.
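The shape of the change, roughly (the URL and model below are placeholders):
```python
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="fake")


async def complete(prompt: str) -> str:
    # Awaiting the call yields control back to the event loop instead of
    # blocking it for the duration of the HTTP request.
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```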
The actual fix is quite simple and straightforward. Creating a reliable
reproducer of this with a unit test that verifies we were blocking the
event loop before and are not blocking it any longer was a bit harder.
Some other inference providers have this same issue, so we may want to
make that simple delayed http server a bit more generic and pull it into
a common place as other inference providers get fixed.
(Closes #1457)
## Test Plan
I verified the unit tests and test_text_inference tests pass with this
change like below:
```
python -m pytest -v tests/unit
```
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v -s \
tests/integration/inference/test_text_inference.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This commit introduces a new logging system that allows loggers to be
assigned
a category while retaining the logger name based on the file name. The
log
format includes both the logger name and the category, producing output
like:
```
INFO 2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by
tavily-search
```
Key features include:
- Category-based logging: Loggers can be assigned a category (e.g.,
"core", "server") when they are created. A logger can be created like
this: `logger = get_logger(name=__name__, category="server")`
- Environment variable control: Log levels can be configured
per-category using the
`LLAMA_STACK_LOGGING` environment variable. For example:
`LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for
the "server"
and "core" categories.
- `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all
categories and
third-party libraries.
This provides fine-grained control over logging levels while maintaining
a clean and
informative log format.
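A usage sketch (the import path here is an assumption and may differ from the actual module):
```python
# Assumed import path; the real module may live elsewhere in the tree.
from llama_stack.log import get_logger

logger = get_logger(name=__name__, category="server")
logger.info("Listening on %s:%d", "::", 8321)
```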
The formatter uses the rich library, which provides nice colors and better
stack traces, like so:
```
ERROR 2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown
task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at
/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146>
exception=UnboundLocalError("local variable 'loop' referenced before assignment")>
╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮
│ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown │
│ │
│ 175 │ │ except asyncio.CancelledError: │
│ 176 │ │ │ pass │
│ 177 │ │ finally: │
│ ❱ 178 │ │ │ loop.stop() │
│ 179 │ │
│ 180 │ loop = asyncio.get_running_loop() │
│ 181 │ loop.create_task(shutdown()) │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
UnboundLocalError: local variable 'loop' referenced before assignment
```
Co-authored-by: Ashwin Bharambe <@ashwinb>
Signed-off-by: Sébastien Han <seb@redhat.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml
INFO 2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml
INFO 2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration:
INFO 2025-03-03 21:55:35,928 __main__:380 [server]: apis:
- agents
```
[//]: # (## Documentation)
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
See https://github.com/meta-llama/llama-stack/pull/1171 which is the
original PR. Author: @zc277584121
feat: add [Milvus](https://milvus.io/) vectorDB
Note: I use MilvusClient to implement it instead of AsyncMilvusClient,
because when I tested AsyncMilvusClient it raised issues about the event
loop; I think the AsyncMilvusClient SDK is not robust enough to be
compatible with the llama_stack framework.
## Test Plan
Passed the unit tests and end-to-end tests.
Here are my end-to-end test logs, including the client code, client log, and
server logs from both inline and remote settings:
[test_end2end_logs.zip](https://github.com/user-attachments/files/18964391/test_end2end_logs.zip)
---------
Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Cheney Zhang <chen.zhang@zilliz.com>
# What does this PR do?
Fix import errors due to `chardet` and `pypdf` not being installed when
imported from `url_utils.py`.
Closes #1432
## Test Plan
Now able to run the server with the config.
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling SamplingParams() directly in function argument defaults.
Instead, it either uses Field(default_factory=SamplingParams) for
Pydantic models or sets the default to None and instantiates
SamplingParams inside the function body when the argument is None.
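The two patterns, sketched (the field below is illustrative, not the actual SamplingParams definition):
```python
from typing import Optional

from pydantic import BaseModel, Field


class SamplingParams(BaseModel):
    temperature: float = 0.7  # illustrative field


# Pydantic model: use default_factory instead of calling SamplingParams() inline.
class ChatCompletionRequest(BaseModel):
    sampling_params: SamplingParams = Field(default_factory=SamplingParams)


# Plain function: default to None and instantiate inside the body (avoids B008).
def chat_completion(sampling_params: Optional[SamplingParams] = None) -> None:
    # Instantiate the default lazily so it is not evaluated at import time.
    sampling_params = sampling_params or SamplingParams()
```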
Signed-off-by: Sébastien Han <seb@redhat.com>
You now run the integration tests with these options:
```bash
Custom options:
--stack-config=STACK_CONFIG
a 'pointer' to the stack. this can be either be:
(a) a template name like `fireworks`, or
(b) a path to a run.yaml file, or
(c) an adhoc config spec, e.g.
`inference=fireworks,safety=llama-guard,agents=meta-
reference`
--env=ENV Set environment variables, e.g. --env KEY=value
--text-model=TEXT_MODEL
comma-separated list of text models. Fixture name:
text_model_id
--vision-model=VISION_MODEL
comma-separated list of vision models. Fixture name:
vision_model_id
--embedding-model=EMBEDDING_MODEL
comma-separated list of embedding models. Fixture name:
embedding_model_id
--safety-shield=SAFETY_SHIELD
comma-separated list of safety shields. Fixture name:
shield_id
--judge-model=JUDGE_MODEL
comma-separated list of judge models. Fixture name:
judge_model_id
--embedding-dimension=EMBEDDING_DIMENSION
Output dimensionality of the embedding model to use for
testing. Default: 384
--record-responses Record new API responses instead of using cached ones.
--report=REPORT Path where the test report should be written, e.g.
--report=/path/to/report.md
```
Importantly, if you don't specify any of the models (text-model,
vision-model, etc.) the relevant tests will get **skipped!**
This will make running tests somewhat more annoying since all options
will need to be specified. We will make this easier by adding some easy
wrapper yaml configs.
## Test Plan
Example:
```bash
ashwin@ashwin-mbp ~/local/llama-stack/tests/integration (unify_tests) $
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/test_text_inference.py \
--text-model meta-llama/Llama-3.2-3B-Instruct
```