llama-stack/llama_stack/providers/registry
Xi Yan 0784284ab5
[Agentic Eval] add ability to run agents generation (#469)
# What does this PR do?

- add the ability to run agent generation for full eval (generate +
scoring)
- pre-register the SimpleQA benchmark's llm-as-judge scoring function in code
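
The generate-then-score flow described above can be sketched as follows. All names here (`run_agent_generation`, `score_rows`, the stub agent and judge) are illustrative only, not the actual llama-stack API:

```python
# Hypothetical sketch of the generate + score eval flow this PR enables.
# Function and field names are illustrative, not llama-stack's real API.

def run_agent_generation(agent, rows):
    """Run the agent candidate over each input row to produce an answer."""
    return [{**row, "generated_answer": agent(row["input_query"])} for row in rows]

def score_rows(rows, judge):
    """Apply a (stand-in) llm-as-judge scoring function and average the scores."""
    scores = [judge(r["generated_answer"], r["expected_answer"]) for r in rows]
    return sum(scores) / len(scores)

# Toy usage with stub agent/judge in place of real model calls.
agent = lambda query: query.upper()
judge = lambda generated, expected: 1.0 if generated == expected.upper() else 0.0
rows = [{"input_query": "paris", "expected_answer": "paris"}]
print(score_rows(run_agent_generation(agent, rows), judge))  # 1.0
```

In the real implementation, generation and scoring are separate stages, so the same scoring functions can be reused whether answers come from a plain model or an agent with tools.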


## Test Plan


![image](https://github.com/user-attachments/assets/b4b6f086-1be4-4c2a-8ab0-6839f0067c0a)


![image](https://github.com/user-attachments/assets/05bb7a09-2d7a-4031-8eb6-e1ca670ee439)


#### SimpleQA w/ Search

![image](https://github.com/user-attachments/assets/0a51e3f3-9fc7-479b-8295-89aed63496e0)

- eval_task_config_simpleqa_search.json
```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "agent",
        "config": {
            "model": "Llama3.1-405B-Instruct",
            "instructions": "Please use the search tool to answer the question.",
            "sampling_params": {
                "strategy": "greedy",
                "temperature": 1.0,
                "top_p": 0.9
            },
            "tools": [
                {
                    "type": "brave_search",
                    "engine": "brave",
                    "api_key": "API_KEY"
                }
            ],
            "tool_choice": "auto",
            "tool_prompt_format": "json",
            "input_shields": [],
            "output_shields": [],
            "enable_session_persistence": false
        }
    }
}
```
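
A quick way to sanity-check a config like the one above before submitting an eval run is to parse it and assert the expected shape. The validation below is a sketch, not part of llama-stack; it only mirrors the field names shown in the JSON (the `API_KEY` placeholder is kept as-is):

```python
import json

# Illustrative shape check for the eval task config above; field names
# mirror the JSON example, but this helper is not a llama-stack API.
config_text = """
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "agent",
        "config": {
            "model": "Llama3.1-405B-Instruct",
            "sampling_params": {"strategy": "greedy", "temperature": 1.0, "top_p": 0.9},
            "tools": [{"type": "brave_search", "engine": "brave", "api_key": "API_KEY"}],
            "tool_choice": "auto"
        }
    }
}
"""

config = json.loads(config_text)
assert config["type"] == "benchmark"

candidate = config["eval_candidate"]
# An "agent" candidate means the eval runs full agent generation
# (including tool calls) before scoring.
assert candidate["type"] == "agent"
assert candidate["config"]["tools"][0]["type"] == "brave_search"
print(candidate["config"]["model"])  # Llama3.1-405B-Instruct
```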

#### SimpleQA w/o Search

![image](https://github.com/user-attachments/assets/6301feef-2abb-4bee-b50c-97da1c90482b)


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-18 11:43:03 -08:00
__init__.py API Updates (#73) 2024-09-17 19:51:35 -07:00
agents.py Rename all inline providers with an inline:: prefix (#423) 2024-11-11 22:19:16 -08:00
datasetio.py move hf addapter->remote (#459) 2024-11-14 22:41:19 -05:00
eval.py [Agentic Eval] add ability to run agents generation (#469) 2024-11-18 11:43:03 -08:00
inference.py fix fireworks (#427) 2024-11-12 12:15:55 -05:00
memory.py Kill "remote" providers and fix testing with a remote stack properly (#435) 2024-11-12 21:51:29 -08:00
safety.py Rename all inline providers with an inline:: prefix (#423) 2024-11-11 22:19:16 -08:00
scoring.py fix tests after registration migration & rename meta-reference -> basic / llm_as_judge provider (#424) 2024-11-12 10:35:44 -05:00
telemetry.py Rename all inline providers with an inline:: prefix (#423) 2024-11-11 22:19:16 -08:00