llama-stack/llama_stack/providers/registry
Xi Yan 0784284ab5
[Agentic Eval] add ability to run agents generation (#469)
# What does this PR do?

- add the ability to run agent generation for full eval (generate +
scoring)
- pre-register the SimpleQA benchmark's llm-as-judge scoring function in code
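
The generate-then-score flow described above can be sketched as follows. All names here (`run_agent_generation`, `score_rows`, the stub agent and judge) are illustrative only, not the actual llama-stack API:

```python
# Hypothetical sketch of the generate + score eval flow this PR enables.
# Function and field names are illustrative, not llama-stack's real API.

def run_agent_generation(agent, rows):
    """Run the agent candidate over each input row to produce an answer."""
    return [{**row, "generated_answer": agent(row["input_query"])} for row in rows]

def score_rows(rows, judge):
    """Apply a (stand-in) llm-as-judge scoring function and average the scores."""
    scores = [judge(r["generated_answer"], r["expected_answer"]) for r in rows]
    return sum(scores) / len(scores)

# Toy usage with stub agent/judge in place of real model calls.
agent = lambda query: query.upper()
judge = lambda generated, expected: 1.0 if generated == expected.upper() else 0.0
rows = [{"input_query": "paris", "expected_answer": "paris"}]
print(score_rows(run_agent_generation(agent, rows), judge))  # 1.0
```

In the real implementation, generation and scoring are separate stages, so the same scoring functions can be reused whether answers come from a plain model or an agent with tools.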


## Test Plan


![image](https://github.com/user-attachments/assets/b4b6f086-1be4-4c2a-8ab0-6839f0067c0a)


![image](https://github.com/user-attachments/assets/05bb7a09-2d7a-4031-8eb6-e1ca670ee439)


#### SimpleQA w/ Search

![image](https://github.com/user-attachments/assets/0a51e3f3-9fc7-479b-8295-89aed63496e0)

- eval_task_config_simpleqa_search.json
```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "agent",
        "config": {
            "model": "Llama3.1-405B-Instruct",
            "instructions": "Please use the search tool to answer the question.",
            "sampling_params": {
                "strategy": "greedy",
                "temperature": 1.0,
                "top_p": 0.9
            },
            "tools": [
                {
                    "type": "brave_search",
                    "engine": "brave",
                    "api_key": "API_KEY"
                }
            ],
            "tool_choice": "auto",
            "tool_prompt_format": "json",
            "input_shields": [],
            "output_shields": [],
            "enable_session_persistence": false
        }
    }
}
```
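
A quick way to sanity-check a config like the one above before submitting an eval run is to parse it and assert the expected shape. The validation below is a sketch, not part of llama-stack; it only mirrors the field names shown in the JSON (the `API_KEY` placeholder is kept as-is):

```python
import json

# Illustrative shape check for the eval task config above; field names
# mirror the JSON example, but this helper is not a llama-stack API.
config_text = """
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "agent",
        "config": {
            "model": "Llama3.1-405B-Instruct",
            "sampling_params": {"strategy": "greedy", "temperature": 1.0, "top_p": 0.9},
            "tools": [{"type": "brave_search", "engine": "brave", "api_key": "API_KEY"}],
            "tool_choice": "auto"
        }
    }
}
"""

config = json.loads(config_text)
assert config["type"] == "benchmark"

candidate = config["eval_candidate"]
# An "agent" candidate means the eval runs full agent generation
# (including tool calls) before scoring.
assert candidate["type"] == "agent"
assert candidate["config"]["tools"][0]["type"] == "brave_search"
print(candidate["config"]["model"])  # Llama3.1-405B-Instruct
```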

#### SimpleQA w/o Search

![image](https://github.com/user-attachments/assets/6301feef-2abb-4bee-b50c-97da1c90482b)


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2024-11-18 11:43:03 -08:00
__init__.py API Updates (#73) 2024-09-17 19:51:35 -07:00
agents.py Rename all inline providers with an inline:: prefix (#423) 2024-11-11 22:19:16 -08:00
datasetio.py move hf addapter->remote (#459) 2024-11-14 22:41:19 -05:00
eval.py [Agentic Eval] add ability to run agents generation (#469) 2024-11-18 11:43:03 -08:00
inference.py fix fireworks (#427) 2024-11-12 12:15:55 -05:00
memory.py Kill "remote" providers and fix testing with a remote stack properly (#435) 2024-11-12 21:51:29 -08:00
safety.py Rename all inline providers with an inline:: prefix (#423) 2024-11-11 22:19:16 -08:00
scoring.py fix tests after registration migration & rename meta-reference -> basic / llm_as_judge provider (#424) 2024-11-12 10:35:44 -05:00
telemetry.py Rename all inline providers with an inline:: prefix (#423) 2024-11-11 22:19:16 -08:00