Xi Yan
|
0784284ab5
|
[Agentic Eval] add ability to run agents generation (#469)
# What does this PR do?
- Add the ability to run agent generation for a full eval (generation +
scoring).
- Pre-register the SimpleQA benchmark's llm-as-judge scoring function in code.
## Test Plan


#### SimpleQA w/ Search

- eval_task_config_simpleqa_search.json
```json
{
  "type": "benchmark",
  "eval_candidate": {
    "type": "agent",
    "config": {
      "model": "Llama3.1-405B-Instruct",
      "instructions": "Please use the search tool to answer the question.",
      "sampling_params": {
        "strategy": "greedy",
        "temperature": 1.0,
        "top_p": 0.9
      },
      "tools": [
        {
          "type": "brave_search",
          "engine": "brave",
          "api_key": "API_KEY"
        }
      ],
      "tool_choice": "auto",
      "tool_prompt_format": "json",
      "input_shields": [],
      "output_shields": [],
      "enable_session_persistence": false
    }
  }
}
```
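The config above can also be generated from a script so the search API key is not committed to the file. The following is a minimal sketch, not part of this PR: the helper names `build_agent_eval_config` and `check_agent_eval_config` and the `BRAVE_SEARCH_API_KEY` environment variable are hypothetical, and the field names are copied from the JSON above.

```python
import json
import os


def build_agent_eval_config(api_key: str, model: str = "Llama3.1-405B-Instruct") -> dict:
    """Return a dict mirroring eval_task_config_simpleqa_search.json (hypothetical helper)."""
    return {
        "type": "benchmark",
        "eval_candidate": {
            "type": "agent",
            "config": {
                "model": model,
                "instructions": "Please use the search tool to answer the question.",
                "sampling_params": {
                    "strategy": "greedy",
                    "temperature": 1.0,
                    "top_p": 0.9,
                },
                "tools": [
                    {"type": "brave_search", "engine": "brave", "api_key": api_key}
                ],
                "tool_choice": "auto",
                "tool_prompt_format": "json",
                "input_shields": [],
                "output_shields": [],
                "enable_session_persistence": False,
            },
        },
    }


def check_agent_eval_config(config: dict) -> None:
    """Illustrative sanity check only; this is not llama-stack's own validation."""
    candidate = config["eval_candidate"]
    if candidate["type"] != "agent":
        raise ValueError("eval_candidate.type must be 'agent' for agent generation")
    for tool in candidate["config"]["tools"]:
        if not tool.get("api_key"):
            raise ValueError(f"tool {tool.get('type')!r} is missing an api_key")


if __name__ == "__main__":
    # Assumed env var name; falls back to the placeholder used in the JSON above.
    key = os.environ.get("BRAVE_SEARCH_API_KEY", "API_KEY")
    config = build_agent_eval_config(key)
    check_agent_eval_config(config)
    print(json.dumps(config, indent=2))
```

Redirecting the script's output to `eval_task_config_simpleqa_search.json` would produce a file with the same fields as the one shown in this test plan.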
#### SimpleQA w/o Search

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
|
2024-11-18 11:43:03 -08:00 |
|
Xi Yan
|
abdf7cddf3
|
[Evals API][4/n] evals with generation meta-reference impl (#303)
* wip
* dataset validation
* test_scoring
* cleanup
* clean up test
* comments
* error checking
* dataset client
* test client:
* datasetio client
* clean up
* basic scoring function works
* scorer wip
* equality scorer
* score batch impl
* score batch
* update scoring test
* refactor
* validate scorer input
* address comments
* evals with generation
* add all rows scores to ScoringResult
* minor typing
* bugfix
* scoring function def rename
* rebase name
* refactor
* address comments
* Update iOS inference instructions for new quantization
* Small updates to quantization config
* Fix score threshold in faiss
* Bump version to 0.0.45
* Handle both ipv6 and ipv4 interfaces together
* update manifest for build templates
* Update getting_started.md
* chatcompletion & completion input type validation
* inclusion->subsetof
* error checking
* scoring_function -> scoring_fn rename, scorer -> scoring_fn rename
* address comments
* [Evals API][5/n] fixes to generate openapi spec (#323)
* generate openapi
* typing comment, dataset -> dataset_id
* remove custom type
* sample eval run.yaml
---------
Co-authored-by: Dalton Flanagan <6599399+dltn@users.noreply.github.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
|
2024-10-25 13:12:39 -07:00 |
|