Xi Yan | edeb6dcf04 | mmlu loose | 2024-11-07 18:36:41 -08:00
Xi Yan | 6ee02ca23b | fix | 2024-11-07 18:25:39 -08:00
Xi Yan | 33b6d9b7b7 | merge | 2024-11-07 18:15:13 -08:00
Xi Yan | 94a56cc3f3 | register task required | 2024-11-07 16:41:23 -08:00
Xi Yan | fd581c3d88 | only keep 1 run_eval | 2024-11-07 16:17:49 -08:00
Xi Yan | 37d87c585a | wip huggingface register | 2024-11-07 15:59:55 -08:00
Xi Yan | d1633dc412 | huggingface provider | 2024-11-07 15:20:22 -08:00
Xi Yan | cc6edf6287 | Merge branch 'eval_task_register' into mmlu_benchmark | 2024-11-07 14:41:50 -08:00
Xi Yan | f05db9a25c | add eval_id for jobs | 2024-11-07 14:30:46 -08:00
Xi Yan | 51c20f9c29 | api refactor | 2024-11-07 13:54:26 -08:00
Xi Yan | 93995ecc4c | test wip | 2024-11-07 11:11:27 -08:00
Xi Yan | 3f1ac29d57 | test eval works | 2024-11-06 21:40:38 -08:00
Xi Yan | 413a1b6d8b | fix eval | 2024-11-06 21:10:54 -08:00
Xi Yan | def6d5d8ad | scoring resolve | 2024-11-06 17:04:25 -08:00

Xi Yan | 7b8748c53e | [Evals API][6/n] meta-reference llm as judge, registration for ScoringFnDefs (#330) | 2024-10-28 14:08:42 -07:00
* wip scoring refactor
* llm as judge, move folders
* test full generation + eval
* extract score regex to llm context
* remove prints, cleanup braintrust in this branch
* change json -> class
* remove initialize
* address nits
* check identifier prefix
* update MANIFEST

Xi Yan | abdf7cddf3 | [Evals API][4/n] evals with generation meta-reference impl (#303) | 2024-10-25 13:12:39 -07:00
* wip
* dataset validation
* test_scoring
* cleanup
* clean up test
* comments
* error checking
* dataset client
* test client:
* datasetio client
* clean up
* basic scoring function works
* scorer wip
* equality scorer
* score batch impl
* score batch
* update scoring test
* refactor
* validate scorer input
* address comments
* evals with generation
* add all rows scores to ScoringResult
* minor typing
* bugfix
* scoring function def rename
* rebase name
* refactor
* address comments
* Update iOS inference instructions for new quantization
* Small updates to quantization config
* Fix score threshold in faiss
* Bump version to 0.0.45
* Handle both ipv6 and ipv4 interfaces together
* update manifest for build templates
* Update getting_started.md
* chatcompletion & completion input type validation
* inclusion->subsetof
* error checking
* scoring_function -> scoring_fn rename, scorer -> scoring_fn rename
* address comments
* [Evals API][5/n] fixes to generate openapi spec (#323)
* generate openapi
* typing comment, dataset -> dataset_id
* remove custom type
* sample eval run.yaml
---------
Co-authored-by: Dalton Flanagan <6599399+dltn@users.noreply.github.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>