[docs] evals (#511)
# What does this PR do?

- add evals docs

## Test Plan

https://github.com/user-attachments/assets/7a1bcfcc-2c37-4cd2-9a72-bf43c2321022

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
commit 988f424c9c (parent 0481fa9540)
5 changed files with 134 additions and 0 deletions
docs/source/cookbooks/evals.md (new file, 124 lines)
@@ -0,0 +1,124 @@
# Evaluations

The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or on pre-registered benchmarks.

We introduce a new set of APIs in Llama Stack to support running evaluations of LLM applications:
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/eval_tasks` API

This guide goes over these sets of APIs and the developer experience flow of using Llama Stack to run evaluations for different use cases.

## Evaluation Concepts

The Evaluation APIs are associated with a set of Resources, as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for a better high-level understanding.

![Evaluation concepts](resources/eval-concept.png)

- **DatasetIO**: defines the interface for datasets and data loaders.
  - Associated with the `Dataset` resource.
- **Scoring**: evaluates the outputs of the system.
  - Associated with the `ScoringFunction` resource. We provide a suite of out-of-the-box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task that outputs evaluation metrics.
- **Eval**: generates outputs (via Inference or Agents) and performs scoring.
  - Associated with the `EvalTask` resource.

## Running Evaluations

Use the following decision tree to decide how to use the Llama Stack Evaluation flow.

![Evaluation flow](resources/eval-flow.png)

```{admonition} Note on Benchmark vs. Application Evaluation
:class: tip
- **Benchmark Evaluation** is a well-defined eval task consisting of a `dataset` and a `scoring_function`. The generation (inference or agent) is done as part of the evaluation.
- **Application Evaluation** assumes users already have app inputs and generated outputs. Evaluation purely focuses on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
```

The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.

#### Benchmark Evaluation CLI
Usage: there are two inputs necessary for running a benchmark eval:
- `eval-task-id`: the identifier associated with the eval task. Each `EvalTask` is parametrized by
  - `dataset_id`: the identifier associated with the dataset.
  - `List[scoring_function_id]`: a list of scoring function identifiers.
- `eval-task-config`: specifies the configuration of the model / agent to evaluate on.

```
llama-stack-client eval run_benchmark <eval-task-id> \
--eval-task-config ~/eval_task_config.json \
--visualize
```
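The JSON file passed via `--eval-task-config` follows the `EvalTaskConfig` format described in the "Defining EvalTaskConfig" section below; the benchmark example there can be used as a starting point for `~/eval_task_config.json`.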

#### Application Evaluation CLI
Usage: for running application evals, you will already have datasets available from your application. You will need to specify:
- `scoring-fn-id`: a list of `ScoringFunction` identifiers you wish to run on your application outputs.
- the `Dataset` used for evaluation, via either:
  - (1) `--dataset-path`: path to a local file containing the dataset to run evaluation on, or
  - (2) `--dataset-id`: a pre-registered dataset in Llama Stack
- (Optional) `--scoring-params-config`: optionally parameterize scoring functions with custom params (e.g. `judge_prompt`, `judge_model`, `parsing_regexes`).

```
llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <scoring_fn_id_n> \
--dataset-path <path-to-local-dataset> \
--output-dir ./
```
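
For reference, here is a minimal sketch of what a local dataset file passed via `--dataset-path` might contain. The column names (`input_query`, `generated_answer`, `expected_answer`) are an assumption for illustration only; the columns actually required depend on the scoring functions you select.

```json
[
    {
        "input_query": "What is the capital of France?",
        "generated_answer": "Paris is the capital of France.",
        "expected_answer": "Paris"
    },
    {
        "input_query": "Who wrote Pride and Prejudice?",
        "generated_answer": "Jane Austen wrote Pride and Prejudice.",
        "expected_answer": "Jane Austen"
    }
]
```

With rows in this shape, scoring functions such as LLM-as-judge score each row's `generated_answer`, optionally against its `expected_answer`.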

#### Defining EvalTaskConfig
The `EvalTaskConfig` is a user-specified config that defines:
1. The `EvalCandidate` to run generation on:
   - `ModelCandidate`: the model will be used for generation through the Llama Stack `/inference` API.
   - `AgentCandidate`: the agentic system specified by `AgentConfig` will be used for generation through the Llama Stack `/agents` API.
2. Optional scoring function params to allow customization of scoring function behaviour. This is useful for parameterizing generic scoring functions such as `LLMAsJudge` with a custom `judge_model` / `judge_prompt`.

**Example Benchmark EvalTaskConfig**
```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.2-3B-Instruct",
        "sampling_params": {
            "strategy": "greedy",
            "temperature": 0,
            "top_p": 0.95,
            "top_k": 0,
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    }
}
```

**Example Application EvalTaskConfig**
```json
{
    "type": "app",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.1-405B-Instruct",
        "sampling_params": {
            "strategy": "greedy",
            "temperature": 0,
            "top_p": 0.95,
            "top_k": 0,
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    },
    "scoring_params": {
        "llm-as-judge::llm_as_judge_base": {
            "type": "llm_as_judge",
            "judge_model": "meta-llama/Llama-3.1-8B-Instruct",
            "prompt_template": "Your job is to look at a question, a gold target ........",
            "judge_score_regexes": [
                "(A|B|C)"
            ]
        }
    }
}
```
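
Both examples above use a `ModelCandidate`. For the `AgentCandidate` case mentioned earlier, the candidate instead wraps an `AgentConfig`. The sketch below is illustrative only; the field names inside `config` (e.g. `instructions`, `tools`, `enable_session_persistence`) are assumptions and should be checked against the `AgentConfig` schema of your Llama Stack version.

```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "agent",
        "config": {
            "model": "Llama3.2-3B-Instruct",
            "instructions": "You are a helpful assistant. Answer concisely.",
            "sampling_params": {
                "strategy": "greedy",
                "temperature": 0,
                "top_p": 0.95,
                "top_k": 0,
                "max_tokens": 0,
                "repetition_penalty": 1.0
            },
            "tools": [],
            "enable_session_persistence": false
        }
    }
}
```

A `scoring_params` block like the one in the application example can be attached alongside `eval_candidate` when custom judge settings are needed.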

docs/source/cookbooks/index.md (new file, 9 lines)
@@ -0,0 +1,9 @@
# Cookbooks

- [Evaluations Flow](evals.md)

```{toctree}
:maxdepth: 2
:hidden:
evals.md
```
docs/source/cookbooks/resources/eval-concept.png (new binary file, 68 KiB, not shown)
docs/source/cookbooks/resources/eval-flow.png (new binary file, 249 KiB, not shown)
@@ -95,4 +95,5 @@ concepts/index
distributions/index
contributing/index
references/index
cookbooks/index
```