diff --git a/docs/source/cookbooks/evals.md b/docs/source/cookbooks/evals.md new file mode 100644 index 000000000..01872e6dc --- /dev/null +++ b/docs/source/cookbooks/evals.md @@ -0,0 +1,124 @@ +# Evaluations + +The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or pre-registered benchmarks. + + +We introduce a new set of APIs in Llama Stack for supporting running evaluations of LLM applications. +- `/datasetio` + `/datasets` API +- `/scoring` + `/scoring_functions` API +- `/eval` + `/eval_tasks` API + +This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases. + +## Evaluation Concepts + +The Evaluation APIs are associated with a set of Resources as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for better high-level understanding. + +![Eval Concepts](./resources/eval-concept.png) + +- **DatasetIO**: defines interface with datasets and data loaders. + - Associated with `Dataset` resource. +- **Scoring**: evaluate outputs of the system. + - Associated with `ScoringFunction` resource. We provide a suite of out-of-the box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task to output evaluation metrics. +- **Eval**: generate outputs (via Inference or Agents) and perform scoring. + - Associated with `EvalTask` resource. + + +## Running Evaluations +Use the following decision tree to decide how to use LlamaStack Evaluation flow. +![Eval Flow](./resources/eval-flow.png) + + +```{admonition} Note on Benchmark v.s. Application Evaluation +:class: tip +- **Benchmark Evaluation** is a well-defined eval-task consisting of `dataset` and `scoring_function`. The generation (inference or agent) will be done as part of evaluation. +- **Application Evaluation** assumes users already have app inputs & generated outputs. Evaluation will purely focus on scoring the generated outputs via scoring functions (e.g. LLM-as-judge). +``` + +The following examples give the quick steps to start running evaluations using the llama-stack-client CLI. + +#### Benchmark Evaluation CLI +Usage: There are 2 inputs necessary for running a benchmark eval +- `eval-task-id`: the identifier associated with the eval task. Each `EvalTask` is parametrized by + - `dataset_id`: the identifier associated with the dataset. + - `List[scoring_function_id]`: list of scoring function identifiers. +- `eval-task-config`: specifies the configuration of the model / agent to evaluate on. + + +``` +llama-stack-client eval run_benchmark \ +--eval-task-config ~/eval_task_config.json \ +--visualize +``` + + +#### Application Evaluation CLI +Usage: For running application evals, you will already have available datasets in hand from your application. You will need to specify: +- `scoring-fn-id`: List of ScoringFunction identifiers you wish to use to run on your application. +- `Dataset` used for evaluation: + - (1) `--dataset-path`: path to local file system containing datasets to run evaluation on + - (2) `--dataset-id`: pre-registered dataset in Llama Stack +- (Optional) `--scoring-params-config`: optionally parameterize scoring functions with custom params (e.g. `judge_prompt`, `judge_model`, `parsing_regexes`). + + +``` +llama-stack-client eval run_scoring ... +--dataset-path \ +--output-dir ./ +``` + +#### Defining EvalTaskConfig +The `EvalTaskConfig` are user specified config to define: +1. `EvalCandidate` to run generation on: + - `ModelCandidate`: The model will be used for generation through LlamaStack /inference API. + - `AgentCandidate`: The agentic system specified by AgentConfig will be used for generation through LlamaStack /agents API. +2. Optionally scoring function params to allow customization of scoring function behaviour. This is useful to parameterize generic scoring functions such as LLMAsJudge with custom `judge_model` / `judge_prompt`. + + +**Example Benchmark EvalTaskConfig** +```json +{ + "type": "benchmark", + "eval_candidate": { + "type": "model", + "model": "Llama3.2-3B-Instruct", + "sampling_params": { + "strategy": "greedy", + "temperature": 0, + "top_p": 0.95, + "top_k": 0, + "max_tokens": 0, + "repetition_penalty": 1.0 + } + } +} +``` + +**Example Application EvalTaskConfig** +```json +{ + "type": "app", + "eval_candidate": { + "type": "model", + "model": "Llama3.1-405B-Instruct", + "sampling_params": { + "strategy": "greedy", + "temperature": 0, + "top_p": 0.95, + "top_k": 0, + "max_tokens": 0, + "repetition_penalty": 1.0 + } + }, + "scoring_params": { + "llm-as-judge::llm_as_judge_base": { + "type": "llm_as_judge", + "judge_model": "meta-llama/Llama-3.1-8B-Instruct", + "prompt_template": "Your job is to look at a question, a gold target ........", + "judge_score_regexes": [ + "(A|B|C)" + ] + } + } +} +``` diff --git a/docs/source/cookbooks/index.md b/docs/source/cookbooks/index.md new file mode 100644 index 000000000..93405e76e --- /dev/null +++ b/docs/source/cookbooks/index.md @@ -0,0 +1,9 @@ +# Cookbooks + +- [Evaluations Flow](evals.md) + +```{toctree} +:maxdepth: 2 +:hidden: +evals.md +``` diff --git a/docs/source/cookbooks/resources/eval-concept.png b/docs/source/cookbooks/resources/eval-concept.png new file mode 100644 index 000000000..0cba25dfb Binary files /dev/null and b/docs/source/cookbooks/resources/eval-concept.png differ diff --git a/docs/source/cookbooks/resources/eval-flow.png b/docs/source/cookbooks/resources/eval-flow.png new file mode 100644 index 000000000..bd3cebdf8 Binary files /dev/null and b/docs/source/cookbooks/resources/eval-flow.png differ diff --git a/docs/source/index.md b/docs/source/index.md index 046337378..ca6e28ec2 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -95,4 +95,5 @@ concepts/index distributions/index contributing/index references/index +cookbooks/index ```