[docs] evals (#511)
# What does this PR do?

- add evals docs

## Test Plan

https://github.com/user-attachments/assets/7a1bcfcc-2c37-4cd2-9a72-bf43c2321022

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
commit 988f424c9c (parent 0481fa9540)
5 changed files with 134 additions and 0 deletions
docs/source/cookbooks/evals.md (new file, 124 lines)
@@ -0,0 +1,124 @@
# Evaluations

The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or on pre-registered benchmarks.

We introduce a new set of APIs in Llama Stack to support running evaluations of LLM applications:
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/eval_tasks` API

This guide goes over these sets of APIs and the developer experience flow of using Llama Stack to run evaluations for different use cases.

## Evaluation Concepts

The Evaluation APIs are associated with a set of Resources, as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for a better high-level understanding.

![Evaluation concepts](resources/eval-concept.png)

- **DatasetIO**: defines the interface for datasets and data loaders.
  - Associated with the `Dataset` resource.
- **Scoring**: evaluates the outputs of the system.
  - Associated with the `ScoringFunction` resource. We provide a suite of out-of-the-box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task that outputs evaluation metrics.
- **Eval**: generates outputs (via Inference or Agents) and performs scoring.
  - Associated with the `EvalTask` resource.

## Running Evaluations

Use the following decision tree to decide how to use the Llama Stack Evaluation flow.

![Evaluation flow](resources/eval-flow.png)

```{admonition} Note on Benchmark vs. Application Evaluation
:class: tip
- **Benchmark Evaluation** is a well-defined eval task consisting of a `dataset` and a `scoring_function`. The generation (inference or agent) is done as part of the evaluation.
- **Application Evaluation** assumes users already have app inputs and generated outputs. Evaluation purely focuses on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
```

The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.

#### Benchmark Evaluation CLI
Usage: there are two inputs necessary for running a benchmark eval:
- `eval-task-id`: the identifier associated with the eval task. Each `EvalTask` is parametrized by
  - `dataset_id`: the identifier associated with the dataset.
  - `List[scoring_function_id]`: a list of scoring function identifiers.
- `eval-task-config`: specifies the configuration of the model / agent to evaluate on.

```
llama-stack-client eval run_benchmark <eval-task-id> \
--eval-task-config ~/eval_task_config.json \
--visualize
```
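The JSON file passed via `--eval-task-config` follows the `EvalTaskConfig` format described in the "Defining EvalTaskConfig" section below; the benchmark example there can be used as a starting point for `~/eval_task_config.json`.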

#### Application Evaluation CLI
Usage: for running application evals, you will already have datasets available from your application. You will need to specify:
- `scoring-fn-id`: a list of `ScoringFunction` identifiers you wish to run on your application outputs.
- the `Dataset` used for evaluation, via either:
  - (1) `--dataset-path`: path to a local file containing the dataset to run evaluation on, or
  - (2) `--dataset-id`: a pre-registered dataset in Llama Stack
- (Optional) `--scoring-params-config`: optionally parameterize scoring functions with custom params (e.g. `judge_prompt`, `judge_model`, `parsing_regexes`).

```
llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <scoring_fn_id_n> \
--dataset-path <path-to-local-dataset> \
--output-dir ./
```
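
For reference, here is a minimal sketch of what a local dataset file passed via `--dataset-path` might contain. The column names (`input_query`, `generated_answer`, `expected_answer`) are an assumption for illustration only; the columns actually required depend on the scoring functions you select.

```json
[
    {
        "input_query": "What is the capital of France?",
        "generated_answer": "Paris is the capital of France.",
        "expected_answer": "Paris"
    },
    {
        "input_query": "Who wrote Pride and Prejudice?",
        "generated_answer": "Jane Austen wrote Pride and Prejudice.",
        "expected_answer": "Jane Austen"
    }
]
```

With rows in this shape, scoring functions such as LLM-as-judge score each row's `generated_answer`, optionally against its `expected_answer`.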

#### Defining EvalTaskConfig
The `EvalTaskConfig` is a user-specified config that defines:
1. The `EvalCandidate` to run generation on:
   - `ModelCandidate`: the model will be used for generation through the Llama Stack `/inference` API.
   - `AgentCandidate`: the agentic system specified by `AgentConfig` will be used for generation through the Llama Stack `/agents` API.
2. Optional scoring function params to allow customization of scoring function behaviour. This is useful for parameterizing generic scoring functions such as `LLMAsJudge` with a custom `judge_model` / `judge_prompt`.

**Example Benchmark EvalTaskConfig**
```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.2-3B-Instruct",
        "sampling_params": {
            "strategy": "greedy",
            "temperature": 0,
            "top_p": 0.95,
            "top_k": 0,
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    }
}
```

**Example Application EvalTaskConfig**
```json
{
    "type": "app",
    "eval_candidate": {
        "type": "model",
        "model": "Llama3.1-405B-Instruct",
        "sampling_params": {
            "strategy": "greedy",
            "temperature": 0,
            "top_p": 0.95,
            "top_k": 0,
            "max_tokens": 0,
            "repetition_penalty": 1.0
        }
    },
    "scoring_params": {
        "llm-as-judge::llm_as_judge_base": {
            "type": "llm_as_judge",
            "judge_model": "meta-llama/Llama-3.1-8B-Instruct",
            "prompt_template": "Your job is to look at a question, a gold target ........",
            "judge_score_regexes": [
                "(A|B|C)"
            ]
        }
    }
}
```
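
Both examples above use a `ModelCandidate`. For the `AgentCandidate` case mentioned earlier, the candidate instead wraps an `AgentConfig`. The sketch below is illustrative only; the field names inside `config` (e.g. `instructions`, `tools`, `enable_session_persistence`) are assumptions and should be checked against the `AgentConfig` schema of your Llama Stack version.

```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "agent",
        "config": {
            "model": "Llama3.2-3B-Instruct",
            "instructions": "You are a helpful assistant. Answer concisely.",
            "sampling_params": {
                "strategy": "greedy",
                "temperature": 0,
                "top_p": 0.95,
                "top_k": 0,
                "max_tokens": 0,
                "repetition_penalty": 1.0
            },
            "tools": [],
            "enable_session_persistence": false
        }
    }
}
```

A `scoring_params` block like the one in the application example can be attached alongside `eval_candidate` when custom judge settings are needed.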

docs/source/cookbooks/index.md (new file, 9 lines)
@@ -0,0 +1,9 @@
# Cookbooks

- [Evaluations Flow](evals.md)

```{toctree}
:maxdepth: 2
:hidden:
evals.md
```
docs/source/cookbooks/resources/eval-concept.png (new binary file, 68 KiB, not shown)
docs/source/cookbooks/resources/eval-flow.png (new binary file, 249 KiB, not shown)
@@ -95,4 +95,5 @@ concepts/index
distributions/index
contributing/index
references/index
cookbooks/index
```