docs: update eval doc (#1453)

# What does this PR do? - Update eval doc to reflect latest changes - Closes https://github.com/meta-llama/llama-stack/issues/1441 ## Test Plan read [//]: # (## Documentation)
2025-06-28 02:53:30 +00:00 · 2025-03-06 14:14:10 -08:00 · 2025-03-06 14:14:10 -08:00 · 564977c646
commit 564977c646
parent db4ee7a9ff
4 changed files with 140 additions and 244 deletions
--- a/docs/source/concepts/evaluation_concepts.md
+++ b/docs/source/concepts/evaluation_concepts.md
@ -24,17 +24,8 @@ The Evaluation APIs are associated with a set of Resources as shown in the follo
  - Associated with `Benchmark` resource.


-Use the following decision tree to decide how to use LlamaStack Evaluation flow.
-![Eval Flow](../references/evals_reference/resources/eval-flow.png)
-
-
-```{admonition} Note on Benchmark v.s. Application Evaluation
-:class: tip
- **Benchmark Evaluation** is a well-defined eval-task consisting of `dataset` and `scoring_function`. The generation (inference or agent) will be done as part of evaluation.
- **Application Evaluation** assumes users already have app inputs & generated outputs. Evaluation will purely focus on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
-```
-
 ## What's Next?

- Check out our Colab notebook on working examples with evaluations [here](https://colab.research.google.com/drive/10CHyykee9j2OigaIcRv47BKG9mrNm0tJ?usp=sharing).
+- Check out our Colab notebook on working examples with running benchmark evaluations [here](https://colab.research.google.com/github/meta-llama/llama-stack/blob/main/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb#scrollTo=mxLCsP4MvFqP).
+- Check out our [Building Applications - Evaluation](../building_applications/evals.md) guide for more details on how to use the Evaluation APIs to evaluate your applications.
 - Check out our [Evaluation Reference](../references/evals_reference/index.md) for more details on the APIs.