## Testing & Evaluation

Llama Stack provides built-in tools for evaluating your applications:

1. **Benchmarking**: Test against standard datasets
2. **Application Evaluation**: Score your application's outputs
3. **Custom Metrics**: Define your own evaluation criteria

Here's how to set up basic evaluation:

```python
# Create an evaluation task
response = client.benchmarks.register(
    benchmark_id="my_eval",
    dataset_id="my_dataset",
    scoring_functions=["accuracy", "relevance"],
)

# Run evaluation
job = client.eval.run_eval(
    benchmark_id="my_eval",
    task_config={
        "type": "app",
        "eval_candidate": {"type": "agent", "config": agent_config},
    },
)

# Get results
result = client.eval.job_result(benchmark_id="my_eval", job_id=job.job_id)
```
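
Once the job finishes, the returned result contains one entry per scoring function. The loop below is a minimal sketch for inspecting those entries; the `scores`, `aggregated_results`, and `score_rows` field names are assumptions about the response shape and may differ across Llama Stack versions.

```python
# Minimal sketch for inspecting evaluation output.
# NOTE: `scores`, `aggregated_results`, and `score_rows` are assumed field
# names for the result object; check your Llama Stack version for the schema.
for scoring_fn_id, scoring_result in result.scores.items():
    print(f"Scoring function: {scoring_fn_id}")
    # Aggregated metrics (e.g. average accuracy) across the whole dataset
    print(f"  aggregated: {scoring_result.aggregated_results}")
    # Per-example scores for the first few rows
    for row in scoring_result.score_rows[:3]:
        print(f"  row: {row}")
```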
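
For the "Application Evaluation" path in the list above, you can also score outputs your application has already produced, without registering a benchmark. The snippet below is a sketch, assuming a `client.scoring.score` endpoint that accepts input rows plus a mapping of scoring-function ids to their parameters; the scoring-function id, column names, and `results` field shown here are illustrative and should be checked against your deployment.

```python
# Sketch of scoring pre-generated application outputs directly.
# Assumptions: `client.scoring.score` takes `input_rows` and a dict of
# scoring-function ids to params, and the column names below match what
# the chosen scoring functions expect.
rows = [
    {
        "input_query": "What is the capital of France?",
        "generated_answer": "Paris",
        "expected_answer": "Paris",
    },
]

scoring_params = {
    "basic::subset_of": None,  # illustrative scoring-function id
}

scoring_response = client.scoring.score(
    input_rows=rows,
    scoring_functions=scoring_params,
)

# `results` is an assumed field mapping scoring-function ids to their scores
for fn_id, res in scoring_response.results.items():
    print(fn_id, res.aggregated_results)
```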