Jash Gulabrai 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								cc77f79f55 
								
							 
						 
						
							
							
								
								feat: Add NVIDIA Eval integration ( #1890 )  
							
							... 
							
							
							
							# What does this PR do?
This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack
eval module. The integration enables users to evaluate models via the
Llama Stack interface.
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
1. Added unit tests and successfully ran from root of project:
`./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py`
```
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED
```
2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd)
llama stack build --template nvidia --image-type venv`
Documentation added to
`llama_stack/providers/remote/eval/nvidia/README.md`
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com> 
							
						 
						
							2025-04-24 17:12:42 -07:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Botao Chen 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								f369871083 
								
							 
						 
						
							
							
								
								feat: [New Eval Benchamark] IfEval ( #1708 )  
							
							... 
							
							
							
							# What does this PR do?
In this PR, we added a new eval open benchmark IfEval based on paper
https://arxiv.org/abs/2311.07911  to measure the model capability of
instruction following.
## Test Plan
spin up a llama stack server with open-benchmark template
run `llama-stack-client --endpoint xxx eval run-benchmark
"meta-reference-ifeval" --model-id "meta-llama/Llama-3.3-70B-Instruct"
--output-dir "/home/markchen1015/" --num-examples 20` on client side and
get the eval aggregate results 
							
						 
						
							2025-03-19 16:39:59 -07:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									yyymeta 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								a626b7bce3 
								
							 
						 
						
							
							
								
								feat: [new open benchmark]  BFCL_v3 ( #1578 )  
							
							... 
							
							
							
							# What does this PR do?
create a new dataset BFCL_v3 from
https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html 
overall each question asks the model to perform a task described in
natural language, and additionally a set of available functions and
their schema are given for the model to choose from. the model is
required to write the function call form including function name and
parameters , to achieve the stated purpose. the results are validated
against provided ground truth, to make sure that the generated function
call and the ground truth function call are syntactically and
semantically equivalent, by checking their AST .
## Test Plan
start server by 
```
llama stack run ./llama_stack/templates/ollama/run.yaml
```
then send traffic
```
 llama-stack-client eval run-benchmark "bfcl"  --model-id   meta-llama/Llama-3.2-3B-Instruct    --output-dir /tmp/gpqa    --num-examples   2
```
[//]: # (## Documentation) 
							
						 
						
							2025-03-14 12:50:49 -07:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Xi Yan 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								3c72c034e6 
								
							 
						 
						
							
							
								
								[remove import *] clean up import *'s ( #689 )  
							
							... 
							
							
							
							# What does this PR do?
- as title, cleaning up `import *`'s
- upgrade tests to make them more robust to bad model outputs
- remove import *'s in llama_stack/apis/* (skip __init__ modules)
<img width="465" alt="image"
src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2 "
/>
- run `sh run_openapi_generator.sh`, no types gets affected
## Test Plan
### Providers Tests
**agents**
```
pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8
```
**inference**
```bash
# meta-reference
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py
# together
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py
pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py 
```
**safety**
```
pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B
```
**memory**
```
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384
```
**scoring**
```
pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct
pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
pytest -v -s -m braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py
```
**datasetio**
```
pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py
pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py
```
**eval**
```
pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py
```
### Client-SDK Tests
```
LLAMA_STACK_BASE_URL=http://localhost:5000  pytest -v ./tests/client-sdk
```
### llama-stack-apps
```
PORT=5000
LOCALHOST=localhost
python -m examples.agents.hello $LOCALHOST $PORT
python -m examples.agents.inflation $LOCALHOST $PORT
python -m examples.agents.podcast_transcript $LOCALHOST $PORT
python -m examples.agents.rag_as_attachments $LOCALHOST $PORT
python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT
python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT
python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT
# Vision model
python -m examples.interior_design_assistant.app
python -m examples.agent_store.app $LOCALHOST $PORT
```
### CLI
```
which llama
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
llama model list
llama stack list-apis
llama stack list-providers inference
llama stack build --template ollama --image-type conda
```
### Distributions Tests
**ollama**
```
llama stack build --template ollama --image-type conda
ollama run llama3.2:1b-instruct-fp16
llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
```
**fireworks**
```
llama stack build --template fireworks --image-type conda
llama stack run ./llama_stack/templates/fireworks/run.yaml
```
**together**
```
llama stack build --template together --image-type conda
llama stack run ./llama_stack/templates/together/run.yaml
```
**tgi**
```
llama stack run ./llama_stack/templates/tgi/run.yaml --env TGI_URL=http://0.0.0.0:5009  --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct
```
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md ),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests. 
							
						 
						
							2024-12-27 15:45:44 -08:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Xi Yan 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								0784284ab5 
								
							 
						 
						
							
							
								
								[Agentic Eval] add ability to run agents generation ( #469 )  
							
							... 
							
							
							
							# What does this PR do?
- add ability to run agents generation for full eval (generate +
scoring)
- pre-register SimpleQA  benchmark llm-as-judge scoring function in code
## Test Plan


#### Simple QA w/ Search

- eval_task_config_simpleqa_search.json
```json
{
    "type": "benchmark",
    "eval_candidate": {
        "type": "agent",
        "config": {
            "model": "Llama3.1-405B-Instruct",
            "instructions": "Please use the search tool to answer the question.",
            "sampling_params": {
                "strategy": "greedy",
                "temperature": 1.0,
                "top_p": 0.9
            },
            "tools": [
                {
                    "type": "brave_search",
                    "engine": "brave",
                    "api_key": "API_KEY"
                }
            ],
            "tool_choice": "auto",
            "tool_prompt_format": "json",
            "input_shields": [],
            "output_shields": [],
            "enable_session_persistence": false
        }
    }
}
```
#### SimpleQA w/o Search

## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md ),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests. 
							
						 
						
							2024-11-18 11:43:03 -08:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Ashwin Bharambe 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								3d7561e55c 
								
							 
						 
						
							
							
								
								Rename all inline providers with an inline:: prefix ( #423 )  
							
							
							
						 
						
							2024-11-11 22:19:16 -08:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Xi Yan 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								b4416b72fd 
								
							 
						 
						
							
							
								
								Folder restructure for evals/datasets/scoring ( #419 )  
							
							... 
							
							
							
							* rename evals related stuff
* fix datasetio
* fix scoring test
* localfs -> LocalFS
* refactor scoring
* refactor scoring
* remove 8b_correctness scoring_fn from tests
* tests w/ eval params
* scoring fn braintrust fixture
* import 
							
						 
						
							2024-11-11 17:35:40 -05:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Ashwin Bharambe 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								994732e2e0 
								
							 
						 
						
							
							
								
								impls -> inline, adapters -> remote (#381 )  
							
							
							
						 
						
							2024-11-06 14:54:05 -08:00 
							
								 
							
							
								 
							
						 
					 
				
					
						
							
								
								
									Xi Yan 
								
							 
						 
						
							
							
								
								
							
							
							
								
							
							
								abdf7cddf3 
								
							 
						 
						
							
							
								
								[Evals API][4/n] evals with generation meta-reference impl ( #303 )  
							
							... 
							
							
							
							* wip
* dataset validation
* test_scoring
* cleanup
* clean up test
* comments
* error checking
* dataset client
* test client:
* datasetio client
* clean up
* basic scoring function works
* scorer wip
* equality scorer
* score batch impl
* score batch
* update scoring test
* refactor
* validate scorer input
* address comments
* evals with generation
* add all rows scores to ScoringResult
* minor typing
* bugfix
* scoring function def rename
* rebase name
* refactor
* address comments
* Update iOS inference instructions for new quantization
* Small updates to quantization config
* Fix score threshold in faiss
* Bump version to 0.0.45
* Handle both ipv6 and ipv4 interfaces together
* update manifest for build templates
* Update getting_started.md
* chatcompletion & completion input type validation
* inclusion->subsetof
* error checking
* scoring_function -> scoring_fn rename, scorer -> scoring_fn rename
* address comments
* [Evals API][5/n] fixes to generate openapi spec (#323 )
* generate openapi
* typing comment, dataset -> dataset_id
* remove custom type
* sample eval run.yaml
---------
Co-authored-by: Dalton Flanagan <6599399+dltn@users.noreply.github.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com> 
							
						 
						
							2024-10-25 13:12:39 -07:00