mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-03 09:53:45 +00:00



Llama Stack Integration Tests

We use pytest for parameterizing and running tests. You can see all options with:

cd tests/integration

# this will show a long list of options, look for "Custom options:"
pytest --help

Here are the most important options:

--stack-config: specify the stack config to use. You have three ways to point to a stack:
- a URL which points to a Llama Stack distribution server
- a template (e.g., fireworks, together) or a path to a run.yaml file
- a comma-separated list of api=provider pairs, e.g. inference=fireworks,safety=llama-guard,agents=meta-reference. This is most useful for testing a single API surface.
--env: set environment variables, e.g. --env KEY=value. This utility option sets environment variables required by various providers.
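To make the three --stack-config forms and the --env format concrete, here is a minimal Python sketch of how such values could be told apart and parsed. The helper names classify_stack_config and parse_env are hypothetical and are not taken from the test suite's conftest.py; the real implementation may differ, and this sketch assumes that a path to a run.yaml file never contains "=".

```python
from urllib.parse import urlparse


def classify_stack_config(value):
    """Hypothetical helper: guess which of the three documented forms a value takes."""
    # Form 1: a URL pointing to a Llama Stack distribution server.
    if urlparse(value).scheme in ("http", "https"):
        return ("url", value)
    # Form 3: comma-separated api=provider pairs (assumes paths never contain "=").
    if "=" in value:
        pairs = dict(item.split("=", 1) for item in value.split(","))
        return ("providers", pairs)
    # Form 2: a template name or a path to a run.yaml file.
    return ("template_or_path", value)


def parse_env(items):
    """Hypothetical helper: turn repeated KEY=value strings into a dict."""
    env = {}
    for item in items:
        key, sep, val = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=value, got {item!r}")
        env[key] = val
    return env
```

For example, classify_stack_config("inference=fireworks,safety=llama-guard") yields the "providers" form with a two-entry mapping, while classify_stack_config("together") falls through to the template-or-path form.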

Model parameters can be influenced by the following options:

--text-model: comma-separated list of text models.
--vision-model: comma-separated list of vision models.
--embedding-model: comma-separated list of embedding models.
--safety-shield: comma-separated list of safety shields.
--judge-model: comma-separated list of judge models.
--embedding-dimension: output dimensionality of the embedding model to use for testing. Default: 384

Each model option (all of the above except --embedding-dimension) takes a comma-separated list and can be used to generate multiple parameter combinations. Note that tests will be skipped if no model is specified.
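As an illustration of how comma-separated lists can turn into multiple parameter combinations, the sketch below expands each option into its values and takes their Cartesian product. The function expand_model_options is hypothetical, and the Cartesian-product behavior is an assumption made for illustration; check conftest.py for how the suite actually parameterizes tests.

```python
from itertools import product


def expand_model_options(options):
    """Hypothetical helper: expand {option_name: "a,b" or None} into combinations."""
    names = list(options)
    value_lists = []
    for name in names:
        raw = options[name]
        # An unset option contributes a single None so the product is never empty.
        values = [v.strip() for v in raw.split(",")] if raw else [None]
        value_lists.append(values)
    # Assumption: every value of one option is paired with every value of the others.
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


combos = expand_model_options(
    {
        "text_model": "meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct",
        "vision_model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
    }
)
for combo in combos:
    print(combo)
```

Under this assumption, two text models and one vision model give two combinations, one per text model.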

Experimental options (under development):

--record-responses: record new API responses instead of using cached ones

Examples

Run all text inference tests with the together distribution and meta-llama/Llama-3.1-8B-Instruct:

pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=together \
   --text-model=meta-llama/Llama-3.1-8B-Instruct
Run all inference tests for several models:

TEXT_MODELS=meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct
VISION_MODELS=meta-llama/Llama-3.2-11B-Vision-Instruct
EMBEDDING_MODELS=all-MiniLM-L6-v2
export TOGETHER_API_KEY=<together_api_key>

pytest -s -v tests/integration/inference/ \
   --stack-config=together \
   --text-model=$TEXT_MODELS \
   --vision-model=$VISION_MODELS \
   --embedding-model=$EMBEDDING_MODELS

Run the same tests, but instead of the distribution use an ad hoc stack with a single provider (fireworks for inference):

export FIREWORKS_API_KEY=<fireworks_api_key>

pytest -s -v tests/integration/inference/ \
   --stack-config=inference=fireworks \
   --text-model=$TEXT_MODELS \
   --vision-model=$VISION_MODELS \
   --embedding-model=$EMBEDDING_MODELS

Run the Vector IO tests for one or more embedding models:

EMBEDDING_MODELS=all-MiniLM-L6-v2

pytest -s -v tests/integration/vector_io/ \
   --stack-config=inference=sentence-transformers,vector_io=sqlite-vec \
   --embedding-model=$EMBEDDING_MODELS