llama-stack-mirror/tests
Matthew Farrellee 36543a1100 chore: fix agents tests for non-ollama providers, provide max_tokens (#3657)
# What does this PR do?

closes #3656 


## Test Plan

openai is not enabled in ci, so manual testing with:

```
$ ./scripts/integration-tests.sh --stack-config ci-tests --suite base --setup gpt --subdirs agents --inference-mode live                            
=== Llama Stack Integration Test Runner ===
Stack Config: ci-tests
Setup: gpt
Inference Mode: live
Test Suite: base
Test Subdirs: agents
Test Pattern: 

Checking llama packages
llama-stack                              0.2.23          .../llama-stack
llama-stack-client                       0.3.0a3
ollama                                   0.5.1
=== System Resources Before Tests ===
...
=== Applying Setup Environment Variables ===
Setting up environment variables:
=== Running Integration Tests ===
Test subdirs to run: agents
Added test files from agents: 3 files

=== Running all collected tests in a single pytest command ===
Total test files: 3
+ pytest -s -v tests/integration/agents/test_persistence.py tests/integration/agents/test_openai_responses.py tests/integration/agents/test_agents.py --stack-config=ci-tests --inference-mode=live -k 'not( builtin_tool or safety_with_image or code_interpreter or test_rag )' --setup=gpt --color=yes --capture=tee-sys
WARNING  2025-10-02 13:14:32,653 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:33,043 root:258 uncategorized: Unknown logging category: tests. Falling back to default 'root' level: 20                    
INFO     2025-10-02 13:14:33,063 tests.integration.conftest:86 tests: Applying setup 'gpt'                                                            
========================================= test session starts ==========================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0 -- .../.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.11', 'Platform': 'Linux-6.16.7-200.fc42.x86_64-x86_64-with-glibc2.41', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'html': '4.1.1', 'anyio': '4.9.0', 'timeout': '2.4.0', 'cov': '6.2.1', 'asyncio': '1.1.0', 'nbval': '0.11.0', 'socket': '0.7.0', 'json-report': '1.5.0', 'metadata': '3.1.1'}}
rootdir: ...
configfile: pyproject.toml
plugins: html-4.1.1, anyio-4.9.0, timeout-2.4.0, cov-6.2.1, asyncio-1.1.0, nbval-0.11.0, socket-0.7.0, json-report-1.5.0, metadata-3.1.1
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 32 items / 6 deselected / 26 selected                                                        

tests/integration/agents/test_persistence.py::test_delete_agents_and_sessions SKIPPED (This ...) [  3%]
tests/integration/agents/test_persistence.py::test_get_agent_turns_and_steps SKIPPED (This t...) [  7%]
tests/integration/agents/test_openai_responses.py::test_responses_store[openai_client-txt=openai/gpt-4o-tools0-True] 
instantiating llama_stack_client
WARNING  2025-10-02 13:14:33,472 root:258 uncategorized: Unknown logging category: testing. Falling back to default 'root' level: 20                  
WARNING  2025-10-02 13:14:33,477 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:33,960 root:258 uncategorized: Unknown logging category: tokenizer_utils. Falling back to default 'root' level: 20          
WARNING  2025-10-02 13:14:33,962 root:258 uncategorized: Unknown logging category: models::llama. Falling back to default 'root' level: 20            
WARNING  2025-10-02 13:14:33,963 root:258 uncategorized: Unknown logging category: models::llama. Falling back to default 'root' level: 20            
WARNING  2025-10-02 13:14:33,968 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:33,974 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:33,978 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:35,350 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:35,366 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:35,489 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:35,490 root:258 uncategorized: Unknown logging category: inference_store. Falling back to default 'root' level: 20          
WARNING  2025-10-02 13:14:35,697 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:35,918 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
INFO     2025-10-02 13:14:35,945 llama_stack.providers.utils.inference.inference_store:74 inference_store: Write queue disabled for SQLite to avoid   
         concurrency issues                                                                                                                           
WARNING  2025-10-02 13:14:36,172 root:258 uncategorized: Unknown logging category: files. Falling back to default 'root' level: 20                    
WARNING  2025-10-02 13:14:36,218 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:36,219 root:258 uncategorized: Unknown logging category: vector_io. Falling back to default 'root' level: 20                
WARNING  2025-10-02 13:14:36,231 root:258 uncategorized: Unknown logging category: vector_io. Falling back to default 'root' level: 20                
WARNING  2025-10-02 13:14:36,255 root:258 uncategorized: Unknown logging category: tool_runtime. Falling back to default 'root' level: 20             
WARNING  2025-10-02 13:14:36,486 root:258 uncategorized: Unknown logging category: responses_store. Falling back to default 'root' level: 20          
WARNING  2025-10-02 13:14:36,503 root:258 uncategorized: Unknown logging category: openai::responses. Falling back to default 'root' level: 20        
INFO     2025-10-02 13:14:36,524 llama_stack.providers.utils.responses.responses_store:80 responses_store: Write queue disabled for SQLite to avoid   
         concurrency issues                                                                                                                           
WARNING  2025-10-02 13:14:36,528 root:258 uncategorized: Unknown logging category: providers::utils. Falling back to default 'root' level: 20         
WARNING  2025-10-02 13:14:36,703 root:258 uncategorized: Unknown logging category: uncategorized. Falling back to default 'root' level: 20            
WARNING  2025-10-02 13:14:36,726 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider fireworks: Pass    
         Fireworks API Key in the header X-LlamaStack-Provider-Data as { "fireworks_api_key": <your api key>}                                         
WARNING  2025-10-02 13:14:36,727 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider together: Pass     
         Together API Key in the header X-LlamaStack-Provider-Data as { "together_api_key": <your api key>}                                           
WARNING  2025-10-02 13:14:38,404 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider anthropic: API key 
         is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"anthropic_api_key": "<API_KEY>"}, 
         or in the provider config.                                                                                                                   
WARNING  2025-10-02 13:14:38,406 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider gemini: API key is 
         not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in 
         the provider config.                                                                                                                         
WARNING  2025-10-02 13:14:38,408 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider groq: API key is   
         not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in   
         the provider config.                                                                                                                         
WARNING  2025-10-02 13:14:38,411 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider sambanova: API key 
         is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, 
         or in the provider config.                                                                                                                   
llama_stack_client instantiated in 5.237s
SKIPPED [ 11%]
tests/integration/agents/test_openai_responses.py::test_list_response_input_items[openai_client-txt=openai/gpt-4o] SKIPPED [ 15%]
tests/integration/agents/test_openai_responses.py::test_list_response_input_items_with_limit_and_order[txt=openai/gpt-4o] SKIPPED [ 19%]
tests/integration/agents/test_openai_responses.py::test_function_call_output_response[txt=openai/gpt-4o] SKIPPED [ 23%]
tests/integration/agents/test_openai_responses.py::test_function_call_output_response_with_none_arguments[txt=openai/gpt-4o] SKIPPED [ 26%]
tests/integration/agents/test_agents.py::test_agent_simple[openai/gpt-4o] PASSED                 [ 30%]
tests/integration/agents/test_agents.py::test_agent_name[txt=openai/gpt-4o] SKIPPED (this te...) [ 34%]
tests/integration/agents/test_agents.py::test_tool_config[openai/gpt-4o] PASSED                  [ 38%]
tests/integration/agents/test_agents.py::test_custom_tool[openai/gpt-4o] FAILED                  [ 42%]
tests/integration/agents/test_agents.py::test_custom_tool_infinite_loop[openai/gpt-4o] PASSED    [ 46%]
tests/integration/agents/test_agents.py::test_tool_choice_required[openai/gpt-4o] INFO     2025-10-02 13:14:51,559 llama_stack.providers.inline.agents.meta_reference.agent_instance:691 agents::meta_reference: done with MAX          
         iterations (2), exiting.                                                                                                                     
PASSED         [ 50%]
tests/integration/agents/test_agents.py::test_tool_choice_none[openai/gpt-4o] PASSED             [ 53%]
tests/integration/agents/test_agents.py::test_tool_choice_get_boiling_point[openai/gpt-4o] XFAIL [ 57%]
tests/integration/agents/test_agents.py::test_create_turn_response[openai/gpt-4o-client_tools0] PASSED [ 61%]
tests/integration/agents/test_agents.py::test_multi_tool_calls[openai/gpt-4o] PASSED             [ 65%]
tests/integration/agents/test_openai_responses.py::test_responses_store[openai_client-txt=openai/gpt-4o-tools0-False] SKIPPED [ 69%]
tests/integration/agents/test_openai_responses.py::test_list_response_input_items[client_with_models-txt=openai/gpt-4o] PASSED [ 73%]
tests/integration/agents/test_agents.py::test_create_turn_response[openai/gpt-4o-client_tools1] PASSED [ 76%]
tests/integration/agents/test_openai_responses.py::test_responses_store[openai_client-txt=openai/gpt-4o-tools1-True] SKIPPED [ 80%]
tests/integration/agents/test_openai_responses.py::test_responses_store[openai_client-txt=openai/gpt-4o-tools1-False] SKIPPED [ 84%]
tests/integration/agents/test_openai_responses.py::test_responses_store[client_with_models-txt=openai/gpt-4o-tools0-True] SKIPPED [ 88%]
tests/integration/agents/test_openai_responses.py::test_responses_store[client_with_models-txt=openai/gpt-4o-tools0-False] SKIPPED [ 92%]
tests/integration/agents/test_openai_responses.py::test_responses_store[client_with_models-txt=openai/gpt-4o-tools1-True] SKIPPED [ 96%]
tests/integration/agents/test_openai_responses.py::test_responses_store[client_with_models-txt=openai/gpt-4o-tools1-False] SKIPPED [100%]

=============================================== FAILURES ===============================================
___________________________________ test_custom_tool[openai/gpt-4o] ____________________________________
tests/integration/agents/test_agents.py:370: in test_custom_tool
    assert "-100" in logs_str
E   assert '-100' in "inference> Polyjuice Potion is a fictional substance from the Harry Potter series, and it doesn't have a scientifically defined boiling point. If you have any other real liquid in mind, feel free to ask!"
========================================= slowest 10 durations =========================================
5.47s setup    tests/integration/agents/test_openai_responses.py::test_responses_store[openai_client-txt=openai/gpt-4o-tools0-True]
4.78s call     tests/integration/agents/test_agents.py::test_custom_tool[openai/gpt-4o]
3.01s call     tests/integration/agents/test_agents.py::test_tool_choice_required[openai/gpt-4o]
2.97s call     tests/integration/agents/test_agents.py::test_agent_simple[openai/gpt-4o]
2.85s call     tests/integration/agents/test_agents.py::test_tool_choice_none[openai/gpt-4o]
2.06s call     tests/integration/agents/test_agents.py::test_multi_tool_calls[openai/gpt-4o]
1.83s call     tests/integration/agents/test_agents.py::test_create_turn_response[openai/gpt-4o-client_tools0]
1.83s call     tests/integration/agents/test_agents.py::test_custom_tool_infinite_loop[openai/gpt-4o]
1.29s call     tests/integration/agents/test_agents.py::test_create_turn_response[openai/gpt-4o-client_tools1]
0.57s call     tests/integration/agents/test_openai_responses.py::test_list_response_input_items[client_with_models-txt=openai/gpt-4o]
======================================= short test summary info ========================================
FAILED tests/integration/agents/test_agents.py::test_custom_tool[openai/gpt-4o] - assert '-100' in "inference> Polyjuice Potion is a fictional substance from the Harry Potter series...
=========== 1 failed, 9 passed, 15 skipped, 6 deselected, 1 xfailed, 139 warnings in 27.18s ============
```
note: the failure is separate from the issue being fixed
2025-10-02 21:50:12 -07:00
..
common feat: Add items and title to ToolParameter/ToolParamDefinition (#3003) 2025-09-27 11:35:29 -07:00
containers feat(ci): add support for running vision inference tests (#2972) 2025-07-31 11:50:42 -07:00
external feat: introduce API leveling, post_training, eval to v1alpha (#3449) 2025-09-26 16:18:07 +02:00
integration chore: fix agents tests for non-ollama providers, provide max_tokens (#3657) 2025-10-02 21:50:12 -07:00
unit fix precommit errors 2025-10-02 11:34:10 -07:00
__init__.py refactor(test): introduce --stack-config and simplify options (#1404) 2025-03-05 17:02:02 -08:00
README.md feat(tests): introduce a test "suite" concept to encompass dirs, options (#3339) 2025-09-05 13:58:49 -07:00

There are two obvious types of tests:

Type Location Purpose
Unit tests/unit/ Fast, isolated component testing
Integration tests/integration/ End-to-end workflows with record-replay

Both have their place. For unit tests, it is important to create minimal mocks and instead rely more on "fakes". Mocks are too brittle. In either case, tests must be very fast and reliable.

Record-replay for integration tests

Testing AI applications end-to-end creates some challenges:

  • API costs accumulate quickly during development and CI
  • Non-deterministic responses make tests unreliable
  • Multiple providers require testing the same logic across different APIs

Our solution: Record real API responses once, replay them for fast, deterministic tests. This is better than mocking because AI APIs have complex response structures and streaming behavior. Mocks can miss edge cases that real APIs exhibit. A single test can exercise underlying APIs in multiple complex ways making it really hard to mock.

This gives you:

  • Cost control - No repeated API calls during development
  • Speed - Instant test execution with cached responses
  • Reliability - Consistent results regardless of external service state
  • Provider coverage - Same tests work across OpenAI, Anthropic, local models, etc.

Testing Quick Start

You can run the unit tests with:

uv run --group unit pytest -sv tests/unit/

For running integration tests, you must provide a few things:

  • A stack config. This is a pointer to a stack. You have a few ways to point to a stack:

    • server:<config> - automatically start a server with the given config (e.g., server:starter). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
    • server:<config>:<port> - same as above but with a custom port (e.g., server:starter:8322)
    • a URL which points to a Llama Stack distribution server
    • a distribution name (e.g., starter) or a path to a run.yaml file
    • a comma-separated list of api=provider pairs, e.g. inference=fireworks,safety=llama-guard,agents=meta-reference. This is most useful for testing a single API surface.
  • Any API keys you need to use should be set in the environment, or can be passed in with the --env option.

You can run the integration tests in replay mode with:

# Run all tests with existing recordings
  uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter

Re-recording tests

Local Re-recording (Manual Setup Required)

If you want to re-record tests locally, you can do so with:

LLAMA_STACK_TEST_INFERENCE_MODE=record \
  uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter -k "<appropriate test name>"

This will record new API responses and overwrite the existing recordings.


You must be careful when re-recording. CI workflows assume a specific setup for running the replay-mode tests. You must re-record the tests in the same way as the CI workflows. This means
- you need Ollama running and serving some specific models.
- you are using the `starter` distribution.

For easier re-recording without local setup, use the automated recording workflow:

# Record tests for specific test subdirectories
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents,inference"

# Record with vision tests enabled
./scripts/github/schedule-record-workflow.sh --test-suite vision

# Record with specific provider
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents" --test-provider vllm

This script:

  • 🚀 Runs in GitHub Actions - no local Ollama setup required
  • 🔍 Auto-detects your branch and associated PR
  • 🍴 Works from forks - handles repository context automatically
  • Commits recordings back to your branch

Prerequisites:

  • GitHub CLI: brew install gh && gh auth login
  • jq: brew install jq
  • Your branch pushed to a remote

Supported providers: vllm, ollama

Next Steps