Our integration tests need to be 'grouped' because each group often
needs a specific set of models it works with. We separated vision tests
due to this, and we have a separate set of tests which test "Responses"
API.
This PR makes this system a bit more official so it is very easy to
target these groups and apply all testing infrastructure towards all the
groups (for example, record-replay) uniformly.
There are three suites declared:
- base
- vision
- responses
Note that our CI currently runs the "base" and "vision" suites.
You can use the `--suite` option when running pytest (or any of the
testing scripts or workflows.) For example:
```
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/ --stack-config starter --suite vision
```
# What does this PR do?
A _bunch_ on cleanup for the Responses tests.
- Got rid of YAML test cases, moved them to just use simple pydantic models
- Splitting the large monolithic test file into multiple focused test files:
- `test_basic_responses.py` for basic and image response tests
- `test_tool_responses.py` for tool-related tests
- `test_file_search.py` for file search specific tests
- Adding a `StreamingValidator` helper class to standardize streaming response validation
## Test Plan
Run the tests:
```
pytest -s -v tests/integration/non_ci/responses/ \
--stack-config=starter \
--text-model openai/gpt-4o \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
-k "client_with_models"
```