# What does this PR do?
Completes the refactoring started in the previous commit by:
1. **Fix library client** (critical): Add logic to detect Pydantic model parameters and construct them properly from request bodies. The key fix is to NOT exclude any params when converting the body for Pydantic models; all fields must be passed to the Pydantic constructor.
   Before: `_convert_body` excluded all params, leaving the body empty for Pydantic construction.
   After: check for Pydantic params first, skip the exclusion, and construct the model with the full body (see the sketch after the detailed explanation below).
2. **Update remaining providers** to use the new Pydantic-based signatures:
   - `litellm_openai_mixin`: extract extra fields via `__pydantic_extra__` (see the example after this list)
   - `databricks`: use a `TYPE_CHECKING` import for the params type
   - `llama_openai_compat`: use a `TYPE_CHECKING` import for the params type
   - `sentence_transformers`: update method signatures to use `params`
3. **Update unit tests** to use the new Pydantic signature:
   - `test_openai_mixin.py`: use `OpenAIChatCompletionRequestParams`
This fixes test failures where the library client was trying to construct
Pydantic models with empty dictionaries.
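For reference, `__pydantic_extra__` is where Pydantic v2 collects undeclared fields when a model is configured with `extra="allow"`. A minimal illustration (not the mixin's actual model, which declares many more fields):

```python
from pydantic import BaseModel, ConfigDict

class OpenAIChatCompletionRequestParams(BaseModel):
    model_config = ConfigDict(extra="allow")  # keep unknown fields instead of rejecting them
    model: str

params = OpenAIChatCompletionRequestParams(model="llama3.2", temperature=0.7)
# Declared fields land on the model; everything else is collected here:
assert params.__pydantic_extra__ == {"temperature": 0.7}
```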
The previous fix had a bug: it called `_convert_body()`, which only keeps fields that match function parameter names. For Pydantic methods with the signature
`openai_chat_completion(params: OpenAIChatCompletionRequestParams)`,
the signature only has `params`, but the body has `model`, `messages`, etc., so `_convert_body()` returned an empty dict.
Fix: skip `_convert_body()` entirely for Pydantic params. Use the raw body directly to construct the Pydantic model (after stripping NOT_GIVENs).
This properly fixes the ValidationError where required fields were missing.
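A minimal sketch of the resulting dispatch logic. `find_pydantic_param` and `invoke` are illustrative names and the `NOT_GIVEN` sentinel is stubbed; the real client code differs in its details:

```python
import inspect
from typing import Any

from pydantic import BaseModel

NOT_GIVEN = object()  # stand-in for the client's NOT_GIVEN sentinel


def find_pydantic_param(func) -> tuple[str, type[BaseModel]] | None:
    """Return (name, type) of the first parameter annotated with a BaseModel subclass."""
    for name, param in inspect.signature(func).parameters.items():
        ann = param.annotation
        if inspect.isclass(ann) and issubclass(ann, BaseModel):
            return name, ann
    return None


async def invoke(func, body: dict[str, Any]):
    match = find_pydantic_param(func)
    if match is not None:
        name, model_cls = match
        # Skip _convert_body(): the signature only has `params`, so matching
        # body keys ('model', 'messages', ...) against parameter names would
        # drop everything. Build the model from the raw body instead, after
        # stripping NOT_GIVEN sentinels.
        clean = {k: v for k, v in body.items() if v is not NOT_GIVEN}
        return await func(**{name: model_cls(**clean)})
    # Non-Pydantic methods keep the old keyword-matching behavior.
    params = inspect.signature(func).parameters
    return await func(**{k: v for k, v in body.items()
                         if k in params and v is not NOT_GIVEN})
```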
The streaming code path (`_call_streaming`) had the same issue as the non-streaming one: it called `_convert_body()`, which returned an empty dict for Pydantic params.
Applied the same fix as commit 7476c0ae:
- Detect Pydantic model parameters before body conversion
- Skip `_convert_body()` for Pydantic params
- Construct the Pydantic model directly from the raw body (after stripping NOT_GIVENs)
This fixes streaming endpoints like `openai_chat_completion` with `stream=True`.
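Sketched for the streaming path, assuming `func` is an async generator and reusing the helper from the sketch above:

```python
async def _call_streaming(func, body: dict):
    kwargs = body
    match = find_pydantic_param(func)  # same helper as in the sketch above
    if match is not None:
        name, model_cls = match
        # Same fix: skip _convert_body() and build the model from the raw body.
        clean = {k: v for k, v in body.items() if v is not NOT_GIVEN}
        kwargs = {name: model_cls(**clean)}
    async for chunk in func(**kwargs):
        yield chunk
```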
Propagate test IDs from client to server via HTTP headers to maintain proper test isolation when running with server-based stack configs. Without this, recorded/replayed inference requests in server mode would leak across tests.
Changes:
- Patch the client's `_prepare_request` to inject the test ID into the provider data header
- Sync test context from provider data on the server side before storage operations
- Set the `LLAMA_STACK_TEST_STACK_CONFIG_TYPE` env var based on the stack config
- Configure console width for cleaner log output in CI
- Add a `SQLITE_STORE_DIR` temp directory for test data isolation
Uses the test ID in request hashes and test-scoped subdirectories to prevent cross-test contamination. Model list endpoints exclude the test ID to enable merging recordings from different servers.
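A rough sketch of the propagation and hashing idea; the header name, the `__test_id` key, and the excluded endpoint path are illustrative assumptions, not the exact implementation:

```python
import hashlib
import json

TEST_ID_HEADER = "X-LlamaStack-Provider-Data"  # hypothetical provider-data header

def inject_test_id(headers: dict[str, str], test_id: str) -> dict[str, str]:
    """Client side: fold the current test ID into the provider-data header."""
    provider_data = json.loads(headers.get(TEST_ID_HEADER, "{}"))
    provider_data["__test_id"] = test_id
    headers[TEST_ID_HEADER] = json.dumps(provider_data)
    return headers

def request_hash(endpoint: str, body: dict, test_id: str | None) -> str:
    """Server side: hash the request for recording lookups."""
    key = {"endpoint": endpoint, "body": body}
    # Model list endpoints omit the test ID so recordings from
    # different servers can be merged.
    if test_id and endpoint not in ("/v1/models",):
        key["test_id"] = test_id
    return hashlib.sha256(json.dumps(key, sort_keys=True).encode()).hexdigest()
```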
Additionally, this PR adds a `record-if-missing` mode, which records a response only when no recording exists yet; we will use it instead of `record`, which re-records everything.
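A sketch of the intended mode dispatch, with a dict standing in for the real recording store:

```python
_RECORDINGS: dict[str, dict] = {}  # stand-in for the real SQLite+JSON store

async def fetch_response(mode: str, key: str, call_real_api):
    recorded = _RECORDINGS.get(key)
    if mode == "replay":
        if recorded is None:
            raise RuntimeError(f"No recording for {key}; re-run in a record mode")
        return recorded
    if mode == "record-if-missing" and recorded is not None:
        return recorded  # reuse the existing recording instead of re-recording
    response = await call_real_api()  # "live" and "record" always hit the real API
    if mode in ("record", "record-if-missing"):
        _RECORDINGS[key] = response
    return response
```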
🤖 Co-authored with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude <noreply@anthropic.com>
Implements a comprehensive recording and replay system for inference API
calls that eliminates dependency on online inference providers during
testing. The system treats inference as deterministic by recording real
API responses and replaying them in subsequent test runs. Applies to
OpenAI clients (which should cover many inference requests) as well as
Ollama AsyncClient.
For storing, we use a hybrid system: SQLite for fast lookups and JSON files for easy greppability / debuggability.
As expected, tests become much faster (more than a 3x speedup in inference testing alone).
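A sketch of what such a hybrid store can look like; the schema and file layout here are illustrative, not the actual implementation:

```python
import json
import sqlite3
from pathlib import Path

class RecordingStore:
    """Hybrid store: SQLite index for fast lookups, JSON files for greppability."""

    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)
        self.db = sqlite3.connect(root / "index.sqlite")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS recordings (hash TEXT PRIMARY KEY, path TEXT)"
        )

    def save(self, request_hash: str, response: dict) -> None:
        path = self.root / f"{request_hash}.json"
        path.write_text(json.dumps(response, indent=2))  # human-readable for debugging
        self.db.execute(
            "INSERT OR REPLACE INTO recordings VALUES (?, ?)", (request_hash, str(path))
        )
        self.db.commit()

    def load(self, request_hash: str) -> dict | None:
        row = self.db.execute(
            "SELECT path FROM recordings WHERE hash = ?", (request_hash,)
        ).fetchone()
        return json.loads(Path(row[0]).read_text()) if row else None
```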
## Test Plan

Record responses against live providers:

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record LLAMA_STACK_TEST_RECORDING_DIR=<...> \
uv run pytest -s -v tests/integration/inference \
--stack-config=starter \
-k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
--text-model="ollama/llama3.2:3b-instruct-fp16" \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
Then replay them without hitting any live provider:

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay LLAMA_STACK_TEST_RECORDING_DIR=<...> \
uv run pytest -s -v tests/integration/inference \
--stack-config=starter \
-k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
--text-model="ollama/llama3.2:3b-instruct-fp16" \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
- `LLAMA_STACK_TEST_INFERENCE_MODE`: `live` (default), `record`, or `replay`
- `LLAMA_STACK_TEST_RECORDING_DIR`: storage location (must be specified for record or replay modes)
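A minimal sketch of how a harness can interpret these variables:

```python
import os

mode = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "live")
recording_dir = os.environ.get("LLAMA_STACK_TEST_RECORDING_DIR")

# Record and replay both need somewhere to store or find recordings.
if mode in ("record", "replay") and recording_dir is None:
    raise RuntimeError("LLAMA_STACK_TEST_RECORDING_DIR must be set for record/replay modes")
```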