Commit graph

7 commits

Author SHA1 Message Date
Ashwin Bharambe
8da9ff352c feat(tests): make inference_recorder into api_recorder (include tool_invoke) 2025-09-10 11:22:15 -07:00
Derek Higgins
ef02b9ea10
fix: environment variable typo in inference recorder error message (#3374)
The error message was referencing LLAMA_STACK_INFERENCE_MODE instead of
the correct LLAMA_STACK_TEST_INFERENCE_MODE environment variable.
2025-09-08 17:51:38 +02:00
Ashwin Bharambe
c3d3a0b833
feat(tests): auto-merge all model list responses and unify recordings (#3320)
Some checks failed
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s
Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped
Integration Tests (Replay) / Integration Tests (, , , client=, vision=) (push) Failing after 3s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 4s
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 7s
Update ReadTheDocs / update-readthedocs (push) Failing after 3s
Test External API and Providers / test-external (venv) (push) Failing after 5s
Vector IO Integration Tests / test-matrix (push) Failing after 7s
Python Package Build Test / build (3.13) (push) Failing after 8s
Python Package Build Test / build (3.12) (push) Failing after 8s
Unit Tests / unit-tests (3.13) (push) Failing after 14s
Unit Tests / unit-tests (3.12) (push) Failing after 14s
UI Tests / ui-tests (22) (push) Successful in 1m7s
Pre-commit / pre-commit (push) Successful in 2m34s
One needed to specify record-replay related environment variables for
running integration tests. We could not use defaults because integration
tests could be run against Ollama instances which could be running
different models. For example, text vs vision tests needed separate
instances of Ollama because a single instance typically cannot serve
both of these models if you assume the standard CI worker configuration
on Github. As a result, `client.list()` as returned by the Ollama client
would be different between these runs and we'd end up overwriting
responses.

This PR "solves" it by adding a small amount of complexity -- we store
model list responses specially, keyed by the hashes of the models they
return. At replay time, we merge all of them and pretend that we have
the union of all models available.

## Test Plan

Re-recorded all the tests using `scripts/integration-tests.sh
--inference-mode record`, including the vision tests.
2025-09-03 11:33:03 -07:00
Derek Higgins
7ca8233889
feat(testing): remove SQLite dependency from inference recorder (#3254)
Recording files use a predictable naming format, making the SQLite index
redundant. The binary SQLite file was causing frequent git conflicts.
Simplify by calculating file paths directly from request hashes.

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-08-26 09:17:00 -07:00
Ashwin Bharambe
eb07a0f86a
fix(ci, tests): ensure uv environments in CI are kosher, record tests (#3193)
Some checks failed
Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 21s
Test Llama Stack Build / build-single-provider (push) Failing after 23s
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 28s
Test Llama Stack Build / generate-matrix (push) Successful in 25s
Python Package Build Test / build (3.13) (push) Failing after 25s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 34s
Integration Tests (Replay) / Integration Tests (, , , client=, vision=) (push) Failing after 37s
Test External API and Providers / test-external (venv) (push) Failing after 33s
Unit Tests / unit-tests (3.13) (push) Failing after 33s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 38s
Python Package Build Test / build (3.12) (push) Failing after 1m0s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1m4s
Unit Tests / unit-tests (3.12) (push) Failing after 59s
Test Llama Stack Build / build (push) Failing after 50s
Vector IO Integration Tests / test-matrix (push) Failing after 1m48s
UI Tests / ui-tests (22) (push) Successful in 2m12s
Pre-commit / pre-commit (push) Successful in 2m41s
I started this PR trying to unbreak a newly broken test
`test_agent_name`. This test was broken all along but did not show up
because during testing we were pulling the "non-updated" llama stack
client. See this comment:
https://github.com/llamastack/llama-stack/pull/3119#discussion_r2270988205

While fixing this, I encountered a large amount of badness in our CI
workflow definitions.

- We weren't passing `LLAMA_STACK_DIR` or `LLAMA_STACK_CLIENT_DIR`
overrides to `llama stack build` at all in some cases.
- Even when we did, we used `uv run` liberally. The first thing `uv run`
does is "syncs" the project environment. This means, it is going to undo
any mutations we might have done ourselves. But we make many mutations
in our CI runners to these environments. The most important of which is
why `llama stack build` where we install distro dependencies. As a
result, when you tried to run the integration tests, you would see old,
strange versions.


## Test Plan

Re-record using:

```
sh scripts/integration-tests.sh --stack-config ci-tests \
  --provider ollama --test-pattern test_agent_name --inference-mode record
```

Then re-run with `--inference-mode replay`. But: 

Eventually, this test turned out to be quite flaky for telemetry
reasons. I haven't investigated it for now and just disabled it sadly
since we have a release to push out.
2025-08-18 17:02:24 -07:00
ehhuang
6ac710f3b0
fix(recording): endpoint resolution (#3013)
Some checks failed
Integration Tests (Replay) / discover-tests (push) Successful in 5s
Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 7s
Vector IO Integration Tests / test-matrix (3.12, remote::qdrant) (push) Failing after 13s
Vector IO Integration Tests / test-matrix (3.13, remote::pgvector) (push) Failing after 12s
Vector IO Integration Tests / test-matrix (3.12, inline::faiss) (push) Failing after 17s
Vector IO Integration Tests / test-matrix (3.12, remote::weaviate) (push) Failing after 15s
Integration Tests (Replay) / run-replay-mode-tests (push) Failing after 10s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 19s
Python Package Build Test / build (3.12) (push) Failing after 12s
Vector IO Integration Tests / test-matrix (3.13, remote::qdrant) (push) Failing after 15s
Test External API and Providers / test-external (venv) (push) Failing after 13s
Vector IO Integration Tests / test-matrix (3.12, remote::chromadb) (push) Failing after 18s
Python Package Build Test / build (3.13) (push) Failing after 14s
Vector IO Integration Tests / test-matrix (3.13, remote::weaviate) (push) Failing after 18s
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 23s
Unit Tests / unit-tests (3.12) (push) Failing after 17s
Vector IO Integration Tests / test-matrix (3.13, inline::faiss) (push) Failing after 18s
Vector IO Integration Tests / test-matrix (3.12, inline::milvus) (push) Failing after 21s
Vector IO Integration Tests / test-matrix (3.13, remote::chromadb) (push) Failing after 19s
Vector IO Integration Tests / test-matrix (3.12, remote::pgvector) (push) Failing after 21s
Vector IO Integration Tests / test-matrix (3.13, inline::sqlite-vec) (push) Failing after 17s
Vector IO Integration Tests / test-matrix (3.12, inline::sqlite-vec) (push) Failing after 56s
Unit Tests / unit-tests (3.13) (push) Failing after 52s
Vector IO Integration Tests / test-matrix (3.13, inline::milvus) (push) Failing after 55s
Pre-commit / pre-commit (push) Successful in 1m49s
# What does this PR do?


## Test Plan
2025-08-01 16:23:54 -07:00
Ashwin Bharambe
08b4a1deb3
feat(tests): introduce inference record/replay to increase test reliability (#2941)
Implements a comprehensive recording and replay system for inference API
calls that eliminates dependency on online inference providers during
testing. The system treats inference as deterministic by recording real
API responses and replaying them in subsequent test runs. Applies to
OpenAI clients (which should cover many inference requests) as well as
Ollama AsyncClient.

For storing, we use a hybrid system: Sqlite for fast lookups and JSON
files for easy greppability / debuggability.

As expected, tests become much much faster (more than 3x in just
inference testing.)

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record LLAMA_STACK_TEST_RECORDING_DIR=<...> \
  uv run pytest -s -v tests/integration/inference \
  --stack-config=starter \
  -k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
  --text-model="ollama/llama3.2:3b-instruct-fp16" \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```

```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay LLAMA_STACK_TEST_RECORDING_DIR=<...> \
  uv run pytest -s -v tests/integration/inference \
  --stack-config=starter \
  -k "not( builtin_tool or safety_with_image or code_interpreter or test_rag )" \
  --text-model="ollama/llama3.2:3b-instruct-fp16" \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```

- `LLAMA_STACK_TEST_INFERENCE_MODE`: `live` (default), `record`, or
`replay`
- `LLAMA_STACK_TEST_RECORDING_DIR`: Storage location (must be specified
for record or replay modes)
2025-07-29 12:41:31 -07:00