# What does this PR do?

The purpose of this PR is to replace Llama Stack's default embedding model with nomic-embed-text-v1.5. These are the key reasons the Llama Stack community decided to switch from all-MiniLM-L6-v2 to nomic-embed-text-v1.5:

1. The training data for [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data) includes many datasets with various licensing terms, so it is tricky to know when/whether it is appropriate to use this model for commercial applications.
2. The model is not particularly competitive on major benchmarks. For example, on the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), clicking Miscellaneous/BEIR to see English information retrieval accuracy shows that the top of the leaderboard is dominated by enormous models, but also that there are many, many models of relatively modest size with much higher Retrieval scores. If you want to look closely at the data, I recommend clicking "Download Table" because it is easier to browse that way.

More discussion can be found [here](https://github.com/llamastack/llama-stack/issues/2418).

Closes #2418

## Test Plan

1. Run `./scripts/unit-tests.sh`
2. Integration tests via CI workflow

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# Test Recording System
This directory contains recorded inference API responses used for deterministic testing without requiring live API access.
## Structure

- `responses/` - JSON files containing request/response pairs for inference operations
## Recording Format

Each JSON file contains:

- `request` - the normalized request parameters (method, endpoint, body)
- `response` - the response body (serialized from Pydantic models)
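A recording file might look roughly like this (the endpoint, model name, and message contents below are illustrative, not taken from a real recording):

```json
{
  "request": {
    "method": "POST",
    "endpoint": "/v1/chat/completions",
    "body": {
      "model": "example-model",
      "messages": [{"role": "user", "content": "Hi"}]
    }
  },
  "response": {
    "id": "rec-0123456789ab",
    "created": 0,
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}]
  }
}
```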
## Normalization
To reduce noise in git diffs, the recording system automatically normalizes fields that vary between runs but don't affect test behavior:
### OpenAI-style responses

- `id` - deterministic hash based on the request: `rec-{request_hash[:12]}`
- `created` - normalized to epoch `0`
### Ollama-style responses

- `created_at` - normalized to `"1970-01-01T00:00:00.000000Z"`
- `total_duration` - normalized to `0`
- `load_duration` - normalized to `0`
- `prompt_eval_duration` - normalized to `0`
- `eval_duration` - normalized to `0`
These normalizations ensure that re-recording tests produces minimal git diffs, making it easier to review actual changes to test behavior.
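As a rough illustration of the rules above, the normalization step can be sketched like this (a hypothetical helper; the recording system's actual function names and structure may differ):

```python
# Fields that vary between runs in Ollama-style responses.
OLLAMA_DURATION_FIELDS = (
    "total_duration",
    "load_duration",
    "prompt_eval_duration",
    "eval_duration",
)


def normalize_response(body: dict, request_hash: str) -> dict:
    """Replace run-varying fields with deterministic values, per the rules above.

    This is an illustrative sketch, not the recording system's real code.
    """
    out = dict(body)
    # OpenAI-style fields: deterministic id derived from the request hash,
    # and a zeroed creation timestamp.
    if "id" in out:
        out["id"] = f"rec-{request_hash[:12]}"
    if "created" in out:
        out["created"] = 0
    # Ollama-style fields: fixed epoch timestamp and zeroed durations.
    if "created_at" in out:
        out["created_at"] = "1970-01-01T00:00:00.000000Z"
    for field in OLLAMA_DURATION_FIELDS:
        if field in out:
            out[field] = 0
    return out
```

Because the replacement values depend only on the request (or are constants), re-recording the same interaction yields a byte-identical file.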
## Usage
### Replay mode (default)
Responses are replayed from recordings:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay pytest tests/integration/
```
### Record-if-missing mode (recommended for adding new tests)
Records only when no recording exists, otherwise replays. Use this for iterative development:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record-if-missing pytest tests/integration/
```
### Recording mode
Force-records all API interactions, overwriting existing recordings. Use with caution:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record pytest tests/integration/
```
### Live mode
Skip recordings entirely and use live APIs:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
```
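The four modes above amount to a simple dispatch on the environment variable. A hypothetical sketch of how a test harness might implement that decision (the function names here are illustrative, not the real harness internals):

```python
import os

VALID_MODES = {"replay", "record-if-missing", "record", "live"}


def resolve_mode() -> str:
    """Read the documented env var, defaulting to replay mode."""
    mode = os.environ.get("LLAMA_STACK_TEST_INFERENCE_MODE", "replay")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown inference mode: {mode}")
    return mode


def should_record(mode: str, recording_exists: bool) -> bool:
    """Decide whether a live API call should be made and recorded.

    'record' always overwrites; 'record-if-missing' records only when no
    recording exists; 'replay' and 'live' never write recordings.
    """
    if mode == "record":
        return True
    if mode == "record-if-missing":
        return not recording_exists
    return False
```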
## Re-normalizing Existing Recordings
If you need to apply normalization to existing recordings (e.g., after updating the normalization logic):
```bash
python scripts/normalize_recordings.py
```

Use `--dry-run` to preview changes without modifying files.