# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The purpose of this PR is to replace the Llama Stack's default embedding
model by nomic-embed-text-v1.5.
These are the key reasons why Llama Stack community decided to switch
from all-MiniLM-L6-v2 to nomic-embed-text-v1.5:
1. The training data for
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data)
includes a lot of data sets with various licensing terms, so it is
tricky to know when/whether it is appropriate to use this model for
commercial applications.
2. The model is not particularly competitive on major benchmarks. For
example, if you look at the [MTEB
Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and click
on Miscellaneous/BEIR to see English information retrieval accuracy, you
see that the top of the leaderboard is dominated by enormous models but
also that there are many, many models of relatively modest size whith
much higher Retrieval scores. If you want to look closely at the data, I
recommend clicking "Download Table" because it is easier to browse that
way.
More discussion info can be founded
[here](https://github.com/llamastack/llama-stack/issues/2418)
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#2418
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
1. Run `./scripts/unit-tests.sh`
2. Integration tests via CI wokrflow
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
As shown in #3421, we can scale stack to handle more RPS with k8s
replicas. This PR enables multi process stack with uvicorn --workers so
that we can achieve the same scaling without being in k8s.
To achieve that we refactor main to split out the app construction
logic. This method needs to be non-async. We created a new `Stack` class
to house impls and have a `start()` method to be called in lifespan to
start background tasks instead of starting them in the old
`construct_stack`. This way we avoid having to manage an event loop
manually.
## Test Plan
CI
> uv run --with llama-stack python -m llama_stack.core.server.server
benchmarking/k8s-benchmark/stack_run_config.yaml
works.
> LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv
run uvicorn llama_stack.core.server.server:create_app --port 8321
--workers 4
works.