# What does this PR do?
pymilvus recently made `milvus-lite` an optional dependency to their
package. If someone wants to use the inline provider we must include the
extra dependency.
For more details see: https://github.com/milvus-io/pymilvus/pull/2976
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
currently `RemoteProviderSpec` has an `AdapterSpec` embedded in it.
Remove `AdapterSpec`, and put its leftover fields into
`RemoteProviderSpec`.
Additionally, many of the fields were duplicated between
`InlineProviderSpec` and `RemoteProviderSpec`. Move these to
`ProviderSpec` so they are shared.
Fixup the distro codegen to use `RemoteProviderSpec` directly rather
than `remote_provider_spec` which took an AdapterSpec and returned a
full provider spec
## Test Plan
existing distro tests should pass.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
this replaces the static model listing for any provider using
OpenAIMixin
currently -
- anthropic
- azure openai
- gemini
- groq
- llama-api
- nvidia
- openai
- sambanova
- tgi
- vertexai
- vllm
- not changed: together has its own impl
## Test Plan
- new unit tests
- manual for llama-api, openai, groq, gemini
```
for provider in llama-openai-compat openai groq gemini; do
uv run llama stack build --image-type venv --providers inference=remote::provider --run &
uv run --with llama-stack-client llama-stack-client models list | grep Total
```
results (17 sep 2025):
- llama-api: 4
- openai: 86
- groq: 21
- gemini: 66
closes#3467
# What does this PR do?
*Add dynamic authentication token forwarding support for vLLM provider*
This enables per-request authentication tokens for vLLM providers,
supporting use cases like RAG operations where different requests may
need different authentication tokens. The implementation follows the
same pattern as other providers like Together AI, Fireworks, and
Passthrough.
- Add LiteLLMOpenAIMixin that manages the vllm_api_token properly
Usage:
- Static: VLLM_API_TOKEN env var or config.api_token
- Dynamic: X-LlamaStack-Provider-Data header with vllm_api_token
All existing functionality is preserved while adding new dynamic
capabilities.
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
```
curl -X POST "http://localhost:8000/v1/chat/completions" -H "Authorization: Bearer my-dynamic-token" \
-H "X-LlamaStack-Provider-Data: {\"vllm_api_token\": \"Bearer my-dynamic-token\", \"vllm_url\": \"http://dynamic-server:8000\"}" \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.1-8b", "messages": [{"role": "user", "content": "Hello!"}]}'
```
---------
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
# What does this PR do?
Updates the qdrant provider's convert_id function to use a
FIPS-validated cryptographic hashing function, so that llama-stack is
considered to be `Designed for FIPS`.
The standard library `uuid.uuid5()` function uses SHA-1 under the hood,
which is not FIPS-validated. This commit uses an approach similar to the
one merged in #3423.
Closes#3476.
## Test Plan
Unit tests from scripts/unit-tests.sh were ran to verify that the tests
pass.
A small test script can display the data flow:
```python
import hashlib
import uuid
# Input
_id = "chunk_abc123"
print(_id)
# Step 1: Format and encode
hash_input = f"qdrant_id:{_id}".encode()
print(hash_input)
# Result: b'qdrant_id:chunk_abc123'
# Step 2: SHA-256 hash
sha256_hash = hashlib.sha256(hash_input).hexdigest()
print(sha256_hash)
# Result: "184893a6eafeaac487cb9166351e8625b994d50f3456d8bc6cea32a014a27151"
# Step 3: Create UUID from first 32 chars
uuid_string = str(uuid.UUID(sha256_hash[:32]))
print(uuid_string)
# sha256_hash[:32] = "184893a6eafeaac487cb9166351e8625"
# Final result: "184893a6-eafe-aac4-87cb-9166351e8625"
```
Signed-off-by: Doug Edgar <dedgar@redhat.com>
# What does this PR do?
adds dynamic model support to TGI
add new overwrite_completion_id feature to OpenAIMixin to deal with TGI
always returning id=""
## Test Plan
tgi: `docker run --gpus all --shm-size 1g -p 8080:80 -v /data:/data
ghcr.io/huggingface/text-generation-inference --model-id
Qwen/Qwen3-0.6B`
stack: `TGI_URL=http://localhost:8080 uv run llama stack build
--image-type venv --distro ci-tests --run`
test: `./scripts/integration-tests.sh --stack-config
http://localhost:8321 --setup tgi --subdirs inference --pattern openai`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR provides functionality for users to unregister ScoringFn and
Benchmark resources for `scoring` and `eval` APIs.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#3051
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Updated integration and unit tests via CI workflow
# What does this PR do?
Migrates MD5 and SHA-1 hash algorithms to SHA-256.
In particular, replaces:
- MD5 in chunk ID generation.
- MD5 in file verification.
- SHA-1 in model identifier digests.
And updates all related test expectations.
Original discussion:
https://github.com/llamastack/llama-stack/discussions/3413
<!-- If resolving an issue, uncomment and update the line below -->
Closes#3424.
## Test Plan
Unit tests from scripts/unit-tests.sh were updated to match the new hash
output, and ran to verify the tests pass.
Signed-off-by: Doug Edgar <dedgar@redhat.com>
# What does this PR do?
update vLLM inference provider to use OpenAIMixin for openai-compat
functions
inference recordings from Qwen3-0.6B and vLLM 0.8.3 -
```
docker run --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-0.6B --enable-auto-tool-choice --tool-call-parser hermes
```
## Test Plan
```
./scripts/integration-tests.sh --stack-config server:ci-tests --setup vllm --subdirs inference
```
# What does this PR do?
- Updating documentation on migration from RAG Tool to Vector Stores and
Files APIs
- Adding exception handling for Vector Stores in RAG Tool
- Add more tests on migration from RAG Tool to Vector Stores
- Migrate off of inference_api for context_retriever for RAG
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Integration and unit tests added
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Fixes#3370
AWS switched to requiring region-prefixed inference profile IDs instead
of foundation model IDs for on-demand throughput. This was causing
ValidationException errors.
Added auto-detection based on boto3 client region to convert model IDs
like meta.llama3-1-70b-instruct-v1:0 to
us.meta.llama3-1-70b-instruct-v1:0 depending on the detected region.
Also handles edge cases like ARNs, case insensitive regions, and None
regions.
Tested with this request.
```json
{
"model_id": "meta.llama3-1-8b-instruct-v1:0",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "tell me a riddle"
}
],
"sampling_params": {
"strategy": {
"type": "top_p",
"temperature": 0.7,
"top_p": 0.9
},
"max_tokens": 512
}
}
```
<img width="1488" height="878" alt="image"
src="https://github.com/user-attachments/assets/0d61beec-3869-4a31-8f37-9f554c280b88"
/>
# What does this PR do?
The openai package is already a dependency of the llama-stack project
itself, so let's the project dictate which openai version we need and
avoid potential breakage with unsatisfiable dependency resolution.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Duplicate chat completion IDs can be generated during tests especially
if they are replaying recorded responses across different tests. No need
to warn or error under those circumstances. In the wild, this is not
likely to happen at all (no evidence) so we aren't really hiding any
problem.
# What does this PR do?
Adds a write worker queue for writes to inference store. This avoids
overwhelming request processing with slow inference writes.
## Test Plan
Benchmark:
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
## RPS from 21 -> 57
# What does this PR do?
- Use BackgroundLogger when logging metric events.
- Reuse event loop in BackgroundLogger
## Test Plan
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
### RPS from 57 -> 62
# What does this PR do?
update VertexAI inference provider to use openai-python for
openai-compat functions
## Test Plan
```
$ VERTEX_AI_PROJECT=... uv run llama stack build --image-type venv --providers inference=remote::vertexai --run
...
$ LLAMA_STACK_CONFIG=http://localhost:8321 uv run --group test pytest -v -ra --text-model vertexai/vertex_ai/gemini-2.5-flash tests/integration/inference/test_openai_completion.py
...
```
i don't have an account to test this. `get_api_key` may also need to be
updated per
https://cloud.google.com/vertex-ai/generative-ai/docs/start/openai
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
When running RAG in a multi vector DB setting, it can be difficult to
trace where retrieved chunks originate from. This PR adds the
`vector_db_id` into each chunk’s metadata, making it easier to
understand which database a given chunk came from. This is helpful for
debugging and for analyzing retrieval behavior of multiple DBs.
Relevant code:
```python
for vector_db_id, result in zip(vector_db_ids, results):
for chunk, score in zip(result.chunks, result.scores):
if not hasattr(chunk, "metadata") or chunk.metadata is None:
chunk.metadata = {}
chunk.metadata["vector_db_id"] = vector_db_id
chunks.append(chunk)
scores.append(score)
```
## Test Plan
* Ran Llama Stack in debug mode.
* Verified that `vector_db_id` was added to each chunk’s metadata.
* Confirmed that the metadata was printed in the console when using the
RAG tool.
---------
Co-authored-by: are-ces <cpompeia@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
update the Anthropic inference provider to use openai-python for the
openai-compat endpoints
## Test Plan
ci
Co-authored-by: raghotham <rsm@meta.com>
# What does this PR do?
update Groq inference provider to use OpenAIMixin for openai-compat
endpoints
changes on api.groq.com -
- json_schema is now supported for specific models, see
https://console.groq.com/docs/structured-outputs#supported-models
- response_format with streaming is now supported for models that
support response_format
- groq no longer returns a 400 error if tools are provided and
tool_choice is not "required"
## Test Plan
```
$ GROQ_API_KEY=... uv run llama stack build --image-type venv --providers inference=remote::groq --run
...
$ LLAMA_STACK_CONFIG=http://localhost:8321 uv run --group test pytest -v -ra --text-model groq/llama-3.3-70b-versatile tests/integration/inference/test_openai_completion.py -k 'not store'
...
SKIPPED [3] tests/integration/inference/test_openai_completion.py:44: Model groq/llama-3.3-70b-versatile hosted by remote::groq doesn't support OpenAI completions.
SKIPPED [3] tests/integration/inference/test_openai_completion.py:94: Model groq/llama-3.3-70b-versatile hosted by remote::groq doesn't support vllm extra_body parameters.
SKIPPED [4] tests/integration/inference/test_openai_completion.py:73: Model groq/llama-3.3-70b-versatile hosted by remote::groq doesn't support n param.
SKIPPED [1] tests/integration/inference/test_openai_completion.py💯 Model groq/llama-3.3-70b-versatile hosted by remote::groq doesn't support chat completion calls with base64 encoded files.
======================= 8 passed, 11 skipped, 8 deselected, 2 warnings in 5.13s ========================
```
---------
Co-authored-by: raghotham <rsm@meta.com>
# What does this PR do?
update SambaNova inference provider to use OpenAIMixin for openai-compat
endpoints
## Test Plan
```
$ SAMBANOVA_API_KEY=... uv run llama stack build --image-type venv --providers inference=remote::sambanova --run
...
$ LLAMA_STACK_CONFIG=http://localhost:8321 uv run --group test pytest -v -ra --text-model sambanova/Meta-Llama-3.3-70B-Instruct tests/integration/inference -k 'not store'
...
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=sambanova/Meta-Llama-3.3-70B-Instruct-inference:chat_completion:tool_calling_tools_absent-True] - AttributeError: 'NoneType' object has no attribute 'delta'
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=sambanova/Meta-Llama-3.3-70B-Instruct-inference:chat_completion:tool_calling_tools_absent-False] - llama_stack_client.InternalServerError: Error code: 500 - {'detail': 'Internal server error: An une...
=========== 2 failed, 16 passed, 68 skipped, 8 deselected, 3 xfailed, 13 warnings in 15.85s ============
```
the two failures also exist before this change. they are part of the
deprecated inference.chat_completion tests that flow through litellm.
they can be resolved later.
# What does this PR do?
update the Gemini inference provider to use openai-python for the
openai-compat endpoints
partially addresses #3349, does not address /inference/completion or
/inference/chat-completion
## Test Plan
ci
# What does this PR do?
Improved bedrock provider config to read from environment variables like
AWS_ACCESS_KEY_ID. Updated all
fields to use default_factory with lambda patterns like the nvidia
provider does.
Now the environment variables work as documented.
Closes#3305
## Test Plan
Ran the new bedrock config tests:
```bash
python -m pytest tests/unit/providers/inference/bedrock/test_config.py
-v
Verified existing provider tests still work:
python -m pytest tests/unit/providers/test_configs.py -v
What does this PR do?
Fixes error handling when MCP server connections fail. Instead of
returning generic 500 errors, now provides
descriptive error messages with proper HTTP status codes.
Closes#3107
Test Plan
Before fix:
curl -X GET
"http://localhost:8321/v1/tool-runtime/list-tools?tool_group_id=bad-mcp-server"
Returns: {"detail": "Internal server error: An unexpected error
occurred."} (500)
After fix:
curl -X GET
"http://localhost:8321/v1/tool-runtime/list-tools?tool_group_id=bad-mcp-server"
Returns: {"error": {"detail": "Failed to connect to MCP server at
http://localhost:9999/sse: Connection
refused"}} (502)
Tests:
- Added unit test for ConnectionError → 502 translation
- Manually tested with unreachable MCP servers (connection refused)
# What does this PR do?
Noticed the test
https://github.com/llamastack/llama-stack-ops/actions/workflows/test-maybe-cut.yaml
are still failing randomly.
Earlier fixed this with 0.18.0 of fireworks here
https://github.com/llamastack/llama-stack/pull/3267, the local testing
may have inadvertently picked a lower version with `<=` which I assumed
picks latest version.
Now tested with `==` to find the version where it broke and pinning to
version(`<=`) where it was passing.
## Test Plan
Tested locally with the following commands to start a container
Build container
`llama stack build --distro starter --image-type container`
start container `docker run -d -p 8321:8321 --name llama-stack-test
distribution-starter:0.2.20`
check health `http://localhost:8321/v1/health`
Above steps fails without the fix
Tested with `==` to ensure the same version is picked in local testing
instead of anything lower.
Following here for the fix from `fireworks-ai`
1410674695https://github.com/llamastack/llama-stack/issues/3273
- Wrap model loading with asyncio.to_thread() to prevent blocking during
model download/initialization
- Wrap encoding operations with asyncio.to_thread() to run in background
thread
- Convert _load_sentence_transformer_model() to async method
This ensures the async event loop remains responsive during embedding
operations.
Closes: #3332
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
One needed to specify record-replay related environment variables for
running integration tests. We could not use defaults because integration
tests could be run against Ollama instances which could be running
different models. For example, text vs vision tests needed separate
instances of Ollama because a single instance typically cannot serve
both of these models if you assume the standard CI worker configuration
on Github. As a result, `client.list()` as returned by the Ollama client
would be different between these runs and we'd end up overwriting
responses.
This PR "solves" it by adding a small amount of complexity -- we store
model list responses specially, keyed by the hashes of the models they
return. At replay time, we merge all of them and pretend that we have
the union of all models available.
## Test Plan
Re-recorded all the tests using `scripts/integration-tests.sh
--inference-mode record`, including the vision tests.
# What does this PR do?
This PR updates the Watsonx provider dependencies from
`ibm_watson_machine_learning` to `ibm_watsonx_ai`.
The old package `ibm_watson_machine_learning` is in **deprecation mode**
([[PyPI
link](https://pypi.org/project/ibm-watson-machine-learning/)](https://pypi.org/project/ibm-watson-machine-learning/))
and relies on older versions of dependencies such as `pandas`. Updating
to `ibm_watsonx_ai` ensures compatibility with current dependency
versions and ongoing support.
## Test Plan
I verified the update by running an inference using a model provided by
Watsonx. The model ran successfully, confirming that the new dependency
works as expected.
Co-authored-by: are-ces <cpompeia@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
The purpose of this PR is to refactor `SQLiteVecIndex` to eliminate
redundant code and simplify the code using generic
`WeightedInMemoryAggregator` that can be used for any vector db
provider. This pattern is already implemented for `PGVectorIndex` in
#3064
CC: @franciscojavierarceo
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
1. `./scripts/unit-tests.sh`
2. Integration tests in CI Workflow
# What does this PR do?
BFCL scoring function is not supported, removing it.
Also minor fixes as the llama stack run is broken for open-benchmark for
test plan verification
1. Correct the model paths for supported models
2. Fix another issue as there is no `provider_id` for DatasetInput but
logger assumes it exists.
```
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 332, in construct_stack
await register_resources(run_config, impls)
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 108, in register_resources
logger.debug(f"registering {rsrc.capitalize()} {obj} for provider {obj.provider_id}")
^^^^^^^^^^^^^^^
File "/Users/swapna942/llama-stack/.venv/lib/python3.13/site-packages/pydantic/main.py", line 991, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'DatasetInput' object has no attribute 'provider_id'
```
## Test Plan
```llama stack build --distro open-benchmark --image-type venv``` and run the server succeeds
Issue Link: https://github.com/llamastack/llama-stack/issues/3282
# What does this PR do?
add the ability to use inequalities in the where clause of the sqlstore.
this is infrastructure for files expiration.
## Test Plan
unit tests
# What does this PR do?
1725364988
Fixes the issue with open ai package incompatibilty introduced through
new dependency of fireworks-ai==0.19.18->reward-kit by pinning to
fireworks older version that doesnt pull in reward-kit
## Test Plan
Tested locally with the following commands to start a container
1. Build container
`llama stack build --distro starter --image-type container`
2. start container `docker run -d -p 8321:8321 --name llama-stack-test
distribution-starter:0.2.19`
3. check health http://localhost:8321/v1/health
Above steps fails without the fix
The `trl` dependency brings in `accelerate` which brings in nvidia
dependencies for torch. We cannot have that in the starter distro. As
such, no CPU-only post-training for the huggingface provider.
The starter distribution added post-training which added torch
dependencies which pulls in all the nvidia CUDA libraries. This made our
starter container very big. We have worked hard to keep the starter
container small so it serves its purpose as a starter. This PR tries to
get it back to its size by forking off duplicate "-gpu" providers for
post-training. These forked providers are then used for a new
`starter-gpu` distribution which can pull in all dependencies.
# What does this PR do?
closes https://github.com/llamastack/llama-stack/issues/3236
mypy considered our default implementations (raise NotImplementedError)
to be trivial. the result was we implemented the same stubs in
providers.
this change puts enough into the default impls so mypy considers them
non-trivial. this allows us to remove the duplicate implementations.