Fixed KeyError when chunks don't have `document_id` in `metadata` or
`chunk_metadata`. Updated logging to safely extract `document_id` using
`getattr`, and updated RAG memory to handle the different `document_id`
locations. Added a test for missing `document_id` scenarios.
Fixes issue #3494, where `/v1/vector-io/insert` would crash with a KeyError.
# What does this PR do?
Fixes a KeyError crash in `/v1/vector-io/insert` when chunks are missing
`document_id` fields. The API
was failing even though `document_id` is optional according to the
schema.
Closes #3494
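For reference, a minimal sketch of the safe-extraction pattern this fix uses (field and attribute names follow this description; the exact llama-stack code may differ):
```python
# Sketch only: chunk shapes are assumptions based on the PR description.
def extract_document_id(chunk) -> str | None:
    # Try top-level metadata first, then fall back to chunk_metadata;
    # return None instead of raising KeyError when neither has it.
    metadata = getattr(chunk, "metadata", None) or {}
    if "document_id" in metadata:
        return metadata["document_id"]
    chunk_metadata = getattr(chunk, "chunk_metadata", None)
    return getattr(chunk_metadata, "document_id", None)
```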
## Test Plan
**Before fix:**
- POST to `/v1/vector-io/insert` with chunks → 500 KeyError
- Happened regardless of where `document_id` was placed
**After fix:**
- Same request works fine → 200 OK
- Tested with Postman using FAISS backend
- Added unit test covering missing `document_id` scenarios
This PR updates the Conversation item-related types and improves a
couple of critical parts of the implementation:
- it creates a streaming output item for the final assistant message
output by the model (see the sketch after this list). Until now we only
added content parts and included that message in the final response.
- rewrites the conversation update code completely to account for items
other than messages (tool calls, outputs, etc.)
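As an illustration, the rough event sequence the final assistant message now gets, modeled on OpenAI's Responses streaming event names (field shapes trimmed; the Stack's internal types differ):
```python
# Illustrative event flow only; not the Stack's actual event objects.
final_message_stream = [
    {"type": "response.output_item.added", "item": {"type": "message", "role": "assistant"}},
    {"type": "response.content_part.added", "part": {"type": "output_text"}},
    {"type": "response.output_text.delta", "delta": "Hello"},
    {"type": "response.content_part.done", "part": {"type": "output_text", "text": "Hello"}},
    # New in this PR: the message is also closed out as a streamed output
    # item, instead of appearing only in the final response object.
    {"type": "response.output_item.done", "item": {"type": "message", "role": "assistant"}},
]
```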
## Test Plan
Used the test script from
https://github.com/llamastack/llama-stack-client-python/pull/281 for
this
```
TEST_API_BASE_URL=http://localhost:8321/v1 \
pytest tests/integration/test_agent_turn_step_events.py::test_client_side_function_tool -xvs
```
# Add support for Google Gemini `gemini-embedding-001` embedding model
and correctly register the model type
PR description created with the assistance of Claude-4.5-sonnet
This resolves https://github.com/llamastack/llama-stack/issues/3755
## What does this PR do?
This PR adds support for the `gemini-embedding-001` Google embedding
model to the llama-stack Gemini provider. This model provides
high-dimensional embeddings (3072 dimensions) compared to the existing
`text-embedding-004` model (768 dimensions). Older embedding models (such
as `text-embedding-004`) will be deprecated soon, according to Google
([link](https://developers.googleblog.com/en/gemini-embedding-available-gemini-api/)).
## Problem
The Gemini provider only supported the `text-embedding-004` embedding
model. The newer `gemini-embedding-001` model, which provides
higher-dimensional embeddings for improved semantic representation, was
not available through llama-stack.
## Solution
This PR consists of three commits that add the model, fix the model
registration, and fix embedding generation:
### Commit 1: Initial addition of gemini-embedding-001
Added metadata for `gemini-embedding-001` to the
`embedding_model_metadata` dictionary:
```python
embedding_model_metadata: dict[str, dict[str, int]] = {
    "text-embedding-004": {"embedding_dimension": 768, "context_length": 2048},
    "gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048},  # NEW
}
```
**Issue discovered:** The model was not being registered correctly
because the dictionary keys didn't match the model IDs returned by
Gemini's API.
### Commit 2: Fix model ID matching with `models/` prefix
Updated both dictionary keys to include the `models/` prefix to match
Gemini's OpenAI-compatible API response format:
```python
embedding_model_metadata: dict[str, dict[str, int]] = {
"models/text-embedding-004": {"embedding_dimension": 768, "context_length": 2048}, # UPDATED
"models/gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048}, # UPDATED
}
```
**Root cause:** Gemini's OpenAI-compatible API returns model IDs with
the `models/` prefix (e.g., `models/text-embedding-004`). The
`OpenAIMixin.list_models()` method directly matches these IDs against
the `embedding_model_metadata` dictionary keys. Without the prefix, the
models were being registered as LLMs instead of embedding models.
### Commit 3: Fix embedding generation for providers without usage stats
Fixed a bug in `OpenAIMixin.openai_embeddings()` that prevented
embedding generation for providers (like Gemini) that don't return usage
statistics:
```python
# Before (lines 351-354):
usage = OpenAIEmbeddingUsage(
    prompt_tokens=response.usage.prompt_tokens,  # ← crashed with AttributeError
    total_tokens=response.usage.total_tokens,
)

# After (lines 351-362):
if response.usage:
    usage = OpenAIEmbeddingUsage(
        prompt_tokens=response.usage.prompt_tokens,
        total_tokens=response.usage.total_tokens,
    )
else:
    usage = OpenAIEmbeddingUsage(
        prompt_tokens=0,  # default when not provided
        total_tokens=0,  # default when not provided
    )
```
**Impact:** This fix enables embedding generation for **all** Gemini
embedding models, not just the newly added one.
## Changes
### Modified Files
**`llama_stack/providers/remote/inference/gemini/gemini.py`**
- Line 17: Updated `text-embedding-004` key to
`models/text-embedding-004`
- Line 18: Added `models/gemini-embedding-001` with correct metadata
**`llama_stack/providers/utils/inference/openai_mixin.py`**
- Lines 351-362: Added null check for `response.usage` to handle
providers without usage statistics
## Key Technical Details
### Model ID Matching Flow
1. `list_provider_model_ids()` calls Gemini's `/v1/models` endpoint
2. API returns model IDs like: `models/text-embedding-004`,
`models/gemini-embedding-001`
3. `OpenAIMixin.list_models()` (line 410) checks: `if metadata :=
self.embedding_model_metadata.get(provider_model_id)`
4. If matched, registers as `model_type: "embedding"` with metadata;
otherwise registers as `model_type: "llm"` (sketched below)
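A minimal, self-contained sketch of that lookup (illustrative only; the real `OpenAIMixin.list_models()` does more):
```python
# Reproduces just the matching step described above; not the verbatim source.
embedding_model_metadata = {
    "models/text-embedding-004": {"embedding_dimension": 768, "context_length": 2048},
    "models/gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048},
}

def model_type_for(provider_model_id: str) -> str:
    # Without the "models/" prefix on the keys, this lookup misses and the
    # model falls through to "llm" -- exactly the bug commit 2 fixes.
    if embedding_model_metadata.get(provider_model_id):
        return "embedding"
    return "llm"

assert model_type_for("models/gemini-embedding-001") == "embedding"
assert model_type_for("gemini-1.5-pro") == "llm"
```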
### Why Both Keys Needed the Prefix
The `text-embedding-004` model was already working because there was
likely separate configuration or manual registration handling it. For
auto-discovery to work correctly for **both** models, both keys must
match the API's model ID format exactly.
## How to test this PR
Verified the changes by:
1. **Model Auto-Discovery**: Started llama-stack server and confirmed
models are auto-discovered from Gemini API
2. **Model Registration**: Confirmed both embedding models are correctly
registered and visible
```bash
curl http://localhost:8325/v1/models | jq '.data[] | select(.provider_id == "gemini" and .model_type == "embedding")'
```
**Results:**
- ✅ `gemini/models/text-embedding-004` - 768 dimensions - `model_type:
"embedding"`
- ✅ `gemini/models/gemini-embedding-001` - 3072 dimensions -
`model_type: "embedding"`
3. **Before Fix (Commit 1)**: Models appeared as `model_type: "llm"`
without embedding metadata
4. **After Fix (Commit 2)**: Models correctly identified as `model_type:
"embedding"` with proper metadata
5. **Generate Embeddings**: Verified embedding generation works
```bash
curl -X POST http://localhost:8325/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"model": "gemini/models/gemini-embedding-001", "input": "test"}' | \
jq '.data[0].embedding | length'
```
# What does this PR do?
Enables automatic embedding model detection for vector stores by using a
`default_configured` boolean that can be defined in the `run.yaml`.
## Test Plan
- Unit tests
- Integration tests
- Simple example below:
Spin up the stack:
```bash
uv run llama stack build --distro starter --image-type venv --run
```
Then test with OpenAI's client:
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
vs = client.vector_stores.create()
```
Previously you needed:
```python
vs = client.vector_stores.create(
extra_body={
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
"embedding_dimension": 384,
}
)
```
The `extra_body` is now unnecessary.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Previously, the NVIDIA inference provider implemented a custom
`openai_embeddings` method with a hardcoded `input_type="query"`
parameter, which is required by NVIDIA asymmetric embedding models
(see [#3205](https://github.com/llamastack/llama-stack/pull/3205)).
Recently, an `extra_body` parameter was added to the embeddings API
([#3794](https://github.com/llamastack/llama-stack/pull/3794)).
So this PR updates the NVIDIA inference provider to use the base
`OpenAIMixin.openai_embeddings` method instead and pass `input_type`
through the `extra_body` parameter for asymmetric embedding models.
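As a rough illustration, a caller can route the field through `extra_body` like this (a sketch assuming the OpenAI-compatible client path; the model name is taken from the test plan below):
```python
# Sketch: supplying input_type via extra_body (assumed client-facing shape).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")
response = client.embeddings.create(
    model="nvidia/llama-3.2-nv-embedqa-1b-v2",
    input="What is the capital of France?",
    extra_body={"input_type": "query"},  # required by asymmetric NVIDIA models
)
print(len(response.data[0].embedding))
```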
## Test Plan
Run the following command for each `embedding_model`:
`nvidia/llama-3.2-nv-embedqa-1b-v2`, `nvidia/nv-embedqa-e5-v5`,
`nvidia/nv-embedqa-mistral-7b-v2`, and `snowflake/arctic-embed-l`.
```
pytest -s -v tests/integration/inference/test_openai_embeddings.py --stack-config="inference=nvidia" --embedding-model={embedding_model} --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com" --inference-mode=record
```
# What does this PR do?
As discussed on discord, we do not need to reinvent the wheel for
telemetry. Instead we'll lean into the canonical OTEL stack.
Logs/traces/metrics will still be sent via OTEL - they just won't be
stored in, or queried through, the Stack.
This is the first of many PRs to remove the telemetry API from the Stack.
1) removed `webmethod` decorators to drop these endpoints from the API spec
2) removed tests, as @iamemilio is adding them for OTEL directly
## Test Plan
# What does this PR do?
Updates CONTRIBUTING.md with the following changes:
- Use Python 3.12 (and why)
- Use `pre-commit==4.3.0`
- Recommend using `-v` with pre-commit to get detailed info about why it
is failing, if it fails (example below)
- Instruct users to go to the `docs/` directory before rebuilding the
docs (the build doesn't work unless you do that)
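For instance, a verbose run over the whole tree looks like this (standard pre-commit flags, not quoted from the doc):
```bash
pre-commit run -v --all-files
```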
Signed-off-by: Bill Murdock <bmurdock@redhat.com>
# What does this PR do?
The purpose of this PR is to replace Llama Stack's default embedding
model with nomic-embed-text-v1.5.
These are the key reasons why the Llama Stack community decided to switch
from all-MiniLM-L6-v2 to nomic-embed-text-v1.5:
1. The training data for
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data)
includes a lot of data sets with various licensing terms, so it is
tricky to know when/whether it is appropriate to use this model for
commercial applications.
2. The model is not particularly competitive on major benchmarks. For
example, if you look at the [MTEB
Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and click
on Miscellaneous/BEIR to see English information retrieval accuracy, you
see that the top of the leaderboard is dominated by enormous models, but
also that there are many, many models of relatively modest size with
much higher Retrieval scores. If you want to look closely at the data, I
recommend clicking "Download Table" because it is easier to browse that
way.
More discussion can be found
[here](https://github.com/llamastack/llama-stack/issues/2418).
Closes #2418
## Test Plan
1. Run `./scripts/unit-tests.sh`
2. Integration tests via CI workflow
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR fixes issues with the WatsonX provider so it works correctly
with LiteLLM.
The main problem was that WatsonX requests failed because the provider
data validator didn’t properly handle the API key and project ID. This
was fixed by updating the `WatsonXProviderDataValidator` and ensuring the
provider data is loaded correctly.
The `openai_chat_completion` method was also updated to match the behavior
of other providers while adding WatsonX-specific fields like `project_id`.
It still calls `await super().openai_chat_completion.__func__(self,
params)` to keep the existing setup and tracing logic.
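For clarity, a self-contained sketch of that unbound-call pattern with stand-in classes (not the actual Stack or WatsonX types):
```python
# Stand-in classes to demonstrate the pattern; not the real adapter code.
class Base:
    async def openai_chat_completion(self, params: dict) -> dict:
        # Stands in for the mixin's setup and tracing logic.
        return {"echo": params}

class WatsonXAdapter(Base):
    async def openai_chat_completion(self, params: dict) -> dict:
        params["project_id"] = "my-watsonx-project"  # placeholder value
        # Call the base implementation unbound so its setup/tracing runs.
        return await super().openai_chat_completion.__func__(self, params)
```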
After these changes, WatsonX requests now run correctly.
## Test Plan
The changes were tested by running chat completion requests and
confirming that credentials and project parameters are passed correctly.
I have tested with my WatsonX credentials by using the CLI with `uv run
llama-stack-client inference chat-completion --session`.
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This commit migrates the authentication system from python-jose to PyJWT
to eliminate the dependency on the archived rsa package. The migration
includes:
- Refactored OAuth2TokenAuthProvider to use PyJWT's PyJWKClient for
clean JWKS handling
- Removed manual JWKS fetching, caching and key extraction logic in
favor of PyJWT's built-in functionality
The new implementation is cleaner, more maintainable, and follows PyJWT
best practices while maintaining full backward compatibility.
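For reference, the PyJWKClient pattern looks roughly like this (a sketch with placeholder URL and audience, not the Stack's actual configuration):
```python
# Sketch of PyJWT's built-in JWKS handling; URL and audience are placeholders.
import jwt
from jwt import PyJWKClient

jwks_client = PyJWKClient("https://issuer.example.com/.well-known/jwks.json")

def verify_token(token: str) -> dict:
    # PyJWKClient fetches and caches the JWKS, then selects the signing key
    # matching the token's "kid" header -- logic that was hand-rolled before.
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(token, signing_key.key, algorithms=["RS256"], audience="llama-stack")
```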
## Test Plan
Unit tests. Auth CI.
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Two main changes:
1. Removes the `provider_id` requirement in calls to vector stores
2. Removes the "register first embedding model" logic; an embedding
model id is now required on vector store creation
Simplifies the UX for the OpenAI client to:
```python
vs = client.vector_stores.create(
name="my_citations_db",
extra_body={
"embedding_model": "ollama/nomic-embed-text:latest",
}
)
```
## Test Plan
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Fixed the CI job to check the correct directory for file changes.
Artifacts are now stored in multiple directories, not just
`./tests/integration/recordings`.
Signed-off-by: Derek Higgins <derekh@redhat.com>