Commit graph

8 commits

Author SHA1 Message Date
Juan Pérez de Algaba
add8cd801b
feat(gemini): Support gemini-embedding-001 and fix models/ prefix in metadata keys (#3813)
# Add support for the Google Gemini `gemini-embedding-001` embedding model
and correctly register the model type

PR message created with the assistance of Claude-4.5-sonnet

This resolves https://github.com/llamastack/llama-stack/issues/3755

## What does this PR do?

This PR adds support for the `gemini-embedding-001` Google embedding
model to the llama-stack Gemini provider. This model provides
high-dimensional embeddings (3072 dimensions) compared to the existing
`text-embedding-004` model (768 dimensions). Older embedding models (such
as `text-embedding-004`) will soon be deprecated, according to Google
([Link](https://developers.googleblog.com/en/gemini-embedding-available-gemini-api/))

## Problem

The Gemini provider only supported the `text-embedding-004` embedding
model. The newer `gemini-embedding-001` model, which provides
higher-dimensional embeddings for improved semantic representation, was
not available through llama-stack.

## Solution

This PR consists of three commits that add the model metadata, fix the
model registration, and enable embedding generation:

### Commit 1: Initial addition of gemini-embedding-001

Added metadata for `gemini-embedding-001` to the
`embedding_model_metadata` dictionary:

```python
embedding_model_metadata: dict[str, dict[str, int]] = {
    "text-embedding-004": {"embedding_dimension": 768, "context_length": 2048},
    "gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048},  # NEW
}
```

**Issue discovered:** The model was not being registered correctly
because the dictionary keys didn't match the model IDs returned by
Gemini's API.

### Commit 2: Fix model ID matching with `models/` prefix

Updated both dictionary keys to include the `models/` prefix to match
Gemini's OpenAI-compatible API response format:

```python
embedding_model_metadata: dict[str, dict[str, int]] = {
    "models/text-embedding-004": {"embedding_dimension": 768, "context_length": 2048},      # UPDATED
    "models/gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048},  # UPDATED
}
```

**Root cause:** Gemini's OpenAI-compatible API returns model IDs with
the `models/` prefix (e.g., `models/text-embedding-004`). The
`OpenAIMixin.list_models()` method directly matches these IDs against
the `embedding_model_metadata` dictionary keys. Without the prefix, the
models were being registered as LLMs instead of embedding models.

### Commit 3: Fix embedding generation for providers without usage stats

Fixed a bug in `OpenAIMixin.openai_embeddings()` that prevented
embedding generation for providers (like Gemini) that don't return usage
statistics:

```python
# Before (Lines 351-354):
usage = OpenAIEmbeddingUsage(
    prompt_tokens=response.usage.prompt_tokens,  # ← Crashed with AttributeError
    total_tokens=response.usage.total_tokens,
)

# After (Lines 351-362):
if response.usage:
    usage = OpenAIEmbeddingUsage(
        prompt_tokens=response.usage.prompt_tokens,
        total_tokens=response.usage.total_tokens,
    )
else:
    usage = OpenAIEmbeddingUsage(
        prompt_tokens=0,  # Default when not provided
        total_tokens=0,   # Default when not provided
    )
```

**Impact:** This fix enables embedding generation for **all** Gemini
embedding models, not just the newly added one.

## Changes

### Modified Files

**`llama_stack/providers/remote/inference/gemini/gemini.py`**
- Line 17: Updated `text-embedding-004` key to
`models/text-embedding-004`
- Line 18: Added `models/gemini-embedding-001` with correct metadata

**`llama_stack/providers/utils/inference/openai_mixin.py`**
- Lines 351-362: Added null check for `response.usage` to handle
providers without usage statistics

## Key Technical Details

### Model ID Matching Flow

1. `list_provider_model_ids()` calls Gemini's `/v1/models` endpoint
2. API returns model IDs like: `models/text-embedding-004`,
`models/gemini-embedding-001`
3. `OpenAIMixin.list_models()` (line 410) checks: `if metadata :=
self.embedding_model_metadata.get(provider_model_id)`
4. If matched, registers as `model_type: "embedding"` with metadata;
otherwise registers as `model_type: "llm"`
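
The following standalone sketch mirrors that branch (the helper name and shape are hypothetical; the real check lives in `OpenAIMixin.list_models()`):

```python
def classify_model(
    provider_model_id: str,
    embedding_model_metadata: dict[str, dict[str, int]],
) -> tuple[str, dict[str, int]]:
    """Hypothetical helper mirroring the registration branch described above."""
    if metadata := embedding_model_metadata.get(provider_model_id):
        return "embedding", metadata
    return "llm", {}


metadata = {
    "models/gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048},
}
# With the "models/" prefix the lookup hits and the model is registered as an
# embedding model; without it, the lookup misses and it falls through to "llm".
assert classify_model("models/gemini-embedding-001", metadata)[0] == "embedding"
assert classify_model("gemini-embedding-001", metadata)[0] == "llm"
```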

### Why Both Keys Needed the Prefix

The `text-embedding-004` model was already working because there was
likely separate configuration or manual registration handling it. For
auto-discovery to work correctly for **both** models, both keys must
match the API's model ID format exactly.

## How to test this PR

Verified the changes by:

1. **Model Auto-Discovery**: Started llama-stack server and confirmed
models are auto-discovered from Gemini API

2. **Model Registration**: Confirmed both embedding models are correctly
registered and visible
```bash
curl http://localhost:8325/v1/models | jq '.data[] | select(.provider_id == "gemini" and .model_type == "embedding")'
```

**Results:**
- `gemini/models/text-embedding-004` - 768 dimensions - `model_type: "embedding"`
- `gemini/models/gemini-embedding-001` - 3072 dimensions - `model_type: "embedding"`

3. **Before Fix (Commit 1)**: Models appeared as `model_type: "llm"`
without embedding metadata

4. **After Fix (Commit 2)**: Models correctly identified as `model_type:
"embedding"` with proper metadata

5. **Generate Embeddings**: Verified embedding generation works
```bash
curl -X POST http://localhost:8325/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "gemini/models/gemini-embedding-001", "input": "test"}' | \
  jq '.data[0].embedding | length'
```
2025-10-15 12:22:10 -04:00
Matthew Farrellee
0066d986c5
feat: use SecretStr for inference provider auth credentials (#3724)
# What does this PR do?

use SecretStr for OpenAIMixin providers

- RemoteInferenceProviderConfig now has auth_credential: SecretStr
- the default alias is api_key (most common name)
- some providers override to use api_token (RunPod, vLLM, Databricks)
- some providers exclude it (Ollama, TGI, Vertex AI)

addresses #3517 
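
A minimal sketch of the pattern, assuming pydantic v2 (class and field names follow the list above; the defaults and example usage are assumptions):

```python
from pydantic import BaseModel, Field, SecretStr


class RemoteInferenceProviderConfig(BaseModel):
    # SecretStr masks the credential in repr() and logs; "api_key" is the
    # common alias, which some providers override (api_token) or exclude.
    auth_credential: SecretStr | None = Field(default=None, alias="api_key")


cfg = RemoteInferenceProviderConfig(api_key="sk-example")
print(cfg.auth_credential)                     # **********
print(cfg.auth_credential.get_secret_value())  # sk-example
```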

## Test Plan

ci w/ new tests
2025-10-10 07:32:50 -07:00
Matthew Farrellee
d23ed26238
chore: turn OpenAIMixin into a pydantic.BaseModel (#3671)
# What does this PR do?

- implement get_api_key instead of relying on
LiteLLMOpenAIMixin.get_api_key
 - remove use of LiteLLMOpenAIMixin
 - add default initialize/shutdown methods to OpenAIMixin
 - remove __init__s to allow proper pydantic construction
- remove dead code from vllm adapter and associated / duplicate unit
tests
 - update vllm adapter to use openaimixin for model registration
 - remove ModelRegistryHelper from fireworks & together adapters
 - remove Inference from nvidia adapter
 - complete type hints on embedding_model_metadata
- allow extra fields on OpenAIMixin, for model_store, __provider_id__,
etc. (see the sketch after this list)
 - new recordings for ollama
 - enhance the list models error handling
- update cerebras (remove cerebras-cloud-sdk) and anthropic (custom
model listing) inference adapters
 - parametrized test_inference_client_caching
- remove cerebras, databricks, fireworks, together from blanket mypy
exclude
 - removed unnecessary litellm deps
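
A rough sketch of the resulting shape, assuming pydantic v2 (method bodies and defaults are illustrative, not the actual mixin):

```python
from pydantic import BaseModel, ConfigDict


class OpenAIMixin(BaseModel):
    # extra="allow" lets runtime attributes such as model_store and
    # __provider_id__ be attached without declaring them as fields.
    model_config = ConfigDict(extra="allow")

    # fully typed static embedding metadata (see the bullet above)
    embedding_model_metadata: dict[str, dict[str, int]] = {}

    async def initialize(self) -> None:
        """Default no-op lifecycle hook."""

    async def shutdown(self) -> None:
        """Default no-op lifecycle hook."""
```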

## Test Plan

ci
2025-10-06 11:33:19 -04:00
Matthew Farrellee
b67aef2fc4
feat: add static embedding metadata to dynamic model listings for providers using OpenAIMixin (#3547)
# What does this PR do?

- remove auto-download of ollama embedding models
- add embedding model metadata to dynamic listing w/ unit test
- add support and tests for allowed_models (see the sketch after this
list)
- removed inference provider models.py files where dynamic listing is
enabled
- store embedding metadata in embedding_model_metadata field on
inference providers
- make model_entries optional on ModelRegistryHelper and
LiteLLMOpenAIMixin
- make OpenAIMixin a ModelRegistryHelper
- skip base64 embedding test for remote::ollama, always returns floats
- only use OpenAI client for ollama model listing
- remove unused build_model_entry function
- remove unused get_huggingface_repo function
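
A hedged sketch of the `allowed_models` behavior described above (the helper is hypothetical; the real filtering happens during dynamic model listing):

```python
def filter_models(model_ids: list[str], allowed_models: list[str] | None) -> list[str]:
    """Hypothetical helper: apply an optional allow-list to discovered models."""
    if allowed_models is None:  # no allow-list: register everything discovered
        return model_ids
    return [m for m in model_ids if m in allowed_models]


assert filter_models(["models/a", "models/b"], None) == ["models/a", "models/b"]
assert filter_models(["models/a", "models/b"], ["models/a"]) == ["models/a"]
```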


## Test Plan

ci w/ new tests
2025-09-25 17:17:00 -04:00
Matthew Farrellee
d6c3b36390
chore: update the gemini inference impl to use openai-python for openai-compat functions (#3351)
# What does this PR do?

update the Gemini inference provider to use openai-python for the
openai-compat endpoints

partially addresses #3349, does not address /inference/completion or
/inference/chat-completion
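
For context, a minimal sketch of the openai-python pattern against Gemini's OpenAI-compatible endpoint (the base URL and model name are assumptions, not taken from the diff):

```python
from openai import OpenAI

# Gemini exposes an OpenAI-compatible surface; openai-python talks to it
# by overriding base_url (URL per Google's docs, assumed here).
client = OpenAI(
    api_key="YOUR_GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)
resp = client.embeddings.create(model="models/text-embedding-004", input="hello")
print(len(resp.data[0].embedding))
```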

## Test Plan

ci
2025-09-06 12:22:20 -07:00
Ashwin Bharambe
9583f468f8
feat(starter)!: simplify starter distro; litellm model registry changes (#2916) 2025-07-25 15:02:04 -07:00
Ashwin Bharambe
928a39d17b
feat(providers): Groq now uses LiteLLM openai-compat (#1303)
Groq has never supported raw completions anyhow, so this makes it easier
to switch it to LiteLLM. Our entire test suite passes.

I also updated all the openai-compat providers so they work with API
keys passed from headers via `provider_data`.
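
For illustration, a sketch of passing a key per request via `provider_data` (the header name, JSON key, endpoint path, and payload shape are assumptions here, not taken from the diff):

```python
import json

import httpx

# The provider_data mechanism reads per-request credentials from a request
# header; the names below are assumptions for illustration only.
headers = {"X-LlamaStack-Provider-Data": json.dumps({"groq_api_key": "gsk_example"})}
response = httpx.post(
    "http://localhost:8321/v1/inference/chat-completion",
    headers=headers,
    json={
        "model_id": "groq/llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": "hello"}],
    },
)
print(response.status_code)
```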

## Test Plan

```bash
LLAMA_STACK_CONFIG=groq \
   pytest -s -v tests/client-sdk/inference/test_text_inference.py \
   --inference-model=groq/llama-3.3-70b-versatile --vision-inference-model=""
```

Also tested (openai, anthropic, gemini) providers. No regressions.
2025-02-27 13:16:50 -08:00
Ashwin Bharambe
63e6acd0c3
feat: add (openai, anthropic, gemini) providers via litellm (#1267)
# What does this PR do?

This PR introduces more non-llama model support to llama stack.
Providers introduced: openai, anthropic and gemini. All of these
providers use essentially the same piece of code -- the implementation
works via the `litellm` library.

We will expose only specific models for providers we enable, making sure
they all work well and pass tests. This setup (instead of automatically
enabling _all_ providers and models allowed by LiteLLM) ensures we can
also perform prompt tuning on a per-model basis as needed (just like we
do for llama models).
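
For context, the litellm pattern this relies on looks roughly like this (model name illustrative):

```python
import litellm

# litellm routes by the provider prefix in the model string, so one code
# path can serve openai, anthropic, and gemini alike.
response = litellm.completion(
    model="gemini/gemini-1.5-flash",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```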

## Test Plan

```bash
#!/bin/bash

args=("$@")
for model in openai/gpt-4o anthropic/claude-3-5-sonnet-latest gemini/gemini-1.5-flash; do
    LLAMA_STACK_CONFIG=dev pytest -s -v tests/client-sdk/inference/test_text_inference.py \
        --embedding-model=all-MiniLM-L6-v2 \
        --vision-inference-model="" \
        --inference-model=$model "${args[@]}"
done
```
2025-02-25 22:07:33 -08:00