Add configurable embedding models for vector IO providers

This change lets users configure default embedding models at the provider level instead of always relying on system defaults. Each vector store provider can now specify an embedding_model and optional embedding_dimension in their config.

Key features:
- Auto-dimension lookup for standard models from the registry
- Support for Matryoshka embeddings with custom dimensions
- Three-tier priority: explicit params > provider config > system fallback
- Full backward compatibility - existing setups work unchanged
- Comprehensive test coverage with 20 test cases

Updated all vector IO providers (FAISS, Chroma, Milvus, Qdrant, etc.) with the new config fields and added detailed documentation with examples.

Fixes #2729
Author: skamenan7
Date: 2025-07-15 16:46:40 -04:00
Parent: 2298d2473c
Commit: 474b50b422
28 changed files with 1160 additions and 24 deletions


@@ -0,0 +1,127 @@
# Sample Vector IO Configuration with Embedding Model Defaults
# This example demonstrates the new provider-level embedding configuration features

# Image and version info
version: 3
image_name: my-embedding-app

# APIs to serve
apis:
  - inference
  - vector_io

# Provider configurations
providers:
  # Inference provider for embedding models
  inference:
    - provider_id: local_inference
      provider_type: remote::ollama
      config:
        url: http://localhost:11434

  # Vector IO providers with embedding model defaults
  vector_io:
    # FAISS for fast local search with lightweight embeddings
    - provider_id: fast_local_search
      provider_type: inline::faiss
      config:
        kvstore:
          provider_type: sqlite
          config:
            db_path: ~/.llama/distributions/my-app/faiss_store.db
        # NEW: Default embedding model for this provider
        embedding_model: "all-MiniLM-L6-v2"
        # Dimension auto-lookup: 384 (from model registry)

    # SQLite Vec for lightweight vector storage with Matryoshka embeddings
    - provider_id: compact_storage
      provider_type: inline::sqlite_vec
      config:
        db_path: ~/.llama/distributions/my-app/sqlite_vec.db
        kvstore:
          provider_type: sqlite
          config:
            db_name: sqlite_vec_registry.db
        # Matryoshka embedding with custom dimension
        embedding_model: "nomic-embed-text"
        embedding_dimension: 256  # Reduced from default 768 for efficiency

    # Chroma for persistent local storage
    - provider_id: persistent_search
      provider_type: inline::chroma
      config:
        db_path: ~/.llama/distributions/my-app/chroma.db
        # High-quality embeddings for semantic search
        embedding_model: "sentence-transformers/all-mpnet-base-v2"
        # Auto-lookup dimension from model registry

    # Qdrant Cloud for production-scale search (when available)
    - provider_id: cloud_search
      provider_type: remote::qdrant
      config:
        api_key: "${env.QDRANT_API_KEY}"
        url: "${env.QDRANT_URL}"
        # Production-grade embedding model
        embedding_model: "text-embedding-3-small"
        embedding_dimension: 512  # Custom dimension for performance

# Model registry - ensure embedding models are properly configured
models:
  # Lightweight embedding model (384 dimensions)
  - model_id: all-MiniLM-L6-v2
    provider_id: local_inference
    provider_model_id: sentence-transformers/all-MiniLM-L6-v2
    model_type: embedding
    metadata:
      embedding_dimension: 384
      description: "Fast, lightweight embeddings for general use"

  # Matryoshka embedding model (variable dimensions)
  - model_id: nomic-embed-text
    provider_id: local_inference
    provider_model_id: nomic-embed-text
    model_type: embedding
    metadata:
      embedding_dimension: 768  # Default, can be overridden
      description: "Flexible Matryoshka embeddings supporting variable dimensions"

  # High-quality embedding model (768 dimensions)
  - model_id: sentence-transformers/all-mpnet-base-v2
    provider_id: local_inference
    provider_model_id: sentence-transformers/all-mpnet-base-v2
    model_type: embedding
    metadata:
      embedding_dimension: 768
      description: "High-quality embeddings for semantic search"

  # OpenAI embedding model (for cloud usage)
  - model_id: text-embedding-3-small
    provider_id: openai_inference  # Would need an OpenAI provider configured
    provider_model_id: text-embedding-3-small
    model_type: embedding
    metadata:
      embedding_dimension: 1536  # Default OpenAI dimension
      description: "OpenAI's efficient embedding model"

# Optional: Configure specific vector databases (will use provider defaults)
vector_dbs:
  # Uses fast_local_search provider defaults (all-MiniLM-L6-v2, 384 dims)
  - vector_db_id: general_docs
    provider_id: fast_local_search

  # Uses compact_storage provider defaults (nomic-embed-text, 256 dims)
  - vector_db_id: compressed_knowledge
    provider_id: compact_storage

  # Uses persistent_search provider defaults (all-mpnet-base-v2, 768 dims)
  - vector_db_id: semantic_library
    provider_id: persistent_search

# Server configuration
server:
  host: 0.0.0.0
  port: 5000

# Logging configuration
logging:
  level: INFO


@@ -0,0 +1,302 @@
# Vector IO Embedding Model Configuration
## Overview
Vector IO providers now support configuring default embedding models at the provider level. This allows you to:

- Set a default embedding model for each vector store provider
- Support Matryoshka embeddings with custom dimensions
- Rely on automatic dimension lookup from the model registry
- Maintain backward compatibility with existing configurations
## Configuration Options
### Provider-Level Embedding Configuration
Add `embedding_model` and `embedding_dimension` fields to your vector IO provider configuration:
```yaml
providers:
  vector_io:
    - provider_id: my_faiss_store
      provider_type: inline::faiss
      config:
        kvstore:
          provider_type: sqlite
          config:
            db_path: ~/.llama/distributions/my-app/faiss_store.db
        # NEW: Configure default embedding model
        embedding_model: "all-MiniLM-L6-v2"
        # Optional: Only needed for variable-dimension models
        # embedding_dimension: 384
```
### Embedding Model Selection Priority
The embedding model is resolved through a three-tier priority order:
1. **Explicit API Parameters** (highest priority)
```python
# API call explicitly specifies model - this takes precedence
await vector_io.openai_create_vector_store(
    name="my-store",
    embedding_model="nomic-embed-text",  # Explicit override
    embedding_dimension=256,
)
```
2. **Provider Config Defaults** (middle priority)
```yaml
# Provider config provides default when no explicit model specified
config:
  embedding_model: "all-MiniLM-L6-v2"
  embedding_dimension: 384
```
3. **System Default** (fallback)
```
# Uses first available embedding model from model registry
# Maintains backward compatibility
```
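To make the resolution order concrete, here is a minimal sketch of how a router might pick the model and dimension. The function and parameter names (`resolve_embedding_model`, `lookup_dimension`, the plain-dict registry) are illustrative assumptions, not the actual implementation:

```python
def lookup_dimension(model_id: str, registry_models: list[dict]) -> int:
    """Find a model's embedding_dimension in the registry metadata."""
    for m in registry_models:
        if m["model_id"] == model_id:
            return m["metadata"]["embedding_dimension"]
    raise ValueError(f"Embedding model '{model_id}' not found in model registry")


def resolve_embedding_model(
    explicit_model: str | None,
    explicit_dimension: int | None,
    provider_config: dict,
    registry_models: list[dict],
) -> tuple[str, int]:
    """Illustrative three-tier resolution: explicit params > provider config > registry fallback."""
    # Tier 1: explicit API parameters win outright
    if explicit_model is not None:
        dim = explicit_dimension or lookup_dimension(explicit_model, registry_models)
        return explicit_model, dim
    # Tier 2: provider-level config defaults
    if provider_config.get("embedding_model"):
        model = provider_config["embedding_model"]
        dim = provider_config.get("embedding_dimension") or lookup_dimension(model, registry_models)
        return model, dim
    # Tier 3: first embedding model registered in the model registry
    for m in registry_models:
        if m["model_type"] == "embedding":
            return m["model_id"], m["metadata"]["embedding_dimension"]
    raise ValueError("No embedding model available")
```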
## Provider Examples
### FAISS with Default Embedding Model
```yaml
providers:
  vector_io:
    - provider_id: faiss_store
      provider_type: inline::faiss
      config:
        kvstore:
          provider_type: sqlite
          config:
            db_path: ~/.llama/distributions/my-app/faiss_store.db
        embedding_model: "all-MiniLM-L6-v2"
        # Dimension auto-lookup: 384 (from model registry)
```
### SQLite Vec with Matryoshka Embedding
```yaml
providers:
  vector_io:
    - provider_id: sqlite_vec_store
      provider_type: inline::sqlite_vec
      config:
        db_path: ~/.llama/distributions/my-app/sqlite_vec.db
        kvstore:
          provider_type: sqlite
          config:
            db_name: sqlite_vec_registry.db
        embedding_model: "nomic-embed-text"
        embedding_dimension: 256  # Override default 768 to 256
```
### Chroma with Provider Default
```yaml
providers:
  vector_io:
    - provider_id: chroma_store
      provider_type: inline::chroma
      config:
        db_path: ~/.llama/distributions/my-app/chroma.db
        embedding_model: "sentence-transformers/all-mpnet-base-v2"
        # Auto-lookup dimension from model registry
```
### Remote Qdrant Configuration
```yaml
providers:
  vector_io:
    - provider_id: qdrant_cloud
      provider_type: remote::qdrant
      config:
        api_key: "${env.QDRANT_API_KEY}"
        url: "https://my-cluster.qdrant.tech"
        embedding_model: "text-embedding-3-small"
        embedding_dimension: 512  # Custom dimension for Matryoshka model
```
### Multiple Providers with Different Models
```yaml
providers:
  vector_io:
    # Fast, lightweight embeddings for simple search
    - provider_id: fast_search
      provider_type: inline::faiss
      config:
        kvstore:
          provider_type: sqlite
          config:
            db_path: ~/.llama/fast_search.db
        embedding_model: "all-MiniLM-L6-v2"  # 384 dimensions

    # High-quality embeddings for semantic search
    - provider_id: semantic_search
      provider_type: remote::qdrant
      config:
        api_key: "${env.QDRANT_API_KEY}"
        embedding_model: "text-embedding-3-large"  # 3072 dimensions

    # Flexible Matryoshka embeddings
    - provider_id: flexible_search
      provider_type: inline::chroma
      config:
        db_path: ~/.llama/flexible_search.db
        embedding_model: "nomic-embed-text"
        embedding_dimension: 256  # Reduced from default 768
```
## Model Registry Configuration
Ensure your embedding models are registered in the model registry:
```yaml
models:
  - model_id: all-MiniLM-L6-v2
    provider_id: huggingface
    provider_model_id: sentence-transformers/all-MiniLM-L6-v2
    model_type: embedding
    metadata:
      embedding_dimension: 384
  - model_id: nomic-embed-text
    provider_id: ollama
    provider_model_id: nomic-embed-text
    model_type: embedding
    metadata:
      embedding_dimension: 768  # Default, can be overridden
  - model_id: text-embedding-3-small
    provider_id: openai
    provider_model_id: text-embedding-3-small
    model_type: embedding
    metadata:
      embedding_dimension: 1536  # Default for OpenAI model
```
## API Usage Examples
### Using Provider Defaults
```python
# Uses the embedding model configured in the provider config
vector_store = await vector_io.openai_create_vector_store(
    name="documents",
    provider_id="faiss_store",  # Will use configured embedding_model
)
```
### Explicit Override
```python
# Overrides provider defaults with explicit parameters
vector_store = await vector_io.openai_create_vector_store(
    name="documents",
    embedding_model="text-embedding-3-large",  # Override provider default
    embedding_dimension=1024,  # Custom dimension
    provider_id="faiss_store",
)
```
### Matryoshka Embedding Usage
```python
# Provider configured with nomic-embed-text and dimension 256
vector_store = await vector_io.openai_create_vector_store(
    name="compact_embeddings",
    provider_id="flexible_search",  # Uses Matryoshka config
)

# Or override with a different dimension
vector_store = await vector_io.openai_create_vector_store(
    name="full_embeddings",
    embedding_dimension=768,  # Use full dimension
    provider_id="flexible_search",
)
```
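Matryoshka models are trained so that the leading dimensions carry the most information, which is why a reduced dimension remains useful. A common way to obtain the smaller vector is to truncate and re-normalize the full embedding; the sketch below illustrates that idea only and is not the provider's internal code:

```python
import numpy as np


def truncate_matryoshka(embedding: np.ndarray, target_dim: int) -> np.ndarray:
    """Truncate a Matryoshka embedding to target_dim and re-normalize to unit length."""
    if target_dim <= 0 or target_dim > embedding.shape[-1]:
        raise ValueError(f"target_dim must be in 1..{embedding.shape[-1]}, got {target_dim}")
    truncated = embedding[..., :target_dim]
    norm = np.linalg.norm(truncated, axis=-1, keepdims=True)
    return truncated / norm


# Example: reduce a 768-dim nomic-embed-text vector to the configured 256 dims
full = np.random.rand(768).astype(np.float32)
compact = truncate_matryoshka(full, 256)
assert compact.shape == (256,)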
## Migration Guide
### Updating Existing Configurations
Your existing configurations will continue to work without changes. To add provider-level defaults:
1. **Add embedding model fields** to your provider configs
2. **Test the configuration** to ensure expected behavior
3. **Remove explicit embedding_model parameters** from API calls if desired
### Before (explicit parameters required):
```python
# Had to specify the embedding model every time
await vector_io.openai_create_vector_store(
    name="store1",
    embedding_model="all-MiniLM-L6-v2",
)
```
### After (provider defaults):
```yaml
# Configure once in provider config
config:
  embedding_model: "all-MiniLM-L6-v2"
```
```python
# No need to specify repeatedly
await vector_io.openai_create_vector_store(name="store1")
await vector_io.openai_create_vector_store(name="store2")
await vector_io.openai_create_vector_store(name="store3")
```
## Best Practices
### 1. Model Selection
- Use **lightweight models** (e.g., `all-MiniLM-L6-v2`) for simple semantic search
- Use **high-quality models** (e.g., `text-embedding-3-large`) for complex retrieval
- Consider **Matryoshka models** (e.g., `nomic-embed-text`) for flexible dimension requirements
### 2. Provider Configuration
- Configure embedding models at the **provider level** for consistency
- Use **environment variables** for API keys and sensitive configuration
- Set up **multiple providers** with different models for different use cases
### 3. Dimension Management
- Let the system **auto-lookup dimensions** when possible
- Only specify `embedding_dimension` for **Matryoshka embeddings** or custom requirements
- Ensure **model registry** has correct dimension metadata
### 4. Performance Optimization
- Use **smaller dimensions** for faster search (e.g., 256 instead of 768)
- Consider **multiple vector stores** with different embedding models for different content types
- Test **different embedding models** to find the best balance for your use case
## Troubleshooting
### Common Issues
**Model not found error:**
```
ValueError: Embedding model 'my-model' not found in model registry
```
**Solution:** Ensure the model is registered in your model configuration.
**Missing dimension metadata:**
```
ValueError: Embedding model 'my-model' has no embedding_dimension in metadata
```
**Solution:** Add `embedding_dimension` to the model's metadata in your model registry.
**Invalid dimension override:**
```
ValueError: Override dimension must be positive, got -1
```
**Solution:** Use positive integers for `embedding_dimension` values.
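The three errors above suggest straightforward guards. Here is a hedged sketch of validation logic that would raise them; the function name and dict-based registry are illustrative assumptions:

```python
def validate_embedding_config(
    model_id: str,
    registry: dict[str, dict],
    override_dim: int | None = None,
) -> int:
    """Validate an embedding model reference and return the dimension to use."""
    # Model must exist in the registry
    if model_id not in registry:
        raise ValueError(f"Embedding model '{model_id}' not found in model registry")
    # Registry entry must declare its dimension
    metadata = registry[model_id].get("metadata", {})
    if "embedding_dimension" not in metadata:
        raise ValueError(f"Embedding model '{model_id}' has no embedding_dimension in metadata")
    # Any override must be a positive integer
    if override_dim is not None:
        if override_dim <= 0:
            raise ValueError(f"Override dimension must be positive, got {override_dim}")
        return override_dim
    return metadata["embedding_dimension"]
```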
### Debugging Tips
1. **Check model registry:** Verify embedding models are properly registered
2. **Review provider config:** Ensure `embedding_model` matches registry IDs
3. **Test explicit parameters:** Override provider defaults to isolate issues
4. **Check logs:** Look for embedding model selection messages in router logs