# Configurable Retrieval for RAG in Llama Stack
**Authors:**
- Red Hat: @varshaprasad96 @franciscojavierarceo
## Summary
This RFC proposes expanding the RAG capabilities in Llama Stack to support keyword search, hybrid search, and other retrieval strategies through a backend configuration.
## Motivation
The benefits of pre-retrieval optimization through indexing have been well studied (1, 2, 3), and it has been shown that keyword, vector, hybrid, and other search strategies offer distinct benefits for information retrieval in RAG. Enabling Llama Stack users to easily configure different search modes, while abstracting away the implementation details, offers significant value.
## Scope
**Goals:**
- Create a design to support different search modes across supported databases.
**Non-Goals:**
- Multi-mode stacking: We will focus on a single selectable search mode per database.
**Requirements:**

To support keyword-based searches (and consequently hybrid search), a query string is required. Therefore, the `query` method will need an additional optional parameter, which we propose calling `query_string`.
There are at least three different implementation options available:
- Configure the search mode per application when registering the vector provider, applying the same mode to all queries in the application.
- Configure the search mode per query when querying the vector provider, allowing a different search mode for each query in the application (see the sketch after this list).
- Create a new `keyword_search_io` provider and a separate implementation for each database provider.
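To make the first two options concrete, here is a minimal sketch of each from the caller's side, using the `rag_tool.query` call and the `RAGQueryConfig` extension proposed later in this RFC. The conceptual provider config shown for Option 1 is an assumption for illustration, not a final schema.

```python
# Option 1 (sketch): the search mode is fixed in the provider's config at
# registration time, so application code never passes a mode.
# Provider config, conceptually: {"db_path": "...", "mode": "keyword"}
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id], content="what is torchtune"
)

# Option 2 (sketch): each call chooses its own mode via the query config.
query_config = RAGQueryConfig(max_chunks=5, mode="keyword").model_dump()
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id], content="what is torchtune", query_config=query_config
)
```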
A brief review of the pros and cons of the three options is provided below.
## Implementation Evaluation
### Option 1: Static search mode applied for all queries
**Pros:**
- Easy to configure behind the scenes.
- No need for additional APIs or breaking changes.
**Cons:**

- Less flexible if users want different search modes for different queries.
- Retaining the `VectorIO` naming convention is unintuitive once non-vector search modes are supported.
### Option 2: Dynamic search mode per query
**Pros:**
- Allows queries to be flexible in their desired usage.
- No need for additional APIs or breaking changes.
**Cons:**

- Adds more complexity for API support.
- Potentially exposes more challenges when users are debugging queries.
- The database needs to be customized to store both embeddings and keywords; in certain implementations this could add memory overhead.
- Complicates UX; the user should not have to care about which search mode they are using (e.g., users don't configure their search for OpenAI).
- It is unclear that there are use cases where this is a desirable parameter.
- Retaining the `VectorIO` naming convention is unintuitive once non-vector search modes are supported.
### Option 3: Separate Provider and API
**Pros:**
- Allows maximum configuration.
**Cons:**

- Larger implementation scope.
- Requires providers to implement in three potential places (inline, remote, and `keyword_search_io`).
- Generalizing to hybrid search would logically warrant yet another provider implementation (`hybrid_search_io`).
- Would duplicate a lot of boilerplate code (e.g., provider configuration) across multiple databases, depending on their range of support (listed in the References section).
## Recommendation and Proposal
Based on our review of the above pros and cons, and on how other frameworks approach hybrid and keyword search, we recommend:

**Option 1**: Allow users to configure the search mode through a provider config field, with an additional `query_string` parameter in the API.
### Implementation Detail
We would extend `RAGQueryConfig` to accept a `mode` parameter:
```python
from pydantic import BaseModel, Field

# json_schema_type, RAGQueryGeneratorConfig, and DefaultRAGQueryGeneratorConfig
# are existing Llama Stack definitions.

@json_schema_type
class RAGQueryConfig(BaseModel):
    # This config defines how a query is generated using the messages
    # for memory bank retrieval.
    query_generator_config: RAGQueryGeneratorConfig = Field(
        default=DefaultRAGQueryGeneratorConfig()
    )
    max_tokens_in_context: int = 4096
    max_chunks: int = 5
    mode: str  # e.g., "vector", "keyword", or "hybrid"
```
The query API is modified to accept `query_string` and `mode` parameters within the `EmbeddingIndex` class:
```python
from abc import ABC, abstractmethod
from typing import List, Optional

from numpy.typing import NDArray

# Chunk and QueryChunksResponse are existing Llama Stack types.

class EmbeddingIndex(ABC):
    @abstractmethod
    async def add_chunks(self, chunks: List[Chunk], embeddings: NDArray):
        raise NotImplementedError()

    @abstractmethod
    async def query(
        self,
        embedding: NDArray,
        query_string: Optional[str],
        k: int,
        score_threshold: float,
        mode: Optional[str],
    ) -> QueryChunksResponse:
        raise NotImplementedError()

    @abstractmethod
    async def delete(self):
        raise NotImplementedError()
```
Note: this change requires modifying every implementation of the query API across the supported databases. It is also up to each provider to ensure that a valid mode is provided.
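As a rough illustration of what each provider would take on, here is a minimal sketch of mode validation and dispatch inside a concrete `EmbeddingIndex`. The `_vector_search`, `_keyword_search`, and `_hybrid_search` helpers are hypothetical names, not existing methods.

```python
SUPPORTED_MODES = {"vector", "keyword", "hybrid"}  # per provider; FAISS, e.g., is vector-only

class SQLiteVecIndex(EmbeddingIndex):
    async def query(self, embedding, query_string, k, score_threshold, mode):
        mode = mode or "vector"  # preserve existing behavior when mode is omitted
        if mode not in SUPPORTED_MODES:
            raise ValueError(f"Unsupported search mode: {mode!r}")
        if mode in ("keyword", "hybrid") and not query_string:
            raise ValueError(f"{mode} search requires a non-empty query_string")
        if mode == "keyword":
            return await self._keyword_search(query_string, k, score_threshold)
        if mode == "hybrid":
            return await self._hybrid_search(embedding, query_string, k, score_threshold)
        return await self._vector_search(embedding, k, score_threshold)
```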
After exposing the option to configure `mode`, usage would look like the code below.
**Step 1:**
Querying the database directly:
```python
response = await sqlite_vec_index.query(
    embedding=query_embedding,
    query_string="",
    k=top_k,
    score_threshold=0.0,
    mode="vector",
)
```
With RAGTool:
```python
query_config = RAGQueryConfig(max_chunks=6, mode="vector").model_dump()
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id], content="what is torchtune", query_config=query_config
)
```
**Step 2:**
The eventual goal would be to make this change at the DB registration step, so that the user does not need to provide the search mode with each query.
This change needs to be made in the llama-stack-client and then propagated to the server.
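To make that end state concrete, here is a hypothetical sketch of registration-time configuration; the `mode` argument to `vector_dbs.register` does not exist today and is purely illustrative.

```python
# Hypothetical end state: search mode bound once, at registration.
client.vector_dbs.register(
    vector_db_id="my-docs",
    embedding_model="all-MiniLM-L6-v2",
    provider_id="sqlite-vec",
    mode="hybrid",  # illustrative only; not part of the current API
)

# Queries then omit the mode entirely; the server resolves it from the
# vector DB's registration.
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=["my-docs"], content="what is torchtune"
)
```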
## Benchmarking
To evaluate the impact of configurable retrieval modes, we will benchmark the supported search strategies (keyword, vector, and hybrid) across multiple vector database backends as support for each is implemented.
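As a rough illustration, a minimal recall@k harness over the proposed `EmbeddingIndex.query` signature might look like the following; `eval_set` and the `document_id` metadata key are assumed fixtures for the sketch, not part of this proposal.

```python
async def benchmark_modes(index, eval_set, top_k=5):
    """Compare recall@k across search modes on one backend (sketch only)."""
    for mode in ("vector", "keyword", "hybrid"):
        hits = 0
        for query_text, query_embedding, relevant_ids in eval_set:
            response = await index.query(
                embedding=query_embedding,
                query_string=query_text,
                k=top_k,
                score_threshold=0.0,
                mode=mode,
            )
            retrieved = {chunk.metadata["document_id"] for chunk in response.chunks}
            hits += bool(retrieved & relevant_ids)
        print(f"{mode}: recall@{top_k} = {hits / len(eval_set):.2f}")
```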
## References
Databases and their supported configurations in Llama Stack as of April 11, 2025.
| Database | Vector Search | Keyword Search | Hybrid Search |
|---|---|---|---|
| SQLite (inline) | Yes | Yes | Yes |
| FAISS (inline) | Yes | No | No |
| Chroma (inline, remote) | Yes | Yes | Yes |
| Weaviate (remote) | Yes | Yes | Yes |
| Qdrant (remote) | Yes | Yes | Yes |
| PGVector (remote) | Yes | No | No |
| Milvus (inline, remote) | Yes | Yes | Yes |