# Configurable Retrieval for RAG in Llama Stack
**Authors:**
* Red Hat: @varshaprasad96 @franciscojavierarceo
## Summary
This RFC proposes expanding the RAG capabilities in Llama Stack to support keyword search, hybrid search, and other
retrieval strategies through a backend configuration.
## Motivation
The benefits of pre-retrieval optimization through indexing have been well
studied ([1][rag-ref-1], [2][rag-ref-2], [3][rag-ref-3]), and it has been shown that keyword, vector, hybrid, and other search strategies offer distinct benefits for
information retrieval in RAG. Enabling Llama Stack users to easily configure different search modes, while abstracting the implementation
details, offers significant value.
## Scope
### Goals:
1. Create a design to support different search modes across supported databases.
### Non-Goals:
1. Multi-mode stacking: We will focus on a single selectable search mode per database.
## Requirements:
To support `keyword`-based searches (and consequently hybrid search), a query string is required. Therefore, the `query`
method will need an additional optional parameter; we propose calling this new parameter `query_string`.
There are at least three different implementation options available:
1. Configure the Search Mode per application when registering the Vector Provider, applying the same mode to all
queries in the application.
2. Configure the Search Mode per query when querying the Vector Provider, allowing a different search mode for each
query in the application.
3. Create a new `keyword_search_io` provider with a separate implementation for each provider.
A brief review of the pros and cons of the three options is provided below.
### Implementation Evaluation
##### Option 1: Static search mode applied for all queries
*Pros*:
1. Easy to configure behind the scenes.
2. No need for additional APIs or breaking changes.
*Cons*:
1. Less flexible if users want to use different search modes for different queries.
2. Retaining the VectorIO naming convention is not intuitive for the codebase.
##### Option 2: Dynamic search mode per query
*Pros*:
1. Allows queries to be flexible in their desired usage.
2. No need for additional APIs or breaking changes.
*Cons*:
1. Adds more complexity for API support.
2. Potentially exposes more challenges when users are debugging queries.
3. The database needs to be customized to store both embeddings and keywords; in certain implementations this could add
memory overhead.
4. Complicates UX; the user should not have to care about which search mode they are using (e.g., users don't configure
their search mode for OpenAI).
5. It is unclear that there are use cases where this is a desirable parameter.
6. Retaining the VectorIO naming convention is not intuitive for the codebase.
##### Option 3: Separate Provider and API
*Pros*:
1. Allows maximum configuration.
*Cons*:
1. Larger implementation scope.
2. Requires providers to implement support in three potential places (inline, remote, and keyword_search_io).
3. Generalization to hybrid search would logically warrant an additional provider implementation (hybrid_search_io).
4. Would duplicate a lot of boilerplate code (e.g., configuration of provider) across multiple databases depending on
their range of support (listed in reference).
## Recommendation and Proposal
Based on our review, the above pros and cons, and how other frameworks approach enabling hybrid and keyword search, we
recommend:
Option 1: Allow users to configure the Search Mode through a Provider Config field and an additional `query_string`
parameter in the API.
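For illustration, a provider config extended in this way might look like the minimal sketch below. The class name and
the `db_path` field are hypothetical placeholders; only the `mode` field is the proposed addition.
```python
from pydantic import BaseModel


class ExampleVectorIOConfig(BaseModel):
    # Hypothetical provider-specific settings.
    db_path: str
    # Proposed: search mode applied to all queries against this provider,
    # e.g. "vector" (default), "keyword", or "hybrid".
    mode: str = "vector"
```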
### Implementation Detail:
We would extend `RAGQueryConfig` to accept a `mode` parameter:
```python
from pydantic import BaseModel, Field

# json_schema_type, RAGQueryGeneratorConfig, and DefaultRAGQueryGeneratorConfig
# are existing Llama Stack symbols.


@json_schema_type
class RAGQueryConfig(BaseModel):
    # This config defines how a query is generated using the messages
    # for memory bank retrieval.
    query_generator_config: RAGQueryGeneratorConfig = Field(
        default=DefaultRAGQueryGeneratorConfig()
    )
    max_tokens_in_context: int = 4096
    max_chunks: int = 5
    # New: selects the retrieval strategy, e.g. "vector" or "keyword".
    mode: str
```
The Query API is modified to accept `query_string` and `mode` parameters within the `EmbeddingIndex` class:
```python
from abc import ABC, abstractmethod
from typing import List, Optional

from numpy.typing import NDArray

# Chunk and QueryChunksResponse are existing Llama Stack types.


class EmbeddingIndex(ABC):
    @abstractmethod
    async def add_chunks(self, chunks: List[Chunk], embeddings: NDArray):
        raise NotImplementedError()

    @abstractmethod
    async def query(
        self,
        embedding: NDArray,
        query_string: Optional[str],
        k: int,
        score_threshold: float,
        mode: Optional[str],
    ) -> QueryChunksResponse:
        raise NotImplementedError()

    @abstractmethod
    async def delete(self):
        raise NotImplementedError()
```
Note: This change requires modifying every implementation of the query API across the supported databases. It is also
up to the provider to ensure that a valid mode is provided.
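A minimal sketch of how a provider might validate and dispatch on `mode` is shown below; the `_keyword_search` and
`_vector_search` helpers are hypothetical names for the provider's internal search paths.
```python
SUPPORTED_MODES = {"vector", "keyword"}


class ExampleIndex:
    async def query(self, embedding, query_string, k, score_threshold, mode):
        mode = mode or "vector"  # fall back to vector search when unset
        if mode not in SUPPORTED_MODES:
            raise ValueError(f"Unsupported search mode: {mode}")
        if mode == "keyword":
            if not query_string:
                raise ValueError("Keyword search requires a non-empty query_string")
            return await self._keyword_search(query_string, k, score_threshold)
        return await self._vector_search(embedding, k, score_threshold)
```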
The implementation after exposing the option to configure `mode` would look like the code below.
#### Step 1:
When querying the database directly:
```python
response = await sqlite_vec_index.query(
    embedding=query_embedding,
    query_string="",  # not needed for vector mode
    k=top_k,
    score_threshold=0.0,
    mode="vector",
)
```
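For comparison, a keyword-mode query would pass the raw query text through `query_string` (a sketch; whether the
embedding is used at all in this mode is provider-specific):
```python
response = await sqlite_vec_index.query(
    embedding=query_embedding,  # may be ignored in pure keyword mode
    query_string="what is torchtune",
    k=top_k,
    score_threshold=0.0,
    mode="keyword",
)
```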
With `RAGTool`:
```python
query_config = RAGQueryConfig(max_chunks=6, mode="vector").model_dump()
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id], content="what is torchtune", query_config=query_config
)
```
#### Step 2:
The eventual goal would be to make this change at the [DB registration][DB_registration] step, so that the user
provides the search mode once at registration rather than with every query.
This change needs to be made in the [llama-stack-client][ls_client] and then propagated to the server.
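A sketch of what registration-time configuration could look like through the client is shown below; the `mode` keyword
argument does not exist today and is shown purely to illustrate the proposal (the other argument values are
illustrative as well).
```python
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="sqlite-vec",
    mode="keyword",  # proposed: fix the search mode for this database
)
```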
### Benchmarking:
To evaluate the impact of configurable retrieval modes, we will benchmark the supported search strategies—keyword,
vector, and hybrid—across multiple vector database backends, as support for each is implemented.
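As a rough sketch of the kind of harness we have in mind (the `index`, `embed`, `queries`, and `relevant_ids` inputs
are placeholders, and the chunk metadata layout is assumed):
```python
async def benchmark_modes(index, embed, queries, relevant_ids, k=5):
    # Measure recall@k for each search mode over a labeled query set.
    for mode in ["vector", "keyword", "hybrid"]:
        hits = 0
        for query, expected_ids in zip(queries, relevant_ids):
            response = await index.query(
                embedding=embed(query),
                query_string=query,
                k=k,
                score_threshold=0.0,
                mode=mode,
            )
            retrieved = {c.metadata.get("document_id") for c in response.chunks}
            hits += bool(retrieved & set(expected_ids))
        print(f"{mode}: recall@{k} = {hits / len(queries):.2f}")
```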
#### References:
##### Databases and their supported configurations in Llama Stack as of April 11, 2025.
| Database | Vector Search | Keyword Search | Hybrid Search |
|-------------------------|---------------|----------------|---------------|
| SQLite (inline) | Yes | Yes | Yes |
| FAISS (inline) | Yes | No | No |
| Chroma (inline, remote) | Yes | Yes | Yes |
| Weaviate (remote) | Yes | Yes | Yes |
| Qdrant (remote) | Yes | Yes | Yes |
| PGVector (remote) | Yes | No | No |
| Milvus (inline, remote) | Yes | Yes | Yes |
[rag-ref-1]: https://arxiv.org/pdf/2404.07220
[rag-ref-2]: https://arxiv.org/pdf/2312.10997
[rag-ref-3]: https://www.onlinescientificresearch.com/articles/optimizing-rag-with-hybrid-search-and-contextual-chunking.pdf
[DB_registration]: https://github.com/meta-llama/llama-stack-client-python/blob/b664564fe1c4771a7872286d0c2ac96c47816939/src/llama_stack_client/resources/vector_dbs.py#L105
[ls_client]: https://github.com/meta-llama/llama-stack-client-python/blob/main/src/llama_stack_client/resources/vector_dbs.py#L105