diff --git a/rfcs/RFC-002-configurable-search-mode.md b/rfcs/RFC-002-configurable-search-mode.md
new file mode 100644
index 000000000..dc0452973
--- /dev/null
+++ b/rfcs/RFC-002-configurable-search-mode.md
@@ -0,0 +1,186 @@
+# Configurable Retrieval for RAG in Llama Stack
+
+**Authors:**
+
+* Red Hat: @varshaprasad96 @franciscojavierarceo
+
+## Summary
+
+This RFC proposes expanding the RAG capabilities in Llama Stack to support keyword search, hybrid search, and other
+retrieval strategies through a backend configuration.
+
+## Motivation
+
+The benefits of pre-retrieval optimization through indexing have been well
+studied ([1][rag-ref-1], [2][rag-ref-2], [3][rag-ref-3]), and it has been shown that keyword, vector, hybrid, and other search strategies offer distinct benefits to
+information retrieval for RAG. Enabling Llama Stack to easily configure different search modes, while abstracting the implementation
+details, offers significant value to Llama Stack users.
+
+## Scope
+
+### Goals:
+
+1. Create a design to support different search modes across supported databases.
+
+### Non-Goals:
+
+1. Multi-mode stacking: We will focus on a single selectable search mode per database.
+
+## Requirements:
+
+To support `keyword`-based searches (and consequently hybrid search), a query string is required. Therefore, the `query`
+method will need an additional optional parameter; we propose calling this new parameter `query_string`.
+
+There are at least three different implementation options available:
+1. Configure the Search Mode per application when registering the Vector Provider, and apply the same mode to all
+queries in the application.
+2. Configure the Search Mode per query when querying the Vector Provider, allowing a different search mode for each
+query in the application.
+3. Create a new `keyword_search_io` provider and create a separate implementation for each provider.
+
+A brief review of the pros and cons of the three options is provided below.
+
+### Implementation Evaluation
+
+##### Option 1: Static search mode applied for all queries
+*Pros*:
+1. Easy to configure behind the scenes.
+2. No need for additional APIs or breaking changes.
+
+*Cons*:
+1. Less flexible if users want to use different search modes for different queries.
+2. Retaining the VectorIO naming convention is not intuitive for the codebase.
+
+##### Option 2: Dynamic search mode per query
+*Pros*:
+1. Allows queries to be flexible in their desired usage.
+2. No need for additional APIs or breaking changes.
+
+*Cons*:
+1. Adds more complexity for API support.
+2. Potentially exposes more challenges when users are debugging queries.
+3. The database needs to be customized to store both embeddings and keywords. In certain implementations this could add
+memory overhead.
+4. Complicates UX: the user should not have to care about which search mode they are using (e.g., users don’t configure
+their search for OpenAI).
+5. Unclear whether there are use cases for which this is a desirable parameter.
+6. Retaining the VectorIO naming convention is not intuitive for the codebase.
+
+##### Option 3: Separate Provider and API
+*Pros*:
+1. Allows maximum configuration.
+
+*Cons*:
+1. Larger implementation scope.
+2. Requires providers to implement in 3 potential places (inline, remote, and keyword_search_io).
+3. Generalization to hybrid search would logically warrant an additional provider implementation (hybrid_search_io).
+4. Would duplicate a lot of boilerplate code (e.g., configuration of provider) across multiple databases depending on
+their range of support (listed in the References section).
+
+## Recommendation and Proposal
+
+Based on our review, the above pros and cons, and how other frameworks approach enabling hybrid and keyword search, we
+recommend:
+Option 1: Allow users to configure the Search Mode through a Provider Config field and an additional `query_string`
+parameter in the API.
+
+### Implementation Detail:
+
+We would extend `RAGQueryConfig` to accept a `mode` parameter:
+
+```python
+@json_schema_type
+class RAGQueryConfig(BaseModel):
+    # This config defines how a query is generated using the messages
+    # for memory bank retrieval.
+    query_generator_config: RAGQueryGeneratorConfig = Field(
+        default=DefaultRAGQueryGeneratorConfig()
+    )
+    max_tokens_in_context: int = 4096
+    max_chunks: int = 5
+    mode: str
+```
+
+The query API is modified to accept the `query_string` and `mode` parameters in the `EmbeddingIndex` class:
+
+```python
+class EmbeddingIndex(ABC):
+    @abstractmethod
+    async def add_chunks(self, chunks: List[Chunk], embeddings: NDArray):
+        raise NotImplementedError()
+
+    @abstractmethod
+    async def query(
+        self,
+        embedding: NDArray,
+        query_string: Optional[str],
+        k: int,
+        score_threshold: float,
+        mode: Optional[str],
+    ) -> QueryChunksResponse:
+        raise NotImplementedError()
+
+    @abstractmethod
+    async def delete(self):
+        raise NotImplementedError()
+```
+
+Note: This change requires modifying every database implementation of the query API. It is also up
+to the provider to ensure that a valid mode is provided.
+
+The implementation after exposing the option to configure `mode` would look like the code below.
+
+#### Step 1:
+
+When querying the database directly:
+
+```python
+response = await sqlite_vec_index.query(
+    embedding=query_embedding,
+    query_string="",
+    k=top_k,
+    score_threshold=0.0,
+    mode="vector",
+)
+```
+
+With `RAGTool`:
+
+```python
+query_config = RAGQueryConfig(max_chunks=6, mode="vector").model_dump()
+results = client.tool_runtime.rag_tool.query(
+    vector_db_ids=[vector_db_id], content="what is torchtune", query_config=query_config
+)
+```
+
+#### Step 2:
+
+The eventual goal would be to make this change at the [DB registration][DB_registration] step, so that the user no longer
+needs to provide the search mode with each query.
+
+This change needs to be made in the [llama-stack-client][ls_client] and then propagated to the server.
+
+### Benchmarking:
+
+To evaluate the impact of configurable retrieval modes, we will benchmark the supported search strategies (keyword,
+vector, and hybrid) across multiple vector database backends, as support for each is implemented.
+
+#### References:
+
+##### Databases and their supported configurations in Llama Stack as of April 11, 2025.
+
+| Database                | Vector Search | Keyword Search | Hybrid Search |
+|-------------------------|---------------|----------------|---------------|
+| SQLite (inline)         | Yes           | Yes            | Yes           |
+| FAISS (inline)          | Yes           | No             | No            |
+| Chroma (inline, remote) | Yes           | Yes            | Yes           |
+| Weaviate (remote)       | Yes           | Yes            | Yes           |
+| Qdrant (remote)         | Yes           | Yes            | Yes           |
+| PGVector (remote)       | Yes           | No             | No            |
+| Milvus (inline, remote) | Yes           | Yes            | Yes           |
+
+[rag-ref-1]: https://arxiv.org/pdf/2404.07220
+[rag-ref-2]: https://arxiv.org/pdf/2312.10997
+[rag-ref-3]: https://www.onlinescientificresearch.com/articles/optimizing-rag-with-hybrid-search-and-contextual-chunking.pdf
+[DB_registration]: https://github.com/meta-llama/llama-stack-client-python/blob/b664564fe1c4771a7872286d0c2ac96c47816939/src/llama_stack_client/resources/vector_dbs.py#L105
+[ls_client]: https://github.com/meta-llama/llama-stack-client-python/blob/main/src/llama_stack_client/resources/vector_dbs.py#L105
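
The RFC notes that it is up to each provider to ensure that a valid mode is provided, and that keyword and hybrid search require a query string. A minimal sketch of such a check is shown below. It is not part of the diff and not an existing Llama Stack API: `SearchMode`, `SUPPORTED_MODES`, `resolve_mode`, the short provider keys, and the choice of `vector` as the default when no mode is given are all assumptions made for illustration. The support matrix is transcribed from the table above.

```python
from enum import Enum
from typing import Optional


class SearchMode(str, Enum):
    """Hypothetical enum of the three search modes named in the RFC."""

    VECTOR = "vector"
    KEYWORD = "keyword"
    HYBRID = "hybrid"


_ALL_MODES = frozenset(SearchMode)
_VECTOR_ONLY = frozenset({SearchMode.VECTOR})

# Transcribed from the RFC's support table (as of April 11, 2025).
# The lowercase keys are illustrative names, not real provider IDs.
SUPPORTED_MODES = {
    "sqlite": _ALL_MODES,
    "faiss": _VECTOR_ONLY,
    "chroma": _ALL_MODES,
    "weaviate": _ALL_MODES,
    "qdrant": _ALL_MODES,
    "pgvector": _VECTOR_ONLY,
    "milvus": _ALL_MODES,
}


def resolve_mode(
    provider: str,
    mode: Optional[str],
    query_string: Optional[str] = None,
) -> SearchMode:
    """Validate a requested mode against a provider's supported modes."""
    if provider not in SUPPORTED_MODES:
        raise ValueError(f"unknown provider: {provider!r}")
    try:
        # Assumption: a missing mode falls back to vector search.
        resolved = SearchMode(mode) if mode is not None else SearchMode.VECTOR
    except ValueError:
        raise ValueError(f"unknown search mode: {mode!r}") from None
    if resolved not in SUPPORTED_MODES[provider]:
        raise ValueError(f"{provider} does not support {resolved.value} search")
    # Keyword and hybrid search both need the raw query text.
    if resolved is not SearchMode.VECTOR and not query_string:
        raise ValueError(f"{resolved.value} search requires a non-empty query_string")
    return resolved
```

A provider could call a helper like this at the top of its `query` implementation, or once at registration time if the mode is configured statically as in Option 1, so that unsupported combinations fail early with a clear message.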