# Configurable Retrieval for RAG in Llama Stack

**Authors:**

* Red Hat: @varshaprasad96 @franciscojavierarceo

## Summary

This RFC proposes expanding the RAG capabilities in Llama Stack to support keyword search, hybrid search, and other
retrieval strategies through a backend configuration.

## Motivation

The benefits of pre-retrieval optimization through indexing have been well
studied ([1][rag-ref-1], [2][rag-ref-2], [3][rag-ref-3]), and it has been shown that keyword, vector, hybrid, and
other search strategies offer distinct benefits for information retrieval in RAG. Enabling Llama Stack to easily
configure different search modes, while abstracting the implementation details, offers significant value to Llama
Stack users.

## Scope

### Goals:

1. Create a design to support different search modes across supported databases.

### Non-Goals:

1. Multi-mode stacking: we will focus on a single selectable search mode per database.

## Requirements:

To support `keyword`-based searches (and consequently hybrid search), a query string is required. The `query`
method will therefore need an additional optional parameter, which we propose calling `query_string`.

There are at least three implementation options available:

1. Configure the Search Mode per application when registering the Vector Provider, applying the same mode to all
   queries in the application.
2. Configure the Search Mode per query when querying the Vector Provider, allowing a different search mode for each
   query in the application.
3. Create a new `keyword_search_io` provider and a separate implementation for each provider.

A brief review of the pros and cons of the three options is provided below.

### Implementation Evaluation

##### Option 1: Static search mode applied for all queries

*Pros*:
1. Easy to configure behind the scenes.
2. No need for additional APIs or breaking changes.

*Cons*:
1. Less flexible if users want different search modes for different queries.
2. Retaining the VectorIO naming convention is not intuitive for the codebase.

##### Option 2: Dynamic search mode per query

*Pros*:
1. Allows queries to be flexible in their desired usage.
2. No need for additional APIs or breaking changes.

*Cons*:
1. Adds more complexity for API support.
2. Potentially exposes more challenges when users are debugging queries.
3. The database needs to be customized to store both embeddings and keywords, which in certain implementations could
   add memory overhead.
4. Complicates UX: the user should not have to care about which search mode they are using (e.g., users don't
   configure their search for OpenAI).
5. Unclear that there are use cases in which this is a desirable parameter.
6. Retaining the VectorIO naming convention is not intuitive for the codebase.

##### Option 3: Separate Provider and API

*Pros*:
1. Allows maximum configuration.

*Cons*:
1. Larger implementation scope.
2. Requires providers to implement in three potential places (inline, remote, and `keyword_search_io`).
3. Generalizing to hybrid search would logically warrant yet another provider implementation (`hybrid_search_io`).
4. Would duplicate a lot of boilerplate code (e.g., provider configuration) across multiple databases, depending on
   their range of support (see the references below).

## Recommendation and Proposal

Based on our review, the pros and cons above, and how other frameworks approach enabling hybrid and keyword search, we
recommend:

Option 1: Allow users to configure the Search Mode through a Provider Config field and an additional `query_string`
parameter in the API.

### Implementation Detail:

We would extend the `RAGQueryConfig` to accept a `mode` parameter:

```python
@json_schema_type
class RAGQueryConfig(BaseModel):
    # This config defines how a query is generated using the messages
    # for memory bank retrieval.
    query_generator_config: RAGQueryGeneratorConfig = Field(
        default=DefaultRAGQueryGeneratorConfig()
    )
    max_tokens_in_context: int = 4096
    max_chunks: int = 5
    mode: str  # e.g., "vector", "keyword", or "hybrid"
```

The Query API is modified to accept `query_string` and `mode` parameters within the `EmbeddingIndex` class:

```python
class EmbeddingIndex(ABC):
    @abstractmethod
    async def add_chunks(self, chunks: List[Chunk], embeddings: NDArray):
        raise NotImplementedError()

    @abstractmethod
    async def query(
        self,
        embedding: NDArray,
        query_string: Optional[str],
        k: int,
        score_threshold: float,
        mode: Optional[str],
    ) -> QueryChunksResponse:
        raise NotImplementedError()

    @abstractmethod
    async def delete(self):
        raise NotImplementedError()
```

Note: This change requires that all implementations of the `query` API across the supported databases be modified. It
is also up to each provider to ensure that a valid mode is provided.

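For illustration, a provider's `query` implementation might dispatch on `mode` along these lines. This is a minimal
sketch; the helpers `_vector_search` and `_keyword_search` are hypothetical placeholders, not existing methods in the
codebase:

```python
async def query(
    self,
    embedding: NDArray,
    query_string: Optional[str],
    k: int,
    score_threshold: float,
    mode: Optional[str],
) -> QueryChunksResponse:
    # Default to vector search when no mode is specified.
    if mode is None or mode == "vector":
        return await self._vector_search(embedding, k, score_threshold)
    elif mode == "keyword":
        # Keyword search operates on the raw query text rather than an embedding.
        if not query_string:
            raise ValueError("keyword search requires a non-empty query_string")
        return await self._keyword_search(query_string, k, score_threshold)
    raise ValueError(f"Unsupported search mode: {mode}")
```
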
After exposing the option to configure `mode`, usage would look like the code below.

#### Step 1:

When querying the database directly:

```python
response = await sqlite_vec_index.query(
    embedding=query_embedding,
    query_string="",
    k=top_k,
    score_threshold=0.0,
    mode="vector",
)
```
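
A keyword search over the same index would instead pass the raw query text via `query_string` (the argument values
here are illustrative):

```python
response = await sqlite_vec_index.query(
    embedding=query_embedding,
    query_string="what is torchtune",
    k=top_k,
    score_threshold=0.0,
    mode="keyword",
)
```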

With `RAGTool`:

```python
query_config = RAGQueryConfig(max_chunks=6, mode="vector").model_dump()
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id],
    content="what is torchtune",
    query_config=query_config,
)
```

#### Step 2:

The eventual goal would be to make this change at the [DB registration][DB_registration] step, so that the user does
not need to provide the search mode with each query.

This change needs to be made in the [llama-stack-client][ls_client] and then propagated to the server.
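
As a sketch of what registration-time configuration could look like (the `search_mode` argument is the hypothetical
field proposed by this RFC; it is not part of the current client API, and the other argument values are illustrative):

```python
# Hypothetical: the search mode is fixed once at registration time,
# so individual queries no longer need to specify it.
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="sqlite-vec",
    search_mode="keyword",  # hypothetical field proposed by this RFC
)
```
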
### Benchmarking:

To evaluate the impact of configurable retrieval modes, we will benchmark the supported search strategies (keyword,
vector, and hybrid) across multiple vector database backends as support for each is implemented.

#### References:

##### Databases and their supported configurations in Llama Stack as of April 11, 2025.

| Database                | Vector Search | Keyword Search | Hybrid Search |
|-------------------------|---------------|----------------|---------------|
| SQLite (inline)         | Yes           | Yes            | Yes           |
| FAISS (inline)          | Yes           | No             | No            |
| Chroma (inline, remote) | Yes           | Yes            | Yes           |
| Weaviate (remote)       | Yes           | Yes            | Yes           |
| Qdrant (remote)         | Yes           | Yes            | Yes           |
| PGVector (remote)       | Yes           | No             | No            |
| Milvus (inline, remote) | Yes           | Yes            | Yes           |

[rag-ref-1]: https://arxiv.org/pdf/2404.07220
[rag-ref-2]: https://arxiv.org/pdf/2312.10997
[rag-ref-3]: https://www.onlinescientificresearch.com/articles/optimizing-rag-with-hybrid-search-and-contextual-chunking.pdf
[DB_registration]: https://github.com/meta-llama/llama-stack-client-python/blob/b664564fe1c4771a7872286d0c2ac96c47816939/src/llama_stack_client/resources/vector_dbs.py#L105
[ls_client]: https://github.com/meta-llama/llama-stack-client-python/blob/main/src/llama_stack_client/resources/vector_dbs.py#L105