# Configurable Retrieval for RAG in Llama Stack
**Authors:**
- Red Hat: @varshaprasad96 @franciscojavierarceo
## Summary
This RFC proposes expanding the RAG capabilities in Llama Stack to support keyword search, hybrid search, and other retrieval strategies through a backend configuration.
## Motivation
The benefits of pre-retrieval optimization through indexing have been well studied (1, 2, 3), and it has been shown that keyword, vector, hybrid, and other search strategies offer distinct benefits for information retrieval in RAG. Enabling Llama Stack users to easily configure different search modes, while abstracting away the implementation details, offers significant value.
## Scope
**Goals:**
- Create a design to support different search modes across supported databases.
**Non-Goals:**
- Multi-mode stacking: We will focus on a single selectable search mode per database.
**Requirements:**

To support keyword-based searches (and consequently hybrid search), a query string is required. Therefore, the `query` method will need an additional optional parameter, which we propose calling `query_string`.
There are at least three different implementation options available:
- Configure the search mode per application when registering the vector provider, applying the same mode to all queries in the application.
- Configure the search mode per query when querying the vector provider, allowing a different search mode for each query in the application (see the sketch after this list).
- Create a new `keyword_search_io` provider and a separate implementation for each database provider.
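To make the first two options concrete, here is a minimal sketch of each from the caller's side, using the `rag_tool.query` call and the `RAGQueryConfig` extension proposed later in this RFC. The conceptual provider config shown for Option 1 is an assumption for illustration, not a final schema.

```python
# Option 1 (sketch): the search mode is fixed in the provider's config at
# registration time, so application code never passes a mode.
# Provider config, conceptually: {"db_path": "...", "mode": "keyword"}
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id], content="what is torchtune"
)

# Option 2 (sketch): each call chooses its own mode via the query config.
query_config = RAGQueryConfig(max_chunks=5, mode="keyword").model_dump()
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id], content="what is torchtune", query_config=query_config
)
```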
A brief review of the pros and cons of the three options is provided below.
## Implementation Evaluation
### Option 1: Static search mode applied for all queries
**Pros:**
- Easy to configure behind the scenes.
- No need for additional APIs or breaking changes.
**Cons:**

- Less flexible if users want different search modes for different queries.
- Retaining the `VectorIO` naming convention is unintuitive once non-vector search modes are supported.
### Option 2: Dynamic search mode per query
**Pros:**
- Allows queries to be flexible in their desired usage.
- No need for additional APIs or breaking changes.
**Cons:**

- Adds more complexity for API support.
- Potentially exposes more challenges when users are debugging queries.
- The database needs to be customized to store both embeddings and keywords; in certain implementations this could add memory overhead.
- Complicates UX; the user should not have to care about which search mode they are using (e.g., users don't configure their search for OpenAI).
- It is unclear that there are use cases where this is a desirable parameter.
- Retaining the `VectorIO` naming convention is unintuitive once non-vector search modes are supported.
### Option 3: Separate Provider and API
**Pros:**
- Allows maximum configuration.
**Cons:**

- Larger implementation scope.
- Requires providers to implement in three potential places (inline, remote, and `keyword_search_io`).
- Generalizing to hybrid search would logically warrant yet another provider implementation (`hybrid_search_io`).
- Would duplicate a lot of boilerplate code (e.g., provider configuration) across multiple databases, depending on their range of support (listed in the References section).
## Recommendation and Proposal
Based on our review of the above pros and cons, and on how other frameworks approach hybrid and keyword search, we recommend:

**Option 1**: Allow users to configure the search mode through a provider config field, with an additional `query_string` parameter in the API.
### Implementation Detail
We would extend `RAGQueryConfig` to accept a `mode` parameter:
```python
from pydantic import BaseModel, Field

# json_schema_type, RAGQueryGeneratorConfig, and DefaultRAGQueryGeneratorConfig
# are existing Llama Stack definitions.

@json_schema_type
class RAGQueryConfig(BaseModel):
    # This config defines how a query is generated using the messages
    # for memory bank retrieval.
    query_generator_config: RAGQueryGeneratorConfig = Field(
        default=DefaultRAGQueryGeneratorConfig()
    )
    max_tokens_in_context: int = 4096
    max_chunks: int = 5
    mode: str  # e.g., "vector", "keyword", or "hybrid"
```
The query API is modified to accept `query_string` and `mode` parameters within the `EmbeddingIndex` class:
```python
from abc import ABC, abstractmethod
from typing import List, Optional

from numpy.typing import NDArray

# Chunk and QueryChunksResponse are existing Llama Stack types.

class EmbeddingIndex(ABC):
    @abstractmethod
    async def add_chunks(self, chunks: List[Chunk], embeddings: NDArray):
        raise NotImplementedError()

    @abstractmethod
    async def query(
        self,
        embedding: NDArray,
        query_string: Optional[str],
        k: int,
        score_threshold: float,
        mode: Optional[str],
    ) -> QueryChunksResponse:
        raise NotImplementedError()

    @abstractmethod
    async def delete(self):
        raise NotImplementedError()
```
Note: this change requires modifying every implementation of the query API across the supported databases. It is also up to each provider to ensure that a valid mode is provided.
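As a rough illustration of what each provider would take on, here is a minimal sketch of mode validation and dispatch inside a concrete `EmbeddingIndex`. The `_vector_search`, `_keyword_search`, and `_hybrid_search` helpers are hypothetical names, not existing methods.

```python
SUPPORTED_MODES = {"vector", "keyword", "hybrid"}  # per provider; FAISS, e.g., is vector-only

class SQLiteVecIndex(EmbeddingIndex):
    async def query(self, embedding, query_string, k, score_threshold, mode):
        mode = mode or "vector"  # preserve existing behavior when mode is omitted
        if mode not in SUPPORTED_MODES:
            raise ValueError(f"Unsupported search mode: {mode!r}")
        if mode in ("keyword", "hybrid") and not query_string:
            raise ValueError(f"{mode} search requires a non-empty query_string")
        if mode == "keyword":
            return await self._keyword_search(query_string, k, score_threshold)
        if mode == "hybrid":
            return await self._hybrid_search(embedding, query_string, k, score_threshold)
        return await self._vector_search(embedding, k, score_threshold)
```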
After exposing the option to configure `mode`, usage would look like the code below.
**Step 1:**
Querying the database directly:
```python
response = await sqlite_vec_index.query(
    embedding=query_embedding,
    query_string="",
    k=top_k,
    score_threshold=0.0,
    mode="vector",
)
```
With RAGTool:
```python
query_config = RAGQueryConfig(max_chunks=6, mode="vector").model_dump()
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id], content="what is torchtune", query_config=query_config
)
```
**Step 2:**
The eventual goal would be to make this change at the DB registration step, so that the user does not need to provide the search mode with each query.
This change needs to be made in the llama-stack-client and then propagated to the server.
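To make that end state concrete, here is a hypothetical sketch of registration-time configuration; the `mode` argument to `vector_dbs.register` does not exist today and is purely illustrative.

```python
# Hypothetical end state: search mode bound once, at registration.
client.vector_dbs.register(
    vector_db_id="my-docs",
    embedding_model="all-MiniLM-L6-v2",
    provider_id="sqlite-vec",
    mode="hybrid",  # illustrative only; not part of the current API
)

# Queries then omit the mode entirely; the server resolves it from the
# vector DB's registration.
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=["my-docs"], content="what is torchtune"
)
```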
## Benchmarking
To evaluate the impact of configurable retrieval modes, we will benchmark the supported search strategies (keyword, vector, and hybrid) across multiple vector database backends as support for each is implemented.
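As a rough illustration, a minimal recall@k harness over the proposed `EmbeddingIndex.query` signature might look like the following; `eval_set` and the `document_id` metadata key are assumed fixtures for the sketch, not part of this proposal.

```python
async def benchmark_modes(index, eval_set, top_k=5):
    """Compare recall@k across search modes on one backend (sketch only)."""
    for mode in ("vector", "keyword", "hybrid"):
        hits = 0
        for query_text, query_embedding, relevant_ids in eval_set:
            response = await index.query(
                embedding=query_embedding,
                query_string=query_text,
                k=top_k,
                score_threshold=0.0,
                mode=mode,
            )
            retrieved = {chunk.metadata["document_id"] for chunk in response.chunks}
            hits += bool(retrieved & relevant_ids)
        print(f"{mode}: recall@{top_k} = {hits / len(eval_set):.2f}")
```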
## References
Databases and their supported configurations in Llama Stack as of April 11, 2025.
| Database | Vector Search | Keyword Search | Hybrid Search |
|---|---|---|---|
| SQLite (inline) | Yes | Yes | Yes |
| FAISS (inline) | Yes | No | No |
| Chroma (inline, remote) | Yes | Yes | Yes |
| Weaviate (remote) | Yes | Yes | Yes |
| Qdrant (remote) | Yes | Yes | Yes |
| PGVector (remote) | Yes | No | No |
| Milvus (inline, remote) | Yes | Yes | Yes |