From f1fdc1bc23554dea84a85ad5e56389d96709c7b3 Mon Sep 17 00:00:00 2001
From: Kevin Cogan <44865890+kevincogan@users.noreply.github.com>
Date: Mon, 10 Mar 2025 17:17:00 +0000
Subject: [PATCH] Create RFC-0003-reranker-API.md

---
 rfcs/rfcs/RFC-0003-reranker-API.md | 70 ++++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)
 create mode 100644 rfcs/rfcs/RFC-0003-reranker-API.md

diff --git a/rfcs/rfcs/RFC-0003-reranker-API.md b/rfcs/rfcs/RFC-0003-reranker-API.md
new file mode 100644
index 000000000..34ca16ae2
--- /dev/null
+++ b/rfcs/rfcs/RFC-0003-reranker-API.md
@@ -0,0 +1,70 @@
+# RFC: Llama-Stack Reranking for RAG Workflows
+
+**Status:** Draft
+**Author:** Kevin Cogan
+**Start Date:** 2025-02-24
+
+## Table of Contents
+
+1. [Abstract](#abstract)
+2. [Introduction](#introduction)
+3. [Problem Statement](#problem-statement)
+4. [Proposed Reranking Solution](#proposed-reranking-solution)
+   4.1. [Extended API Endpoints](#extended-api-endpoints)
+     4.1.1. [Enhanced Query Endpoint](#enhanced-query-endpoint)
+     4.1.2. [Dedicated Rerank Endpoint](#dedicated-rerank-endpoint)
+   4.2. [Data Models and Schemas](#data-models-and-schemas)
+   4.3. [Implementation in the RAG Runtime](#implementation-in-the-rag-runtime)
+     4.3.1. [Inline Reranking Integration](#inline-reranking-integration)
+     4.3.2. [Reranker Service](#reranker-service)
+   4.4. [Example Configuration and Request](#example-configuration-and-request)
+5. [API Design Overview](#api-design-overview)
+6. [Considerations and Tradeoffs](#considerations-and-tradeoffs)
+7. [Conclusion](#conclusion)
+8. [Approvals](#approvals)
+
+
+## Abstract
+This RFC proposes adding a configurable reranking component to the Llama-Stack Retrieval-Augmented Generation (RAG) system. Many enterprise users rely on legacy keyword search systems in which they have already invested heavily in content synchronization and indexing. In these cases, reranking can improve accuracy by refining search results without requiring a full transition to a vector-based retrieval system. By adding a scoring step, performed by either a remote inference API or a self-hosted model, the system refines document retrieval and provides more precise context for downstream tasks. Users can enable or disable reranking and select a reranker from remote providers (e.g., LlamaRank, Voyage AI, Cohere) or self-hosted models (e.g., sentence-transformers, LLM-based inference). Telemetry is also updated to capture and report reranking metrics for observability and performance tuning.
+
+## Introduction
+Current RAG implementations use embedding-based similarity search to retrieve document candidates; however, the preliminary ordering can be suboptimal for ambiguous or complex queries. For enterprise users who rely on keyword-based search systems, reranking can be especially impactful, as it improves accuracy without requiring a full migration to vector search. This document outlines an approach that provides both API-based reranking and inline reranking, integrating with existing retrieval systems while emphasizing configurability, ease of implementation, and robust telemetry reporting.
+
+## Problem Statement
+Existing RAG systems efficiently index and retrieve document chunks from vector stores, but they often lack a mechanism to refine the initial results. This can leave LLMs with suboptimal context and hinder overall performance. The case for reranking is especially strong for enterprise users relying on legacy keyword search systems, where significant investments have already been made in content synchronization and indexing. In these environments, reranking can substantially improve accuracy by refining the output of established search infrastructure. New vector stores that use state-of-the-art dense models also benefit from reranking, but the improvements tend to be less pronounced and may not justify the additional complexity and latency. Moreover, operational needs differ: some users prefer a managed API solution, while others require inline control for low latency or data privacy.
+
+## Proposed Reranking Solution
+![My Image](https://drive.google.com/uc?id=115BSpFE3UBmEk7ven5Jq4H7EZd6dMD6U)
+
+## 4.1. Extended API Endpoints
+
+### 4.1.1. Enhanced Query Endpoint
+
+The `/tool-runtime/rag-tool/query` endpoint will be updated to accept three additional parameters:
+
+- `rerank` (boolean): Enables the reranking process when set to `true`.
+- `reranker_model_id` (string): Identifies the reranking provider or model to use (e.g., `"sentence-transformers"` or another self-hosted model).
+- `rerank_config` (`Optional[RAGRerankConfig]`): Configures additional options for the reranking process (e.g., `is_external_reranker`, `api_url`, `api_key`).
+
+The updated endpoint interface is as follows:
+
+```python
+from typing import List, Optional, Protocol, runtime_checkable
+
+# webmethod, trace_protocol, InterleavedContent, RAGQueryConfig, RAGQueryResult,
+# and the proposed RAGRerankConfig are assumed to be in scope from Llama-Stack.
+@runtime_checkable
+@trace_protocol
+class RAGToolRuntime(Protocol):
+    @webmethod(route="/tool-runtime/rag-tool/query", method="POST")
+    async def query(
+        self,
+        content: InterleavedContent,
+        vector_db_ids: List[str],
+        query_config: Optional[RAGQueryConfig] = None,
+        rerank: bool = False,  # reranking is opt-in
+        reranker_model_id: str = "my_model_id",  # placeholder default in this draft
+        rerank_config: Optional[RAGRerankConfig] = None,
+    ) -> RAGQueryResult:
+        ...
+```
+> **Note:** When `rerank` is enabled, the service invokes the reranking process using the specified `reranker_model_id` and any additional options defined in `rerank_config`.
+
+
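To make the effect of the `rerank` flag concrete, here is a minimal, self-contained sketch of what the reordering step could look like. It is not the RFC's implementation: `apply_rerank`, `overlap_score`, and the sample chunks are hypothetical names introduced only for illustration, and the toy token-overlap scorer stands in for whatever model `reranker_model_id` would actually select.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score (an assumption, not a real reranker model):
    the fraction of query tokens that also appear in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & chunk_tokens) / len(query_tokens)


def apply_rerank(
    query: str,
    chunks: List[str],
    rerank: bool = False,
    score_fn: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[str, float]]:
    """Return (chunk, score) pairs, reordered by score only when rerank is True."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    if not rerank:
        # Reranking disabled: keep the retriever's original ordering.
        return scored
    # sorted() is stable, so chunks with equal scores keep their retrieval order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical retrieval output, in the order the retriever returned it.
retrieved = [
    "Llamas are domesticated South American camelids.",
    "A reranker rescores retrieved chunks against the query.",
    "Vector stores index document chunks as embeddings.",
]
query_text = "how does a reranker score chunks"
ranked = apply_rerank(query_text, retrieved, rerank=True)
unranked = apply_rerank(query_text, retrieved)
```

With `rerank=False` the retriever's ordering is returned unchanged, mirroring the default of the proposed `query` signature. In a real deployment, `score_fn` would be replaced by a call to the configured remote or self-hosted reranker.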