llama-stack-mirror/llama_stack
Ilya Kolchinsky e664ba91d8
fix: prevent the knowledge search tool from confusing the model with long content (#1908)
# What does this PR do?
This PR addresses the content dominance problem that frequently arises
with multiple models when executing queries with the RAG tool. When the
retrieved content is too large, it disproportionately influences the
generation process, causing the model to ignore the original question
and to provide meaningless comments on the retrieved information
instead.

This situation is especially common with agentic RAG, which is the
standard way of doing RAG in Llama Stack, since directly manipulating
the prompt combining the query with the retrieved content is not
possible.

This PR appends a grounding message to the results returned by the
knowledge search tool, reminding the model about the original query and
the purpose of the inference call. This makes the problem significantly
less likely to occur.

## Test Plan
Running the following script before the fix demonstrates the content
dominance problem where the model insists to comment on the retrieved
content and refuses to address the question.
Running the script after the fix results in getting the correct answer.
```
import os
import uuid

from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient

# the server endpoint
LLAMA_STACK_SERVER_URL = "http://localhost:8321"

# inference settings
MODEL_ID = ""meta-llama/Llama-3.1-8B-Instruct"
SYSTEM_PROMPT = "You are a helpful assistant. "

# RAG settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512
    
# initialize the server connection
client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL))

# init the RAG retrieval parameters
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
vector_providers = [
    provider for provider in client.providers.list() if provider.api == "vector_io"
]
vector_provider_to_use = vector_providers[0]

# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=VECTOR_DB_EMBEDDING_MODEL,
    embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
    provider_id=vector_provider_to_use.provider_id,
)

# ingest the documents into the newly created document collection
urls = [
    ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)

queries = [
    "How to install OpenShift?",
]

# initializing the agent
agent = Agent(
    client,
    model=MODEL_ID,
    instructions=SYSTEM_PROMPT,
    # we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools
    tools=[
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # list of IDs of document collections to consider during retrieval
            },
        )
    ],
)

for prompt in queries:
    print(f"User> {prompt}")
    
    # create a new turn with a new session ID for each prompt
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=agent.create_session(f"rag-session_{uuid.uuid4()}")
    )
    
    # print the response, including tool calls output
    for log in AgentEventLogger().log(response):
        print(log.content, end='')
```
2025-04-24 16:38:38 +02:00
..
apis feat(agents): add agent naming functionality (#1922) 2025-04-17 07:02:47 -07:00
cli feat: include run.yaml in the container image (#2005) 2025-04-24 11:29:53 +02:00
distribution feat: include run.yaml in the container image (#2005) 2025-04-24 11:29:53 +02:00
models fix: OAI compat endpoint for meta reference inference provider (#1962) 2025-04-17 11:16:04 -07:00
providers fix: prevent the knowledge search tool from confusing the model with long content (#1908) 2025-04-24 16:38:38 +02:00
strong_typing chore: more mypy checks (ollama, vllm, ...) (#1777) 2025-04-01 17:12:39 +02:00
templates feat: Update NVIDIA to GA docs; remove notebook reference until ready (#1999) 2025-04-18 19:13:18 -04:00
__init__.py export LibraryClient 2024-12-13 12:08:00 -08:00
env.py refactor(test): move tools, evals, datasetio, scoring and post training tests (#1401) 2025-03-04 14:53:47 -08:00
log.py chore: Remove style tags from log formatter (#1808) 2025-03-27 10:18:21 -04:00
schema_utils.py fix: dont check protocol compliance for experimental methods 2025-04-12 16:26:32 -07:00