# What does this PR do?
If a user has previously serialized data into their vector store without
the `metadata_token_count` in the chunk, the `query` method fails with
a server error. This PR fixes that edge case by returning 0 when the key is
missing. The solution is suboptimal, but I think it is better to
understate the token size than to recalculate it and add unnecessary
complexity to the retrieval code.
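A minimal sketch of the fallback described above. The helper name and the dictionaries are hypothetical and only illustrate the idea; the actual change lives in the retrieval code.

```python
def metadata_token_count(chunk_metadata: dict) -> int:
    """Return the stored metadata token count, or 0 when the key is absent.

    Chunks serialized before `metadata_token_count` existed have no such key,
    so a plain lookup would raise KeyError and surface as a server error.
    """
    return chunk_metadata.get("metadata_token_count", 0)


# Hypothetical chunk metadata: one written before the key existed, one after.
legacy_chunk = {"document_id": "doc-1", "token_count": 120}
current_chunk = {"document_id": "doc-2", "token_count": 120, "metadata_token_count": 14}

legacy_count = metadata_token_count(legacy_chunk)
current_count = metadata_token_count(current_chunk)
```

The trade-off is that legacy chunks are treated as if their metadata cost no tokens, which understates their size but keeps the query path simple.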
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This PR allows users to customize the template used for chunks when they are
inserted into the context. Additionally, it enables metadata injection
into the context of an LLM for RAG. It makes the naive and crude
assumption that each chunk should include the metadata, which is
redundant when multiple chunks are returned from the same
document. Removing that duplication would
require much more significant changes, so this is a reasonable first
step that unblocks users requesting this enhancement in
https://github.com/meta-llama/llama-stack/issues/1767.
In the future, this can be extended to support citations.
List of Changes:
- `llama_stack/apis/tools/rag_tool.py`
  - Added `chunk_template` field in `RAGQueryConfig`.
  - Added `field_validator` to validate the `chunk_template` field in
    `RAGQueryConfig`.
  - Ensured the `chunk_template` field includes placeholders `{index}` and
    `{chunk.content}`.
  - Updated the `query` method to use the `chunk_template` for formatting
    chunk text content.
- `llama_stack/providers/inline/tool_runtime/rag/memory.py`
  - Modified the `insert` method to pass `doc.metadata` for chunk
    creation.
  - Enhanced the `query` method to format results using `chunk_template`
    and exclude unnecessary metadata fields like `token_count`.
- `llama_stack/providers/utils/memory/vector_store.py`
  - Updated `make_overlapped_chunks` to include metadata serialization and
    token count for both content and metadata.
  - Added error handling for metadata serialization issues.
- `pyproject.toml`
  - Added `pydantic.field_validator` as a recognized `classmethod`
    decorator in the linting configuration.
- `tests/integration/tool_runtime/test_rag_tool.py`
  - Refactored test assertions to separate `assert_valid_chunk_response`
    and `assert_valid_text_response`.
  - Added integration tests to validate `chunk_template` functionality
    with and without metadata inclusion.
  - Included a test case to ensure `chunk_template` validation errors are
    raised appropriately.
- `tests/unit/rag/test_vector_store.py`
  - Added unit tests for `make_overlapped_chunks`, verifying chunk
    creation with overlapping tokens and metadata integrity.
  - Added tests to handle metadata serialization errors, ensuring proper
    exception handling.
- `docs/_static/llama-stack-spec.html`
  - Added a new `chunk_template` field of type `string` with a default
    template for formatting retrieved chunks in `RAGQueryConfig`.
  - Updated the `required` fields to include `chunk_template`.
- `docs/_static/llama-stack-spec.yaml`
  - Introduced `chunk_template` field with a default value for
    `RAGQueryConfig`.
  - Updated the required configuration list to include `chunk_template`.
- `docs/source/building_applications/rag.md`
  - Documented the `chunk_template` configuration, explaining how to
    customize metadata formatting in RAG queries.
  - Added examples demonstrating the usage of the `chunk_template` field
    in RAG tool queries.
  - Highlighted default values for `RAG` agent configurations.
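The template validation and formatting listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `rag_tool.py` or `memory.py`: the `Chunk` dataclass, the helper names, the `{metadata}` placeholder, the example template, and the set of excluded keys are all hypothetical. Only the required placeholders `{index}` and `{chunk.content}` come from the list of changes.

```python
from dataclasses import dataclass, field

# Placeholders the PR says every chunk_template must contain.
REQUIRED_PLACEHOLDERS = ("{index}", "{chunk.content}")
# Assumption: bookkeeping keys that should not be shown to the model.
EXCLUDED_METADATA_KEYS = {"token_count", "metadata_token_count"}


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved chunk."""
    content: str
    metadata: dict = field(default_factory=dict)


def validate_chunk_template(template: str) -> str:
    """Reject templates that omit a required placeholder."""
    missing = [p for p in REQUIRED_PLACEHOLDERS if p not in template]
    if missing:
        raise ValueError(f"chunk_template is missing placeholders: {missing}")
    return template


def render_chunk(template: str, index: int, chunk: Chunk) -> str:
    """Format one chunk, dropping bookkeeping metadata first."""
    metadata = {k: v for k, v in chunk.metadata.items() if k not in EXCLUDED_METADATA_KEYS}
    return validate_chunk_template(template).format(index=index, chunk=chunk, metadata=metadata)


# Example template; the wording is illustrative only.
template = "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n"
chunk = Chunk("Great work takes curiosity.", {"author": "Paul Graham", "token_count": 5})
rendered = render_chunk(template, 1, chunk)
```

`str.format` resolves `{chunk.content}` through attribute access, which is why the template can refer to fields of the chunk object directly.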
# Resolves https://github.com/meta-llama/llama-stack/issues/1767
## Test Plan
Updated both `test_vector_store.py` and `test_rag_tool.py` and tested
end-to-end with a script.
I also tested the quickstart to enable this and specified this metadata:
```python
document = RAGDocument(
    document_id="document_1",
    content=source,
    mime_type="text/html",
    metadata={"author": "Paul Graham", "title": "How to do great work"},
)
```
Which produced the output below:

This highlights the usefulness of the additional metadata. Notice how
the metadata is redundant for different chunks of the same document. I
think we can update that in a subsequent PR.
# Documentation
I've added a brief note to the documentation describing this behavior
and updated the API documentation.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This PR fixes the behavior of the `/tool-runtime/rag-tool/query`
endpoint when invoked with an empty `vector_db_ids` parameter.
Currently, the endpoint simply returns an empty result, which leads to a
misleading error message from the server and makes the problem with the
input parameter difficult and time-consuming to detect.
The proposed fix is to return a descriptive error message in this case.
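A minimal sketch of the check, assuming a hypothetical helper name and a plain `ValueError`; the message text is the one shown in the patched output under the test plan.

```python
def check_vector_db_ids(vector_db_ids: list[str]) -> list[str]:
    """Fail fast with a descriptive message instead of returning an empty result."""
    if not vector_db_ids:
        raise ValueError(
            "No vector DBs were provided to the RAG tool. Please provide at least one DB."
        )
    return vector_db_ids


# A non-empty list passes through unchanged; "my_docs" is a made-up ID.
checked = check_vector_db_ids(["my_docs"])
```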
## Test Plan
Running the following script:
```
agent = Agent(
    client,
    model=MODEL_ID,
    instructions=SYSTEM_PROMPT,
    tools=[
        dict(
            name="builtin::rag/knowledge_search",
            args={
                # an empty list triggers the behavior fixed by this PR
                "vector_db_ids": [],
            },
        )
    ],
)
response = agent.create_turn(
    messages=[
        {
            "role": "user",
            "content": "How to install OpenShift?",
        }
    ],
    session_id=agent.create_session("rag-session"),
)
```
results in the following error message in the non-patched version:
```
{"type": "function", "name": "knowledge_search", "parameters": {"query": "installing OpenShift"}}400: Invalid value: Tool call result (id: 494b8020-90bb-449b-aa76-10960d6b2cc2, name: knowledge_search) does not have any content
```
and in the following one in the patched version:
```
{"type": "function", "name": "knowledge_search", "parameters": {"query": "installing OpenShift"}}400: Invalid value: No vector DBs were provided to the RAG tool. Please provide at least one DB.
```
# What does this PR do?
The goal of this PR is to modernize the code base.
The schema reflection code needed a minor adjustment to handle `UnionType`
and `collections.abc.AsyncIterator`. (Both are preferred in recent Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
changes worth noting are under `docs/openapi_generator` and in
`llama_stack/strong_typing/schema.py`, where the reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
This PR addresses the content dominance problem that frequently arises
with multiple models when executing queries with the RAG tool. When the
retrieved content is too large, it disproportionately influences the
generation process, causing the model to ignore the original question
and to provide meaningless comments on the retrieved information
instead.
This situation is especially common with agentic RAG, which is the
standard way of doing RAG in Llama Stack, since directly manipulating
the prompt combining the query with the retrieved content is not
possible.
This PR appends a grounding message to the results returned by the
knowledge search tool, reminding the model about the original query and
the purpose of the inference call. This makes the problem significantly
less likely to occur.
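The idea can be sketched as follows. The function name and the wording of the reminder are hypothetical; the actual message added by this PR may differ.

```python
def append_grounding_message(results: list[str], query: str) -> list[str]:
    """Append a reminder of the original query after the retrieved chunks.

    Placing the reminder last keeps the question close to the point where the
    model starts generating, so large retrieved content is less likely to
    crowd it out.
    """
    reminder = (
        f'The above results were retrieved to help answer the user\'s query: "{query}". '
        "Use them as supporting information only when answering this query."
    )
    return [*results, reminder]


# Made-up retrieved chunks standing in for the knowledge search output.
retrieved = ["Chunk 1: installation prerequisites", "Chunk 2: cluster sizing"]
grounded = append_grounding_message(retrieved, "How to install OpenShift?")
```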
## Test Plan
Running the following script before the fix demonstrates the content
dominance problem: the model insists on commenting on the retrieved
content and refuses to address the question.
Running the script after the fix produces the correct answer.
```
import os
import uuid
from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
# the server endpoint
LLAMA_STACK_SERVER_URL = "http://localhost:8321"
# inference settings
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
SYSTEM_PROMPT = "You are a helpful assistant. "
# RAG settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512
# initialize the server connection
client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL))
# init the RAG retrieval parameters
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
vector_providers = [
    provider for provider in client.providers.list() if provider.api == "vector_io"
]
vector_provider_to_use = vector_providers[0]
# define and register the document collection to be used
client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model=VECTOR_DB_EMBEDDING_MODEL,
embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
provider_id=vector_provider_to_use.provider_id,
)
# ingest the documents into the newly created document collection
urls = [
    ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)
queries = [
"How to install OpenShift?",
]
# initializing the agent
agent = Agent(
    client,
    model=MODEL_ID,
    instructions=SYSTEM_PROMPT,
    # we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools
    tools=[
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # list of IDs of document collections to consider during retrieval
            },
        )
    ],
)
for prompt in queries:
    print(f"User> {prompt}")
    # create a new turn with a new session ID for each prompt
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=agent.create_session(f"rag-session_{uuid.uuid4()}"),
    )
    # print the response, including tool calls output
    for log in AgentEventLogger().log(response):
        print(log.content, end='')
```
# What does this PR do?
Don't return a list for runtime tools. Instead, return a Response object
to support pagination and stay consistent with other APIs.
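A rough sketch of the shape change, with hypothetical class and function names that are not taken from the API definitions: wrapping the list in a response object leaves room to add fields later without changing the return type.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    """Hypothetical stand-in for a runtime tool definition."""
    name: str


@dataclass
class ListToolsResponse:
    """Hypothetical wrapper; pagination fields could be added here later."""
    data: list[Tool]


def list_runtime_tools() -> ListToolsResponse:
    # Previously a bare list would have been returned; callers now read `.data`.
    return ListToolsResponse(data=[Tool(name="knowledge_search")])


tool_names = [tool.name for tool in list_runtime_tools().data]
```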
---------
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Clean up mypy violations for inline::{telemetry,tool_runtime,vector_io}.
This also makes the API accept a tool call result without any content, which
the RAG tool may already produce.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Summary:
Lets the model decide which tool it needs to call to respond to a query.
Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/ --safety-shield meta-llama/Llama-Guard-3-8B
```
Also evaluated on a small benchmark with 20 questions from HotpotQA.
With this PR and some prompting, the performance is 77% recall compared
to 50% currently.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1015).
* #1268
* #1239
* __->__ #1015
Summary:
Allows tools to output metadata. This is useful for evaluating tool
outputs, e.g. RAG tool will output document IDs, which can be used to
score recall.
Will need to make a similar change on the client side to support
ClientTool outputting metadata.
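As a sketch of how such metadata could be used for scoring, the snippet below computes recall from document IDs. The `document_ids` key, the function name, and the sample IDs are assumptions made for illustration.

```python
def recall(retrieved_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of the relevant documents that appear among the retrieved ones."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


# Hypothetical metadata attached to a RAG tool result.
tool_metadata = {"document_ids": ["doc-1", "doc-3", "doc-7"]}
# Hypothetical ground-truth documents for the same question.
score = recall(tool_metadata["document_ids"], ["doc-1", "doc-2"])
```

Here one of the two relevant documents was retrieved, so the recall is 0.5.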
Test Plan:
`LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/agents/test_agents.py`
# What does this PR do?
- Remove hardcoded configurations from pre-commit.
- Allow configuration to be set via pyproject.toml.
- Merge .ruff.toml settings into pyproject.toml.
- Ensure the linter and formatter use the defined configuration instead
of being overridden by pre-commit.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
- Configured ruff linter to automatically fix import sorting issues.
- Set --exit-non-zero-on-fix to ensure non-zero exit code when fixes are
applied.
- Enabled the 'I' selection to focus on import-related linting rules.
- Ran the linter, and formatted all codebase imports accordingly.
- Removed the black dep from the "dev" group since we use ruff
Signed-off-by: Sébastien Han <seb@redhat.com>
The lint check in the main branch is failing. This fixes the lint check after we
moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We
need to move to a `ruff.toml` file and to fix or ignore some
additional checks.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>