## Using Retrieval Augmented Generation (RAG)
RAG enables your applications to reference and recall information from previous interactions or external documents.
Llama Stack organizes the APIs that enable RAG into three layers:

- The lowermost APIs deal with raw storage and retrieval. These include Vector IO, KeyValue IO (coming soon), and Relational IO (also coming soon).
- Next is the "RAG Tool", a first-class tool in the Tools API that lets you ingest documents (from URLs, files, etc.) with various chunking strategies and query them smartly.
- Finally, it all comes together with the top-level "Agents" API, which lets you create agents that can use these tools to answer questions, perform tasks, and more.

<img src="rag.png" alt="RAG System" width="50%">
The RAG system uses lower-level storage for different types of data:

* **Vector IO**: For semantic search and retrieval
* **Key-Value and Relational IO**: For structured data storage

We may add more storage types like Graph IO in the future.

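To make the Vector IO layer concrete: semantic retrieval boils down to comparing embedding vectors and ranking stored chunks by similarity to the query. The sketch below uses tiny hand-made vectors as stand-ins for real model embeddings (an actual model such as all-MiniLM-L6-v2 produces 384-dimensional vectors), and is only an illustration of the idea, not the faiss implementation:

```python
import math


def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" standing in for real model output
docs = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]

# Vector IO retrieval ranks stored chunks by similarity to the query embedding
ranked = sorted(docs, key=lambda d: cosine_similarity(docs[d], query), reverse=True)
print(ranked[0])  # doc1 is closest to the query
```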
### Setting up Vector DBs

Here's how to set up a vector database for RAG:

```python
# Register a vector db
vector_db_id = "my_documents"
response = client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model="all-MiniLM-L6-v2",
    embedding_dimension=384,
    provider_id="faiss",
)

# You can insert a pre-chunked document directly into the vector db
chunks = [
    {
        "document_id": "doc1",
        "content": "Your document text here",
        "mime_type": "text/plain",
    },
]
client.vector_io.insert(vector_db_id=vector_db_id, chunks=chunks)

# You can then query for these chunks
chunks_response = client.vector_io.query(
    vector_db_id=vector_db_id, query="What do you know about..."
)
```
### Using the RAG Tool

A better way to ingest documents is to use the RAG Tool. This tool allows you to ingest documents from URLs, files, etc., and automatically chunks them into smaller pieces.

```python
from llama_stack_client.types import Document

urls = ["memory_optimizations.rst", "chat.rst", "llama3.rst"]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

# Query documents
results = client.tool_runtime.rag_tool.query(
    vector_db_ids=[vector_db_id],
    content="What do you know about...",
)
```
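The `chunk_size_in_tokens` parameter controls how each document is split before its pieces are embedded. A rough illustration of token-based chunking is below; whitespace-separated words stand in for real tokenizer tokens, so this is a sketch of the idea, not the actual splitting logic used by the RAG Tool:

```python
def chunk_by_tokens(text, chunk_size_in_tokens):
    # Naive whitespace "tokens" stand in for the model tokenizer
    tokens = text.split()
    return [
        " ".join(tokens[i : i + chunk_size_in_tokens])
        for i in range(0, len(tokens), chunk_size_in_tokens)
    ]


text = "one two three four five six seven"
print(chunk_by_tokens(text, 3))  # ['one two three', 'four five six', 'seven']
```

Smaller chunks give more precise retrieval hits; larger chunks preserve more surrounding context per hit.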
### Building RAG-Enhanced Agents

One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:

```python
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.lib.agents.agent import Agent

# Configure agent with memory
agent_config = AgentConfig(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a helpful assistant",
    enable_session_persistence=False,
    toolgroups=[
        {
            "name": "builtin::rag/knowledge_search",
            "args": {
                "vector_db_ids": [vector_db_id],
            },
        }
    ],
)

agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")

# Ask questions about documents in the vector db, and the agent
# will query the db to answer the question.
response = agent.create_turn(
    messages=[{"role": "user", "content": "How to optimize memory in PyTorch?"}],
    session_id=session_id,
)
```
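Conceptually, a turn with the `knowledge_search` tool follows a retrieve-then-generate pattern: query the configured vector DBs, assemble the retrieved chunks into context, and pass that context to the model. The schematic below uses stand-in functions and a plain dict as a toy vector DB; it is not the actual Llama Stack internals, just the shape of the loop:

```python
def retrieve(query, vector_dbs):
    # Stand-in for the knowledge_search tool: gather chunks from each configured DB
    return [chunk for db in vector_dbs for chunk in db.get(query, [])]


def generate(prompt):
    # Stand-in for the model call
    return f"Answer based on: {prompt}"


def rag_turn(user_message, vector_dbs):
    context = retrieve(user_message, vector_dbs)
    prompt = "\n".join(context) + "\n\nQuestion: " + user_message
    return generate(prompt)


toy_db = {"How to optimize memory in PyTorch?": ["Use activation checkpointing."]}
print(rag_turn("How to optimize memory in PyTorch?", [toy_db]))
```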
> **NOTE:** the `instructions` field in the `AgentConfig` can be used to guide the agent's behavior. It is important to experiment with different instructions to see what works best for your use case.

You can also pass documents along with the user's message and ask questions about them.

```python
# Initial document ingestion
response = agent.create_turn(
    messages=[
        {"role": "user", "content": "I am providing some documents for reference."}
    ],
    documents=[
        {
            "content": "https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/memory_optimizations.rst",
            "mime_type": "text/plain",
        }
    ],
    session_id=session_id,
)

# Query with RAG
response = agent.create_turn(
    messages=[{"role": "user", "content": "What are the key topics in the documents?"}],
    session_id=session_id,
)
```
### Unregistering Vector DBs

If you need to clean up and unregister vector databases, you can do so as follows:

```python
# Unregister a specified vector database
vector_db_id = "my_vector_db_id"
print(f"Unregistering vector database: {vector_db_id}")
client.vector_dbs.unregister(vector_db_id=vector_db_id)

# Unregister all vector databases
for vector_db_id in client.vector_dbs.list():
    print(f"Unregistering vector database: {vector_db_id.identifier}")
    client.vector_dbs.unregister(vector_db_id=vector_db_id.identifier)
```