docs: Adding initial updates to the RAG documentation and examples (#4377)

# What does this PR do?
This PR updates the RAG examples in docs/quick_start.ipynb,
docs/getting_started/demo_script.py, rag.mdx, and index.md to remove
references to the deprecated vector_io and vector_db APIs and to add
examples that use /v1/vector_stores with the Responses and Chat Completions APIs.

---------

Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
Omar Abdelwahab 2025-12-12 19:59:39 -08:00 committed by GitHub
parent 75ef052545
commit dfb9f6743a
4 changed files with 625 additions and 69 deletions


@@ -24,9 +24,66 @@ llama stack list-deps starter | xargs -L1 uv pip install
llama stack run starter
```
### 2. Choose Your Approach
Llama Stack supports various approaches for building RAG applications. The server provides two APIs (Responses and Chat Completions), plus a high-level client wrapper (Agent class):
#### Approach 1: Agent Class (Client-Side)
The **Agent class** is a high-level client wrapper around the Responses API with automatic tool execution and session management. Best for conversational agents and multi-turn RAG.
```python
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
import requests
from io import BytesIO
client = LlamaStackClient(base_url="http://localhost:8321")
# Create vector store
vs = client.vector_stores.create(name="my_vector_db")
# Upload document
url = "https://www.paulgraham.com/greatwork.html"
response = requests.get(url)
file_buffer = BytesIO(response.content)
file_buffer.name = "greatwork.html"
file = client.files.create(file=file_buffer, purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file.id)
# Create agent with file_search tool (client-side wrapper)
agent = Agent(
client,
model="ollama/llama3.2:3b",
instructions="You are a helpful assistant",
tools=[
{
"type": "file_search",
"vector_store_ids": [vs.id], # Agent searches this automatically
}
],
)
# Just ask - agent handles retrieval automatically
response = agent.create_turn(
messages=[{"role": "user", "content": "How do you do great work?"}],
session_id=agent.create_session("my_session"),
stream=True,
)
for log in AgentEventLogger().log(response):
print(log, end="")
```
**How it works:**
- Client-side `Agent` class wraps the Responses API
- Agent automatically decides when to search the vector store
- Uses internal Python API for vector search (no HTTP overhead)
- Maintains conversation context across turns
- Best for: Interactive applications, chatbots, multi-turn conversations
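The multi-turn behavior comes from reusing a single session ID across calls to `create_turn`. A minimal sketch, assuming the `client`, `vs`, and `agent` objects from the example above (the follow-up question is only illustrative):

```python
# Reuse one session so the second turn can build on the first (sketch).
session_id = agent.create_session("rag_session")

for question in [
    "How do you do great work?",  # first turn triggers file_search
    "Summarize that answer in one sentence.",  # relies on session context
]:
    response = agent.create_turn(
        messages=[{"role": "user", "content": question}],
        session_id=session_id,
        stream=True,
    )
    for log in AgentEventLogger().log(response):
        print(log, end="")
```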
#### Approach 2: Responses API
The **Responses API** is a server-side API that handles retrieval automatically through built-in tool calling. In another terminal, use the standard OpenAI client:
```python
import io, requests
from openai import OpenAI
url = "https://www.paulgraham.com/greatwork.html"
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
# Create vector store
vs = client.vector_stores.create()
response = requests.get(url)
pseudo_file = io.BytesIO(str(response.content).encode('utf-8'))
file_id = client.files.create(file=(url, pseudo_file, "text/html"), purpose="assistants").id
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file_id)
# Automatic tool calling (calls Responses API directly)
resp = client.responses.create(
model="gpt-4o",
input="How do you do great work? Use the existing knowledge_search tool.",
input="How do you do great work?",
tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
include=["file_search_call.results"],
)
print(resp.output[-1].content[-1].text)
```
**How it works:**
- Server-side API with automatic tool calling
- Uses internal Python API for vector search
- No built-in session management (stateless by default)
- Best for: Single-turn queries, OpenAI-compatible applications
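Because the API is stateless by default, a follow-up question needs its context supplied explicitly. One option is chaining with `previous_response_id`; a minimal sketch, assuming the `client`, `vs`, and `resp` objects from the example above and that your server supports response chaining:

```python
# Sketch: chain a follow-up onto the previous response (assumes `resp` from
# the example above and server support for previous_response_id).
followup = client.responses.create(
    model="gpt-4o",
    input="Summarize your previous answer in two sentences.",
    previous_response_id=resp.id,
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
)
print(followup.output[-1].content[-1].text)
```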
#### Approach 3: Chat Completions API
The **Chat Completions API** is a server-side API that gives you explicit control over retrieval and generation. Best for custom RAG pipelines and batch processing.
```python
import io, requests
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
# Create vector store and add documents
vs = client.vector_stores.create()
# ... upload and add files ...
# Explicitly search vector store via REST API
query = "How do you do great work?"
search_results = client.vector_stores.search(
    vector_store_id=vs.id,
    query=query,
    max_num_results=3,
)

# Manually extract context (each result's content is a list of content items)
context = "\n\n".join(
    item.text
    for result in search_results.data
    for item in (result.content or [])
    if getattr(item, "text", None)
)
# Manually construct prompt with context
completion = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Use the provided context to answer questions."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
]
)
print(completion.choices[0].message.content)
```
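Because retrieval and generation are explicit here, this approach extends naturally to batch processing. A minimal sketch, assuming the `client` and `vs` objects from the example above (the question list is illustrative):

```python
# Sketch: run the search-then-generate loop over several queries.
questions = [
    "How do you do great work?",
    "What role does curiosity play?",
]

for q in questions:
    results = client.vector_stores.search(
        vector_store_id=vs.id, query=q, max_num_results=3
    )
    context = "\n\n".join(
        item.text
        for r in results.data
        for item in (r.content or [])
        if getattr(item, "text", None)
    )
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Use the provided context to answer questions."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {q}"},
        ],
    )
    print(f"{q}\n{answer.choices[0].message.content}\n")
```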


@@ -4,24 +4,132 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""
Demo script showing RAG with both Responses API and Chat Completions API.
This example demonstrates two approaches to RAG with Llama Stack:
1. Responses API - Automatic agentic tool calling with file search
2. Chat Completions API - Manual retrieval with explicit control
Run this script after starting a Llama Stack server:
llama stack run starter
"""
import io
import requests
from openai import OpenAI
url="https://www.paulgraham.com/greatwork.html"
# Initialize OpenAI client pointing to Llama Stack server
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
# Shared setup: Create vector store and upload document
print("=" * 80)
print("SETUP: Creating vector store and uploading document")
print("=" * 80)
url = "https://www.paulgraham.com/greatwork.html"
print(f"Fetching document from: {url}")
vs = client.vector_stores.create()
print(f"Vector store created: {vs.id}")
response = requests.get(url)
pseudo_file = io.BytesIO(str(response.content).encode("utf-8"))
uploaded_file = client.files.create(
file=(url, pseudo_file, "text/html"), purpose="assistants"
)
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
print(f"File uploaded and added to vector store: {uploaded_file.id}")
query = "How do you do great work?"
# ============================================================================
# APPROACH 1: Responses API (Recommended for most use cases)
# ============================================================================
print("\n" + "=" * 80)
print("APPROACH 1: Responses API (Automatic Tool Calling)")
print("=" * 80)
print(f"Query: {query}\n")
resp = client.responses.create(
model="openai/gpt-4o",
input="How do you do great work? Use the existing knowledge_search tool.",
model="ollama/llama3.2:3b", # feel free to change this to any other model
input=query,
tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
include=["file_search_call.results"],
)
print("Response (Responses API):")
print("-" * 80)
print(resp.output[-1].content[-1].text)
print("-" * 80)
# ============================================================================
# APPROACH 2: Chat Completions API
# ============================================================================
print("\n" + "=" * 80)
print("APPROACH 2: Chat Completions API (Manual Retrieval)")
print("=" * 80)
print(f"Query: {query}\n")
# Step 1: Search vector store explicitly
print("Searching vector store...")
search_results = client.vector_stores.search(
vector_store_id=vs.id, query=query, max_num_results=3, rewrite_query=False
)
# Step 2: Extract context from search results
context_chunks = []
for result in search_results.data:
# result.content is a list of Content objects, extract the text from each
if hasattr(result, "content") and result.content:
for content_item in result.content:
if hasattr(content_item, "text") and content_item.text:
context_chunks.append(content_item.text)
context = "\n\n".join(context_chunks)
print(f"Found {len(context_chunks)} relevant chunks\n")
# Step 3: Use Chat Completions with retrieved context
print("Generating response with chat completions...")
completion = client.chat.completions.create(
model="ollama/llama3.2:3b", # Feel free to change this to any other model
messages=[
{
"role": "system",
"content": "You are a helpful assistant. Use the provided context to answer the user's question.",
},
{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {query}\n\nPlease provide a comprehensive answer based on the context above.",
},
],
temperature=0.7,
)
print("Response (Chat Completions API):")
print("-" * 80)
print(completion.choices[0].message.content)
print("-" * 80)
# ============================================================================
# Summary
# ============================================================================
print("\n" + "=" * 80)
print("SUMMARY")
print("=" * 80)
print(
"""
Both approaches successfully performed RAG:
1. Responses API:
- Automatic tool calling (model decides when to search)
- Simpler code, less control
- Best for: Conversational agents, automatic workflows
2. Chat Completions API:
- Manual retrieval (you control the search)
- More code, more control
- Best for: Custom RAG patterns, batch processing, specialized workflows
"""
)


@@ -220,6 +220,20 @@ Methods:
## VectorIo
:::warning DEPRECATED API
**This API is deprecated and will be removed in a future version.**
Use the OpenAI-compatible [Vector Stores API](#vectorstores) instead:
- Instead of `client.vector_io.insert()`, use `client.vector_stores.files.create()` and `client.vector_stores.files.chunks.create()`
- Instead of `client.vector_io.query()`, use `client.vector_stores.search()`
See the [RAG documentation](/docs/building_applications/rag) for migration examples.
Related: [Issue #2981](https://github.com/meta-llama/llama-stack/issues/2981)
:::
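As a rough guide to the insert/query migration, here is a minimal sketch using only Vector Stores calls shown elsewhere in these docs (the store name, file contents, and query are illustrative; adapt to your deployment):

```python
# Sketch of the Vector Stores equivalents of the deprecated vector_io flow.
from io import BytesIO
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Replaces vector_io.insert(): upload a file and attach it to a vector store
vs = client.vector_stores.create(name="migrated_store")
buf = BytesIO(b"Some document text to index.")
buf.name = "notes.txt"
uploaded = client.files.create(file=buf, purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded.id)

# Replaces vector_io.query(): search the vector store
results = client.vector_stores.search(
    vector_store_id=vs.id, query="What do the notes say?", max_num_results=3
)
print(results.data)
```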
@@ -233,6 +247,22 @@ Methods:
## VectorDBs
:::warning DEPRECATED API
**This API is deprecated and will be removed in a future version.**
Use the OpenAI-compatible [Vector Stores API](#vectorstores) instead:
- Instead of `client.vector_dbs.register()`, use `client.vector_stores.create()`
- Instead of `client.vector_dbs.list()`, use `client.vector_stores.list()`
- Instead of `client.vector_dbs.retrieve()`, use `client.vector_stores.retrieve()`
- Instead of `client.vector_dbs.unregister()`, use `client.vector_stores.delete()`
See the [RAG documentation](/docs/building_applications/rag) for migration examples.
Related: [Issue #2981](https://github.com/meta-llama/llama-stack/issues/2981)
:::
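As a quick reference for the lifecycle calls, a minimal sketch on the Vector Stores side (assumes an OpenAI-compatible `client` pointed at your Llama Stack server):

```python
# Sketch of the vector_dbs -> vector_stores lifecycle mapping.
vs = client.vector_stores.create(name="my_store")  # was vector_dbs.register()
print([s.id for s in client.vector_stores.list()])  # was vector_dbs.list()
print(client.vector_stores.retrieve(vs.id))  # was vector_dbs.retrieve()
client.vector_stores.delete(vs.id)  # was vector_dbs.unregister()
```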