docs: Adding initial updates to the RAG documentation and examples (#4377)

# What does this PR do?
This PR updates the RAG examples in docs/quick_start.ipynb,
docs/getting_started/demo_script.py, rag.mdx, and index.md to remove
references to the deprecated vector_io and vector_db APIs and to add
examples that use /v1/vector_stores with the Responses and Chat Completions APIs.

---------

Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
Omar Abdelwahab 2025-12-12 19:59:39 -08:00 committed by GitHub
parent 75ef052545
commit dfb9f6743a
4 changed files with 625 additions and 69 deletions


@@ -24,9 +24,66 @@ llama stack list-deps starter | xargs -L1 uv pip install
llama stack run starter
```
### 2. Choose Your Approach
Llama Stack supports various approaches for building RAG applications. The server provides two APIs (Responses and Chat Completions), plus a high-level client wrapper (Agent class):
#### Approach 1: Agent Class (Client-Side)
The **Agent class** is a high-level client wrapper around the Responses API with automatic tool execution and session management. Best for conversational agents and multi-turn RAG.
```python
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
import requests
from io import BytesIO
client = LlamaStackClient(base_url="http://localhost:8321")
# Create vector store
vs = client.vector_stores.create(name="my_vector_db")
# Upload document
url = "https://www.paulgraham.com/greatwork.html"
response = requests.get(url)
file_buffer = BytesIO(response.content)
file_buffer.name = "greatwork.html"
file = client.files.create(file=file_buffer, purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file.id)
# Create agent with file_search tool (client-side wrapper)
agent = Agent(
client,
model="ollama/llama3.2:3b",
instructions="You are a helpful assistant",
tools=[
{
"type": "file_search",
"vector_store_ids": [vs.id], # Agent searches this automatically
}
],
)
# Just ask - agent handles retrieval automatically
response = agent.create_turn(
messages=[{"role": "user", "content": "How do you do great work?"}],
session_id=agent.create_session("my_session"),
stream=True,
)
for log in AgentEventLogger().log(response):
print(log, end="")
```
**How it works:**
- Client-side `Agent` class wraps the Responses API
- Agent automatically decides when to search the vector store
- Uses internal Python API for vector search (no HTTP overhead)
- Maintains conversation context across turns
- Best for: Interactive applications, chatbots, multi-turn conversations
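The multi-turn behavior comes from reusing a single session ID across calls to `create_turn`. A minimal sketch, assuming the `client`, `vs`, and `agent` objects from the example above (the follow-up question is only illustrative):

```python
# Reuse one session so the second turn can build on the first (sketch).
session_id = agent.create_session("rag_session")

for question in [
    "How do you do great work?",  # first turn triggers file_search
    "Summarize that answer in one sentence.",  # relies on session context
]:
    response = agent.create_turn(
        messages=[{"role": "user", "content": question}],
        session_id=session_id,
        stream=True,
    )
    for log in AgentEventLogger().log(response):
        print(log, end="")
```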
#### Approach 2: Responses API
The **Responses API** is a server-side API that handles retrieval automatically through built-in tool calling. In another terminal, use the standard OpenAI client:
```python
import io, requests
from openai import OpenAI
url = "https://www.paulgraham.com/greatwork.html"
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
# Create vector store
vs = client.vector_stores.create()
response = requests.get(url)
pseudo_file = io.BytesIO(str(response.content).encode('utf-8'))
file_id = client.files.create(file=(url, pseudo_file, "text/html"), purpose="assistants").id
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file_id)
# Automatic tool calling (calls Responses API directly)
resp = client.responses.create(
model="gpt-4o",
input="How do you do great work? Use the existing knowledge_search tool.",
input="How do you do great work?",
tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
include=["file_search_call.results"],
)
print(resp.output[-1].content[-1].text)
```
**How it works:**
- Server-side API with automatic tool calling
- Uses internal Python API for vector search
- No built-in session management (stateless by default)
- Best for: Single-turn queries, OpenAI-compatible applications
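Because the API is stateless by default, a follow-up question needs its context supplied explicitly. One option is chaining with `previous_response_id`; a minimal sketch, assuming the `client`, `vs`, and `resp` objects from the example above and that your server supports response chaining:

```python
# Sketch: chain a follow-up onto the previous response (assumes `resp` from
# the example above and server support for previous_response_id).
followup = client.responses.create(
    model="gpt-4o",
    input="Summarize your previous answer in two sentences.",
    previous_response_id=resp.id,
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
)
print(followup.output[-1].content[-1].text)
```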
#### Approach 3: Chat Completions API
The **Chat Completions API** is a server-side API that gives you explicit control over retrieval and generation. Best for custom RAG pipelines and batch processing.
```python
import io, requests
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
# Create vector store and add documents
vs = client.vector_stores.create()
# ... upload and add files ...
# Explicitly search vector store via REST API
query = "How do you do great work?"
search_results = client.vector_stores.search(
    vector_store_id=vs.id,
    query=query,
    max_num_results=3,
)

# Manually extract context (each result's content is a list of content items)
context = "\n\n".join(
    item.text
    for result in search_results.data
    for item in (result.content or [])
    if getattr(item, "text", None)
)
# Manually construct prompt with context
completion = client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "Use the provided context to answer questions."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
]
)
print(completion.choices[0].message.content)
```
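Because retrieval and generation are explicit here, this approach extends naturally to batch processing. A minimal sketch, assuming the `client` and `vs` objects from the example above (the question list is illustrative):

```python
# Sketch: run the search-then-generate loop over several queries.
questions = [
    "How do you do great work?",
    "What role does curiosity play?",
]

for q in questions:
    results = client.vector_stores.search(
        vector_store_id=vs.id, query=q, max_num_results=3
    )
    context = "\n\n".join(
        item.text
        for r in results.data
        for item in (r.content or [])
        if getattr(item, "text", None)
    )
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Use the provided context to answer questions."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {q}"},
        ],
    )
    print(f"{q}\n{answer.choices[0].message.content}\n")
```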


@@ -4,24 +4,132 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""
Demo script showing RAG with both Responses API and Chat Completions API.
This example demonstrates two approaches to RAG with Llama Stack:
1. Responses API - Automatic agentic tool calling with file search
2. Chat Completions API - Manual retrieval with explicit control
Run this script after starting a Llama Stack server:
llama stack run starter
"""
import io
import requests
from openai import OpenAI
url="https://www.paulgraham.com/greatwork.html"
# Initialize OpenAI client pointing to Llama Stack server
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
# Shared setup: Create vector store and upload document
print("=" * 80)
print("SETUP: Creating vector store and uploading document")
print("=" * 80)
url = "https://www.paulgraham.com/greatwork.html"
print(f"Fetching document from: {url}")
vs = client.vector_stores.create()
print(f"Vector store created: {vs.id}")
response = requests.get(url)
pseudo_file = io.BytesIO(str(response.content).encode("utf-8"))
uploaded_file = client.files.create(
file=(url, pseudo_file, "text/html"), purpose="assistants"
)
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
print(f"File uploaded and added to vector store: {uploaded_file.id}")
query = "How do you do great work?"
# ============================================================================
# APPROACH 1: Responses API (Recommended for most use cases)
# ============================================================================
print("\n" + "=" * 80)
print("APPROACH 1: Responses API (Automatic Tool Calling)")
print("=" * 80)
print(f"Query: {query}\n")
resp = client.responses.create(
model="openai/gpt-4o",
input="How do you do great work? Use the existing knowledge_search tool.",
model="ollama/llama3.2:3b", # feel free to change this to any other model
input=query,
tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
include=["file_search_call.results"],
)
print("Response (Responses API):")
print("-" * 80)
print(resp.output[-1].content[-1].text)
print("-" * 80)
# ============================================================================
# APPROACH 2: Chat Completions API
# ============================================================================
print("\n" + "=" * 80)
print("APPROACH 2: Chat Completions API (Manual Retrieval)")
print("=" * 80)
print(f"Query: {query}\n")
# Step 1: Search vector store explicitly
print("Searching vector store...")
search_results = client.vector_stores.search(
vector_store_id=vs.id, query=query, max_num_results=3, rewrite_query=False
)
# Step 2: Extract context from search results
context_chunks = []
for result in search_results.data:
# result.content is a list of Content objects, extract the text from each
if hasattr(result, "content") and result.content:
for content_item in result.content:
if hasattr(content_item, "text") and content_item.text:
context_chunks.append(content_item.text)
context = "\n\n".join(context_chunks)
print(f"Found {len(context_chunks)} relevant chunks\n")
# Step 3: Use Chat Completions with retrieved context
print("Generating response with chat completions...")
completion = client.chat.completions.create(
model="ollama/llama3.2:3b", # Feel free to change this to any other model
messages=[
{
"role": "system",
"content": "You are a helpful assistant. Use the provided context to answer the user's question.",
},
{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {query}\n\nPlease provide a comprehensive answer based on the context above.",
},
],
temperature=0.7,
)
print("Response (Chat Completions API):")
print("-" * 80)
print(completion.choices[0].message.content)
print("-" * 80)
# ============================================================================
# Summary
# ============================================================================
print("\n" + "=" * 80)
print("SUMMARY")
print("=" * 80)
print(
"""
Both approaches successfully performed RAG:
1. Responses API:
- Automatic tool calling (model decides when to search)
- Simpler code, less control
- Best for: Conversational agents, automatic workflows
2. Chat Completions API:
- Manual retrieval (you control the search)
- More code, more control
- Best for: Custom RAG patterns, batch processing, specialized workflows
"""
)


@@ -220,6 +220,20 @@ Methods:
## VectorIo
:::warning DEPRECATED API
**This API is deprecated and will be removed in a future version.**
Use the OpenAI-compatible [Vector Stores API](#vectorstores) instead:
- Instead of `client.vector_io.insert()`, use `client.vector_stores.files.create()` and `client.vector_stores.files.chunks.create()`
- Instead of `client.vector_io.query()`, use `client.vector_stores.search()`
See the [RAG documentation](/docs/building_applications/rag) for migration examples.
Related: [Issue #2981](https://github.com/meta-llama/llama-stack/issues/2981)
:::
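As a rough guide to the insert/query migration, here is a minimal sketch using only Vector Stores calls shown elsewhere in these docs (the store name, file contents, and query are illustrative; adapt to your deployment):

```python
# Sketch of the Vector Stores equivalents of the deprecated vector_io flow.
from io import BytesIO
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Replaces vector_io.insert(): upload a file and attach it to a vector store
vs = client.vector_stores.create(name="migrated_store")
buf = BytesIO(b"Some document text to index.")
buf.name = "notes.txt"
uploaded = client.files.create(file=buf, purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded.id)

# Replaces vector_io.query(): search the vector store
results = client.vector_stores.search(
    vector_store_id=vs.id, query="What do the notes say?", max_num_results=3
)
print(results.data)
```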
@@ -233,6 +247,22 @@ Methods:
## VectorDBs
:::warning DEPRECATED API
**This API is deprecated and will be removed in a future version.**
Use the OpenAI-compatible [Vector Stores API](#vectorstores) instead:
- Instead of `client.vector_dbs.register()`, use `client.vector_stores.create()`
- Instead of `client.vector_dbs.list()`, use `client.vector_stores.list()`
- Instead of `client.vector_dbs.retrieve()`, use `client.vector_stores.retrieve()`
- Instead of `client.vector_dbs.unregister()`, use `client.vector_stores.delete()`
See the [RAG documentation](/docs/building_applications/rag) for migration examples.
Related: [Issue #2981](https://github.com/meta-llama/llama-stack/issues/2981)
:::
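As a quick reference for the lifecycle calls, a minimal sketch on the Vector Stores side (assumes an OpenAI-compatible `client` pointed at your Llama Stack server):

```python
# Sketch of the vector_dbs -> vector_stores lifecycle mapping.
vs = client.vector_stores.create(name="my_store")  # was vector_dbs.register()
print([s.id for s in client.vector_stores.list()])  # was vector_dbs.list()
print(client.vector_stores.retrieve(vs.id))  # was vector_dbs.retrieve()
client.vector_stores.delete(vs.id)  # was vector_dbs.unregister()
```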