mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-12-17 11:22:35 +00:00
docs: Adding initial updates to the RAG documentation and examples (#4377)
Some checks failed
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 2s
Integration Tests (Replay) / generate-matrix (push) Successful in 4s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s
Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped
API Conformance Tests / check-schema-compatibility (push) Successful in 12s
Python Package Build Test / build (3.12) (push) Successful in 18s
Python Package Build Test / build (3.13) (push) Successful in 22s
Test External API and Providers / test-external (venv) (push) Failing after 37s
Vector IO Integration Tests / test-matrix (push) Failing after 46s
UI Tests / ui-tests (22) (push) Successful in 1m23s
Unit Tests / unit-tests (3.12) (push) Failing after 1m48s
Unit Tests / unit-tests (3.13) (push) Failing after 1m50s
Pre-commit / pre-commit (22) (push) Successful in 3m31s
Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 4m20s
# What does this PR do?

This PR updates the RAG examples included in docs/quick_start.ipynb, docs/getting_started/demo_script.py, rag.mdx and index.md to remove references to the deprecated vector_io and vector_db APIs and to add examples that use /v1/vector_stores with responses and completions.

---------

Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
This commit is contained in:
parent 75ef052545
commit dfb9f6743a
4 changed files with 625 additions and 69 deletions
@@ -24,9 +24,66 @@ llama stack list-deps starter | xargs -L1 uv pip install

llama stack run starter
```

-### 2. Connect with OpenAI Client
### 2. Choose Your Approach

Llama Stack supports three approaches for building RAG applications: the server provides two APIs (Responses and Chat Completions), plus a high-level client wrapper (the Agent class):

#### Approach 1: Agent Class (Client-Side)

The **Agent class** is a high-level client wrapper around the Responses API with automatic tool execution and session management. Best for conversational agents and multi-turn RAG.

```python
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
import requests
from io import BytesIO

client = LlamaStackClient(base_url="http://localhost:8321")

# Create vector store
vs = client.vector_stores.create(name="my_vector_db")

# Upload document
url = "https://www.paulgraham.com/greatwork.html"
response = requests.get(url)
file_buffer = BytesIO(response.content)
file_buffer.name = "greatwork.html"

file = client.files.create(file=file_buffer, purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file.id)

# Create agent with file_search tool (client-side wrapper)
agent = Agent(
    client,
    model="ollama/llama3.2:3b",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vs.id],  # Agent searches this automatically
        }
    ],
)

# Just ask - agent handles retrieval automatically
response = agent.create_turn(
    messages=[{"role": "user", "content": "How do you do great work?"}],
    session_id=agent.create_session("my_session"),
    stream=True,
)

for log in AgentEventLogger().log(response):
    print(log, end="")
```

**How it works:**
- Client-side `Agent` class wraps the Responses API
- Agent automatically decides when to search the vector store
- Uses internal Python API for vector search (no HTTP overhead)
- Maintains conversation context across turns (see the sketch after this list)
- Best for: Interactive applications, chatbots, multi-turn conversations
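
To make the multi-turn point concrete, here is a minimal sketch of a follow-up turn that reuses the same session. It assumes the `client`, `agent`, and vector store created in the example above; the non-streaming `create_turn` call and the `output_message.content` field follow the common llama-stack-client pattern, so verify them against your installed client version.

```python
# Sketch: reusing one session across turns (assumes the agent defined above).
session_id = agent.create_session("my_session")

first = agent.create_turn(
    messages=[{"role": "user", "content": "How do you do great work?"}],
    session_id=session_id,
    stream=False,  # non-streaming: returns a completed turn
)
print(first.output_message.content)

# Because the session is reused, the agent can resolve "that advice"
# from the previous turn without restating the question.
follow_up = agent.create_turn(
    messages=[{"role": "user", "content": "Summarize that advice in two sentences."}],
    session_id=session_id,
    stream=False,
)
print(follow_up.output_message.content)
```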

#### Approach 2: Responses API

In another terminal, use the standard OpenAI client with the Responses API:

```python
import io, requests

@@ -35,7 +92,7 @@ from openai import OpenAI
url = "https://www.paulgraham.com/greatwork.html"
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

-# Create vector store - auto-detects default embedding model
# Create vector store
vs = client.vector_stores.create()

response = requests.get(url)

@@ -43,17 +100,59 @@ pseudo_file = io.BytesIO(str(response.content).encode('utf-8'))
file_id = client.files.create(file=(url, pseudo_file, "text/html"), purpose="assistants").id
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file_id)

# Automatic tool calling (calls Responses API directly)
resp = client.responses.create(
    model="gpt-4o",
-    input="How do you do great work? Use the existing knowledge_search tool.",
    input="How do you do great work?",
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
    include=["file_search_call.results"],
)

print(resp.output[-1].content[-1].text)
```
-Which should give output like:
-```

**How it works:**
- Server-side API with automatic tool calling
- Uses internal Python API for vector search
- No built-in session management (stateless by default; see the sketch after this list)
- Best for: Single-turn queries, OpenAI-compatible applications
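
Because the Responses call above is stateless, carrying a conversation forward means passing the earlier exchange back in yourself. The following is a minimal sketch of one way to do that, reusing `client`, `vs`, and `resp` from the example above; the message-list form of `input` follows the OpenAI Responses API and is an assumption to verify against your Llama Stack version.

```python
# Sketch: manually carrying context into a follow-up request (stateless API).
previous_answer = resp.output[-1].content[-1].text

follow_up = client.responses.create(
    model="gpt-4o",
    input=[
        {"role": "user", "content": "How do you do great work?"},
        {"role": "assistant", "content": previous_answer},
        {"role": "user", "content": "Condense that into three bullet points."},
    ],
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
)
print(follow_up.output[-1].content[-1].text)
```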

#### Approach 3: Chat Completions API

The **Chat Completions API** is a server-side API that gives you explicit control over retrieval and generation. Best for custom RAG pipelines and batch processing.

```python
import io, requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

# Create vector store and add documents
vs = client.vector_stores.create()
# ... upload and add files ...

# Explicitly search vector store via REST API
query = "How do you do great work?"
search_results = client.vector_stores.search(
    vector_store_id=vs.id,
    query=query,
    max_num_results=3,
)

# Manually extract context (each result's content is a list of content objects)
context = "\n\n".join(
    item.text
    for r in search_results.data
    for item in (r.content or [])
    if getattr(item, "text", None)
)

# Manually construct prompt with context
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Use the provided context to answer questions."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)

print(completion.choices[0].message.content)
```

Example output:

Doing great work is about more than just hard work and ambition; it involves combining several elements:

1. **Pursue What Excites You**: Engage in projects that are both ambitious and exciting to you. It's important to work on something you have a natural aptitude for and a deep interest in.
@@ -4,24 +4,132 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

"""
Demo script showing RAG with both Responses API and Chat Completions API.
-import io, requests

This example demonstrates two approaches to RAG with Llama Stack:
1. Responses API - Automatic agentic tool calling with file search
2. Chat Completions API - Manual retrieval with explicit control

Run this script after starting a Llama Stack server:
    llama stack run starter
"""

import io

import requests
from openai import OpenAI

-url="https://www.paulgraham.com/greatwork.html"
# Initialize OpenAI client pointing to Llama Stack server
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

# Shared setup: Create vector store and upload document
print("=" * 80)
print("SETUP: Creating vector store and uploading document")
print("=" * 80)

url = "https://www.paulgraham.com/greatwork.html"
print(f"Fetching document from: {url}")

vs = client.vector_stores.create()
print(f"Vector store created: {vs.id}")

response = requests.get(url)
-pseudo_file = io.BytesIO(str(response.content).encode('utf-8'))
-uploaded_file = client.files.create(file=(url, pseudo_file, "text/html"), purpose="assistants")
pseudo_file = io.BytesIO(str(response.content).encode("utf-8"))
uploaded_file = client.files.create(
    file=(url, pseudo_file, "text/html"), purpose="assistants"
)
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
print(f"File uploaded and added to vector store: {uploaded_file.id}")

query = "How do you do great work?"

# ============================================================================
# APPROACH 1: Responses API (Recommended for most use cases)
# ============================================================================
print("\n" + "=" * 80)
print("APPROACH 1: Responses API (Automatic Tool Calling)")
print("=" * 80)
print(f"Query: {query}\n")

resp = client.responses.create(
-    model="openai/gpt-4o",
-    input="How do you do great work? Use the existing knowledge_search tool.",
    model="ollama/llama3.2:3b",  # feel free to change this to any other model
    input=query,
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
    include=["file_search_call.results"],
)

-print(resp)
print("Response (Responses API):")
print("-" * 80)
print(resp.output[-1].content[-1].text)
print("-" * 80)

# ============================================================================
# APPROACH 2: Chat Completions API
# ============================================================================
print("\n" + "=" * 80)
print("APPROACH 2: Chat Completions API (Manual Retrieval)")
print("=" * 80)
print(f"Query: {query}\n")

# Step 1: Search vector store explicitly
print("Searching vector store...")
search_results = client.vector_stores.search(
    vector_store_id=vs.id, query=query, max_num_results=3, rewrite_query=False
)

# Step 2: Extract context from search results
context_chunks = []
for result in search_results.data:
    # result.content is a list of Content objects, extract the text from each
    if hasattr(result, "content") and result.content:
        for content_item in result.content:
            if hasattr(content_item, "text") and content_item.text:
                context_chunks.append(content_item.text)

context = "\n\n".join(context_chunks)
print(f"Found {len(context_chunks)} relevant chunks\n")

# Step 3: Use Chat Completions with retrieved context
print("Generating response with chat completions...")
completion = client.chat.completions.create(
    model="ollama/llama3.2:3b",  # Feel free to change this to any other model
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Use the provided context to answer the user's question.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nPlease provide a comprehensive answer based on the context above.",
        },
    ],
    temperature=0.7,
)

print("Response (Chat Completions API):")
print("-" * 80)
print(completion.choices[0].message.content)
print("-" * 80)

# ============================================================================
# Summary
# ============================================================================
print("\n" + "=" * 80)
print("SUMMARY")
print("=" * 80)
print(
    """
Both approaches successfully performed RAG:

1. Responses API:
   - Automatic tool calling (model decides when to search)
   - Simpler code, less control
   - Best for: Conversational agents, automatic workflows

2. Chat Completions API:
   - Manual retrieval (you control the search)
   - More code, more control
   - Best for: Custom RAG patterns, batch processing, specialized workflows
"""
)
@@ -220,6 +220,20 @@ Methods:

## VectorIo

:::warning DEPRECATED API

**This API is deprecated and will be removed in a future version.**

Use the OpenAI-compatible [Vector Stores API](#vectorstores) instead:
- Instead of `client.vector_io.insert()`, use `client.vector_stores.files.create()` and `client.vector_stores.files.chunks.create()`
- Instead of `client.vector_io.query()`, use `client.vector_stores.search()`

See the [RAG documentation](/docs/building_applications/rag) for migration examples; a brief sketch also follows this notice.

Related: [Issue #2981](https://github.com/meta-llama/llama-stack/issues/2981)
:::
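
As a rough illustration of the mapping above, here is a hedged sketch of migrating an ingestion-and-query flow. The old `vector_io` calls appear only as comments, the store name and local file are hypothetical, and the parameter names follow the OpenAI-compatible surface used elsewhere in these docs.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Old (deprecated), shown for comparison only:
# client.vector_io.insert(vector_db_id="my_docs", chunks=chunks)
# results = client.vector_io.query(vector_db_id="my_docs", query="how do you do great work?")

# New: OpenAI-compatible Vector Stores API
vs = client.vector_stores.create(name="my_docs")  # hypothetical store name

with open("notes.txt", "rb") as f:  # hypothetical local file
    uploaded = client.files.create(file=f, purpose="assistants")

client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded.id)

results = client.vector_stores.search(
    vector_store_id=vs.id,
    query="how do you do great work?",
    max_num_results=3,
)
```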

Types:

```python

@@ -233,6 +247,22 @@ Methods:

## VectorDBs

:::warning DEPRECATED API

**This API is deprecated and will be removed in a future version.**

Use the OpenAI-compatible [Vector Stores API](#vectorstores) instead:
- Instead of `client.vector_dbs.register()`, use `client.vector_stores.create()`
- Instead of `client.vector_dbs.list()`, use `client.vector_stores.list()`
- Instead of `client.vector_dbs.retrieve()`, use `client.vector_stores.retrieve()`
- Instead of `client.vector_dbs.unregister()`, use `client.vector_stores.delete()`

See the [RAG documentation](/docs/building_applications/rag) for migration examples; a brief sketch also follows this notice.

Related: [Issue #2981](https://github.com/meta-llama/llama-stack/issues/2981)
:::
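
The lifecycle mapping above translates roughly one-to-one. The sketch below is illustrative only: the store name is hypothetical, and the page shape returned by `list()` may differ slightly by client version.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# client.vector_dbs.register(...)  ->  vector_stores.create()
vs = client.vector_stores.create(name="my_docs")  # hypothetical name

# client.vector_dbs.list()  ->  vector_stores.list()
stores = client.vector_stores.list()
for store in getattr(stores, "data", stores):  # handle page or plain list
    print(store.id, store.name)

# client.vector_dbs.retrieve(...)  ->  vector_stores.retrieve()
same_store = client.vector_stores.retrieve(vector_store_id=vs.id)

# client.vector_dbs.unregister(...)  ->  vector_stores.delete()
client.vector_stores.delete(vector_store_id=vs.id)
```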

Types:

```python