Mirror of https://github.com/meta-llama/llama-stack.git, synced 2025-12-16 23:32:38 +00:00
docs: Adding initial updates to the RAG documentation and examples (#4377)
Some checks failed
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 2s
Integration Tests (Replay) / generate-matrix (push) Successful in 4s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s
Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped
API Conformance Tests / check-schema-compatibility (push) Successful in 12s
Python Package Build Test / build (3.12) (push) Successful in 18s
Python Package Build Test / build (3.13) (push) Successful in 22s
Test External API and Providers / test-external (venv) (push) Failing after 37s
Vector IO Integration Tests / test-matrix (push) Failing after 46s
UI Tests / ui-tests (22) (push) Successful in 1m23s
Unit Tests / unit-tests (3.12) (push) Failing after 1m48s
Unit Tests / unit-tests (3.13) (push) Failing after 1m50s
Pre-commit / pre-commit (22) (push) Successful in 3m31s
Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 4m20s
# What does this PR do?

This PR updates the RAG examples in docs/quick_start.ipynb, docs/getting_started/demo_script.py, rag.mdx, and index.md to remove references to the deprecated vector_io and vector_db APIs and to add examples that use /v1/vector_stores with responses and completions.

---------

Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
This commit is contained in:
parent 75ef052545
commit dfb9f6743a

4 changed files with 625 additions and 69 deletions
@@ -24,9 +24,66 @@ llama stack list-deps starter | xargs -L1 uv pip install

llama stack run starter
```

### 2. Connect with OpenAI Client
### 2. Choose Your Approach

Llama Stack supports various approaches for building RAG applications. The server provides two APIs (Responses and Chat Completions), plus a high-level client wrapper (Agent class):

#### Approach 1: Agent Class (Client-Side)

The **Agent class** is a high-level client wrapper around the Responses API with automatic tool execution and session management. Best for conversational agents and multi-turn RAG.

```python
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
import requests
from io import BytesIO

client = LlamaStackClient(base_url="http://localhost:8321")

# Create vector store
vs = client.vector_stores.create(name="my_vector_db")

# Upload document
url = "https://www.paulgraham.com/greatwork.html"
response = requests.get(url)
file_buffer = BytesIO(response.content)
file_buffer.name = "greatwork.html"

file = client.files.create(file=file_buffer, purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file.id)

# Create agent with file_search tool (client-side wrapper)
agent = Agent(
    client,
    model="ollama/llama3.2:3b",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": [vs.id],  # Agent searches this automatically
        }
    ],
)

# Just ask - agent handles retrieval automatically
response = agent.create_turn(
    messages=[{"role": "user", "content": "How do you do great work?"}],
    session_id=agent.create_session("my_session"),
    stream=True,
)

for log in AgentEventLogger().log(response):
    print(log, end="")
```

**How it works:**
- Client-side `Agent` class wraps the Responses API
- Agent automatically decides when to search the vector store
- Uses internal Python API for vector search (no HTTP overhead)
- Maintains conversation context across turns
- Best for: Interactive applications, chatbots, multi-turn conversations
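Because the session carries context across turns, a follow-up question can build on the previous answer without restating it. A minimal sketch reusing the `agent` from the example above; the follow-up prompt is illustrative:

```python
# Keep a single session so the agent remembers earlier turns
session_id = agent.create_session("multi_turn_session")

for prompt in ["How do you do great work?", "Summarize that in one sentence."]:
    turn = agent.create_turn(
        messages=[{"role": "user", "content": prompt}],
        session_id=session_id,
        stream=True,
    )
    for log in AgentEventLogger().log(turn):
        print(log, end="")
```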
#### Approach 2: Responses API

In another terminal, use the standard OpenAI client with the Responses API:

```python
import io, requests

@@ -35,7 +92,7 @@ from openai import OpenAI

url = "https://www.paulgraham.com/greatwork.html"
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

# Create vector store - auto-detects default embedding model
# Create vector store
vs = client.vector_stores.create()

response = requests.get(url)

@@ -43,17 +100,59 @@ pseudo_file = io.BytesIO(str(response.content).encode('utf-8'))

file_id = client.files.create(file=(url, pseudo_file, "text/html"), purpose="assistants").id
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file_id)

# Automatic tool calling (calls Responses API directly)
resp = client.responses.create(
    model="gpt-4o",
    input="How do you do great work? Use the existing knowledge_search tool.",
    input="How do you do great work?",
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
    include=["file_search_call.results"],
)

print(resp.output[-1].content[-1].text)
```
Which should give output like:
```

**How it works:**
- Server-side API with automatic tool calling
- Uses internal Python API for vector search
- No built-in session management (stateless by default)
- Best for: Single-turn queries, OpenAI-compatible applications
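Since the request passes `include=["file_search_call.results"]`, the response's `output` list carries the tool call alongside the final message. A small sketch for inspecting it; the item types (`file_search_call`, `message`) follow the OpenAI Responses API shape and are an assumption here:

```python
# Walk the output items to see the tool call next to the answer.
# Item types follow the OpenAI Responses API shape (an assumption here).
for item in resp.output:
    if item.type == "file_search_call":
        print("file_search status:", item.status)
    elif item.type == "message":
        print("answer:", item.content[-1].text)
```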
#### Approach 3: Chat Completions API

The **Chat Completions API** is a server-side API that gives you explicit control over retrieval and generation. Best for custom RAG pipelines and batch processing.

```python
import io, requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

# Create vector store and add documents
vs = client.vector_stores.create()
# ... upload and add files ...

# Explicitly search vector store via REST API
query = "How do you do great work?"
search_results = client.vector_stores.search(
    vector_store_id=vs.id,
    query=query,
    max_num_results=3,
)

# Manually extract context (each result's content is a list of text chunks)
context = "\n\n".join(
    item.text
    for r in search_results.data if r.content
    for item in r.content if item.text
)

# Manually construct prompt with context
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Use the provided context to answer questions."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)

print(completion.choices[0].message.content)
```

Doing great work is about more than just hard work and ambition; it involves combining several elements:

1. **Pursue What Excites You**: Engage in projects that are both ambitious and exciting to you. It's important to work on something you have a natural aptitude for and a deep interest in.
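Because retrieval and generation are decoupled in this approach, the same steps extend naturally to batch processing. A minimal sketch wrapping them in a helper; the function name and the second query are illustrative:

```python
def answer_with_rag(client, vector_store_id: str, question: str, model: str = "gpt-4o") -> str:
    """Search the vector store, then answer from the retrieved context."""
    results = client.vector_stores.search(
        vector_store_id=vector_store_id, query=question, max_num_results=3
    )
    context = "\n\n".join(
        item.text for r in results.data if r.content for item in r.content if item.text
    )
    completion = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Use the provided context to answer questions."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content


# Batch example: several queries against the same store
for q in ["How do you do great work?", "What role does curiosity play?"]:
    print(q, "->", answer_with_rag(client, vs.id, q))
```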
@@ -4,24 +4,132 @@

# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

"""
Demo script showing RAG with both Responses API and Chat Completions API.

import io, requests
This example demonstrates two approaches to RAG with Llama Stack:
1. Responses API - Automatic agentic tool calling with file search
2. Chat Completions API - Manual retrieval with explicit control

Run this script after starting a Llama Stack server:
    llama stack run starter
"""

import io

import requests
from openai import OpenAI

url="https://www.paulgraham.com/greatwork.html"
# Initialize OpenAI client pointing to Llama Stack server
client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")

# Shared setup: Create vector store and upload document
print("=" * 80)
print("SETUP: Creating vector store and uploading document")
print("=" * 80)

url = "https://www.paulgraham.com/greatwork.html"
print(f"Fetching document from: {url}")

vs = client.vector_stores.create()
print(f"Vector store created: {vs.id}")

response = requests.get(url)
pseudo_file = io.BytesIO(str(response.content).encode('utf-8'))
uploaded_file = client.files.create(file=(url, pseudo_file, "text/html"), purpose="assistants")
pseudo_file = io.BytesIO(str(response.content).encode("utf-8"))
uploaded_file = client.files.create(
    file=(url, pseudo_file, "text/html"), purpose="assistants"
)
client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
print(f"File uploaded and added to vector store: {uploaded_file.id}")

query = "How do you do great work?"

# ============================================================================
# APPROACH 1: Responses API (Recommended for most use cases)
# ============================================================================
print("\n" + "=" * 80)
print("APPROACH 1: Responses API (Automatic Tool Calling)")
print("=" * 80)
print(f"Query: {query}\n")

resp = client.responses.create(
    model="openai/gpt-4o",
    input="How do you do great work? Use the existing knowledge_search tool.",
    model="ollama/llama3.2:3b",  # feel free to change this to any other model
    input=query,
    tools=[{"type": "file_search", "vector_store_ids": [vs.id]}],
    include=["file_search_call.results"],
)

print(resp)
print("Response (Responses API):")
print("-" * 80)
print(resp.output[-1].content[-1].text)
print("-" * 80)
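Note that `resp.output[-1]` assumes the assistant message is the last output item. A defensive variant, sketched under the assumption that output items expose an OpenAI-style `type` field:

```python
# Defensive extraction: find the assistant message instead of assuming
# it is the last output item ("message" is the OpenAI-style type name).
answer = next(
    (item.content[-1].text for item in reversed(resp.output) if item.type == "message"),
    None,
)
if answer is None:
    raise RuntimeError("No assistant message found in the response output")
print(answer)
```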
# ============================================================================
# APPROACH 2: Chat Completions API
# ============================================================================
print("\n" + "=" * 80)
print("APPROACH 2: Chat Completions API (Manual Retrieval)")
print("=" * 80)
print(f"Query: {query}\n")

# Step 1: Search vector store explicitly
print("Searching vector store...")
search_results = client.vector_stores.search(
    vector_store_id=vs.id, query=query, max_num_results=3, rewrite_query=False
)

# Step 2: Extract context from search results
context_chunks = []
for result in search_results.data:
    # result.content is a list of Content objects, extract the text from each
    if hasattr(result, "content") and result.content:
        for content_item in result.content:
            if hasattr(content_item, "text") and content_item.text:
                context_chunks.append(content_item.text)

context = "\n\n".join(context_chunks)
print(f"Found {len(context_chunks)} relevant chunks\n")

# Step 3: Use Chat Completions with retrieved context
print("Generating response with chat completions...")
completion = client.chat.completions.create(
    model="ollama/llama3.2:3b",  # Feel free to change this to any other model
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant. Use the provided context to answer the user's question.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nPlease provide a comprehensive answer based on the context above.",
        },
    ],
    temperature=0.7,
)

print("Response (Chat Completions API):")
print("-" * 80)
print(completion.choices[0].message.content)
print("-" * 80)

# ============================================================================
# Summary
# ============================================================================
print("\n" + "=" * 80)
print("SUMMARY")
print("=" * 80)
print(
    """
Both approaches successfully performed RAG:

1. Responses API:
   - Automatic tool calling (model decides when to search)
   - Simpler code, less control
   - Best for: Conversational agents, automatic workflows

2. Chat Completions API:
   - Manual retrieval (you control the search)
   - More code, more control
   - Best for: Custom RAG patterns, batch processing, specialized workflows
"""
)
@@ -220,6 +220,20 @@ Methods:

## VectorIo

:::warning DEPRECATED API

**This API is deprecated and will be removed in a future version.**

Use the OpenAI-compatible [Vector Stores API](#vectorstores) instead:
- Instead of `client.vector_io.insert()`, use `client.vector_stores.files.create()` and `client.vector_stores.files.chunks.create()`
- Instead of `client.vector_io.query()`, use `client.vector_stores.search()`

See the [RAG documentation](/docs/building_applications/rag) for migration examples.

Related: [Issue #2981](https://github.com/meta-llama/llama-stack/issues/2981)

:::
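A before/after sketch of the data-plane migration. The deprecated `vector_io` signatures shown in comments are approximate, and `doc.html` is a placeholder file:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Before (deprecated, approximate signatures):
# client.vector_io.insert(vector_db_id="my_db", chunks=chunks)
# client.vector_io.query(vector_db_id="my_db", query="How do you do great work?")

# After (OpenAI-compatible Vector Stores API):
vs = client.vector_stores.create(name="my_db")
file = client.files.create(file=open("doc.html", "rb"), purpose="assistants")
client.vector_stores.files.create(vector_store_id=vs.id, file_id=file.id)
results = client.vector_stores.search(
    vector_store_id=vs.id, query="How do you do great work?", max_num_results=3
)
```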
Types:

```python
@@ -233,6 +247,22 @@ Methods:

## VectorDBs

:::warning DEPRECATED API

**This API is deprecated and will be removed in a future version.**

Use the OpenAI-compatible [Vector Stores API](#vectorstores) instead:
- Instead of `client.vector_dbs.register()`, use `client.vector_stores.create()`
- Instead of `client.vector_dbs.list()`, use `client.vector_stores.list()`
- Instead of `client.vector_dbs.retrieve()`, use `client.vector_stores.retrieve()`
- Instead of `client.vector_dbs.unregister()`, use `client.vector_stores.delete()`

See the [RAG documentation](/docs/building_applications/rag) for migration examples.

Related: [Issue #2981](https://github.com/meta-llama/llama-stack/issues/2981)

:::
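And the corresponding lifecycle migration, continuing from the client above, again with approximate signatures for the deprecated `vector_dbs` calls:

```python
# Before (deprecated, approximate signatures):
# client.vector_dbs.register(vector_db_id="my_db", embedding_model="all-MiniLM-L6-v2")
# client.vector_dbs.retrieve(vector_db_id="my_db")
# client.vector_dbs.unregister(vector_db_id="my_db")

# After (OpenAI-compatible Vector Stores API):
vs = client.vector_stores.create(name="my_db")          # replaces register()
print([s.id for s in client.vector_stores.list()])      # replaces list()
client.vector_stores.retrieve(vector_store_id=vs.id)    # replaces retrieve()
client.vector_stores.delete(vector_store_id=vs.id)      # replaces unregister()
```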
Types:

```python
@@ -2,7 +2,6 @@
"cells": [
{
"cell_type": "markdown",
"id": "c1e7571c",
"metadata": {
"id": "c1e7571c"
},

@@ -23,7 +22,6 @@
},
{
"cell_type": "markdown",
"id": "4CV1Q19BDMVw",
"metadata": {
"id": "4CV1Q19BDMVw"
},

@@ -33,7 +31,6 @@
},
{
"cell_type": "markdown",
"id": "K4AvfUAJZOeS",
"metadata": {
"id": "K4AvfUAJZOeS"
},

@@ -46,7 +43,6 @@
{
"cell_type": "code",
"execution_count": null,
"id": "7a2d7b85",
"metadata": {},
"outputs": [],
"source": [

@@ -61,7 +57,6 @@
},
{
"cell_type": "markdown",
"id": "39fa584b",
"metadata": {},
"source": [
"### 1.2. Test inference with Ollama"

@@ -69,7 +64,6 @@
},
{
"cell_type": "markdown",
"id": "3bf81522",
"metadata": {},
"source": [
"We’ll now launch a terminal and run inference on a Llama model with Ollama to verify that the model is working correctly."

@@ -78,7 +72,6 @@
{
"cell_type": "code",
"execution_count": null,
"id": "a7e8e0f1",
"metadata": {},
"outputs": [],
"source": [

@@ -92,7 +85,6 @@
},
{
"cell_type": "markdown",
"id": "f3c5f243",
"metadata": {},
"source": [
"If successful, you should see the model respond to a prompt.\n",

@@ -106,7 +98,6 @@
},
{
"cell_type": "markdown",
"id": "oDUB7M_qe-Gs",
"metadata": {
"id": "oDUB7M_qe-Gs"
},

@@ -118,7 +109,6 @@
},
{
"cell_type": "markdown",
"id": "732eadc6",
"metadata": {},
"source": [
"### 2.1. Setup the Llama Stack Server"

@@ -127,7 +117,6 @@
{
"cell_type": "code",
"execution_count": 1,
"id": "J2kGed0R5PSf",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"

@@ -206,7 +195,6 @@
},
{
"cell_type": "markdown",
"id": "c40e9efd",
"metadata": {},
"source": [
"### 2.2. Start the Llama Stack Server"

@@ -215,7 +203,6 @@
{
"cell_type": "code",
"execution_count": 2,
"id": "f779283d",
"metadata": {},
"outputs": [
{
@@ -235,26 +222,76 @@
},
{
"cell_type": "markdown",
"id": "28477c03",
"metadata": {},
"source": [
"## Step 3: Run the demo"
"## Step 3: RAG Demos - Three Approaches\n",
"\n",
"We'll demonstrate three different approaches to building RAG applications with Llama Stack:\n",
"1. **Agent API** - High-level agent with session management\n",
"2. **Responses API** - Direct OpenAI-compatible responses\n",
"3. **Chat Completions API** - Manual retrieval with explicit control\n",
"\n",
"### Approach 1: Agent Class (High-level)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "7da71011",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/models \"HTTP/1.1 200 OK\"\n",
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/files \"HTTP/1.1 200 OK\"\n",
"INFO:httpx:HTTP Request: GET http://0.0.0.0:8321/v1/models \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Available models: ['bedrock/meta.llama3-1-405b-instruct-v1:0', 'bedrock/meta.llama3-1-70b-instruct-v1:0', 'bedrock/meta.llama3-1-8b-instruct-v1:0', 'ollama/chevalblanc/gpt-4o-mini:latest', 'ollama/nomic-embed-text:latest', 'ollama/llama3.3:70b', 'ollama/llama3.2:3b', 'ollama/all-minilm:l6-v2', 'ollama/llama3.1:8b', 'ollama/llama-guard3:latest', 'ollama/llama-guard3:8b', 'ollama/shieldgemma:27b', 'ollama/shieldgemma:latest', 'ollama/llama3.1:8b-instruct-fp16', 'ollama/all-minilm:latest', 'ollama/llama3.2:3b-instruct-fp16', 'sentence-transformers/nomic-ai/nomic-embed-text-v1.5']\n",
"✓ Using model: ollama/llama3.3:70b\n",
"\n",
"✓ Downloading and indexing Paul Graham's essay...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/files \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"✓ File created with ID: file-e1290f8be28245e681bdfa5c40a7e7c4\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector_stores \"HTTP/1.1 200 OK\"\n",
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/conversations \"HTTP/1.1 200 OK\"\n",
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/conversations \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"✓ Vector store created with ID: vs_67efaaf4-ba0d-4037-b816-f73d588e9e4d\n",
"✓ Agent created\n",
"\n",
"prompt> How do you do great work?\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses \"HTTP/1.1 200 OK\"\n"
]
},
@@ -262,57 +299,68 @@
"name": "stdout",
"output_type": "stream",
"text": [
"prompt> How do you do great work?\n",
"🤔 Doing great work involves a combination of skills, habits, and mindsets. Here are some key principles:\n",
"🤔 \n",
"\n",
"1. **Set Clear Goals**: Start with a clear vision of what you want to achieve. Define specific, measurable, achievable, relevant, and time-bound (SMART) goals.\n",
"\n",
"2. **Plan and Prioritize**: Break your goals into smaller, manageable tasks. Prioritize these tasks based on their importance and urgency.\n",
"\n",
"3. **Focus on Quality**: Aim for high-quality outcomes rather than just finishing tasks. Pay attention to detail, and ensure your work meets or exceeds standards.\n",
"\n",
"4. **Stay Organized**: Keep your workspace, both physical and digital, organized to help you stay focused and efficient.\n",
"\n",
"5. **Manage Your Time**: Use time management techniques such as the Pomodoro Technique, time blocking, or the Eisenhower Box to maximize productivity.\n",
"\n",
"6. **Seek Feedback and Learn**: Regularly seek feedback from peers, mentors, or supervisors. Use constructive criticism to improve continuously.\n",
"\n",
"7. **Innovate and Improve**: Look for ways to improve processes or introduce new ideas. Be open to change and willing to adapt.\n",
"\n",
"8. **Stay Motivated and Persistent**: Keep your end goals in mind to stay motivated. Overcome setbacks with resilience and persistence.\n",
"\n",
"9. **Balance and Rest**: Ensure you maintain a healthy work-life balance. Take breaks and manage stress to sustain long-term productivity.\n",
"\n",
"10. **Reflect and Adjust**: Regularly assess your progress and adjust your strategies as needed. Reflect on what works well and what doesn't.\n",
"\n",
"By incorporating these elements, you can consistently produce high-quality work and achieve excellence in your endeavors.\n"
"🔧 Executing file_search (server-side)...\n",
"🤔 To do great work it's essential to decide what to work on and choose something you have a natural aptitude for that you are deeply interested in and offers scope to do great work <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Develop a habit of working on your own projects and don't let \"work\" mean something other people tell you to do <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Seek out the best colleagues as they can encourage you and help bounce ideas off each other and it's better to have one or two great ones than a building full of pretty good ones <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Husband your morale as it's crucial for doing great work and try to learn about other kinds of work by taking ideas from distant fields if you let them be metaphors <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Negative examples can also be inspiring so try to learn from things done badly as sometimes it becomes clear what's needed when it's missing <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. If you're earnest you'll probably get a warmer welcome than expected when visiting places with the best people in your field which can increase your ambition and self-confidence <|file-e1290f8be28245e681bdfa5c40a7e7c4|>.\n"
]
}
],
"source": [
"from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient\n",
"\n",
"# Make sure that your llama stack client version matches the llama stack server version you are using.\n",
"from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient\n",
"import requests\n",
"from io import BytesIO\n",
"\n",
"\n",
"\n",
"vector_store_id = \"my_demo_vector_db\"\n",
"client = LlamaStackClient(base_url=\"http://0.0.0.0:8321\")\n",
"\n",
"models = client.models.list()\n",
"# Get model - find any Ollama Llama model\n",
"models = list(client.models.list())\n",
"print(f\"Available models: {[m.id for m in models]}\")\n",
"\n",
"# Select the first ollama and first ollama's embedding model\n",
"model_id = next(m for m in models if m.model_type == \"llm\" and m.provider_id == \"ollama\").identifier\n",
"# Find any Ollama Llama LLM model\n",
"model_id = None\n",
"priority_models = [\"ollama/llama3.3:70b\",\"ollama/llama3.2:3b\",\"ollama/llama3.1:8b\"]\n",
"for m in models:\n",
"    if hasattr(m, \"custom_metadata\") and m.custom_metadata:\n",
"        provider_id = m.custom_metadata.get(\"provider_id\")\n",
"        model_type = m.custom_metadata.get(\"model_type\")\n",
"\n",
"        # Pick the first Ollama LLM whose id is in the priority list\n",
"        if provider_id == \"ollama\" and model_type == \"llm\" and m.id.lower() in priority_models:\n",
"            model_id = m.id\n",
"            print(f\"✓ Using model: {model_id}\")\n",
"            break\n",
"\n",
"if not model_id:\n",
"    raise ValueError(\"No Ollama Llama model found\")\n",
"\n",
"# Create vector store\n",
"print(\"\\n✓ Downloading and indexing Paul Graham's essay...\")\n",
"source = \"https://www.paulgraham.com/greatwork.html\"\n",
"response = requests.get(source)\n",
"\n",
"# Create a file-like object from the HTML content\n",
"file_buffer = BytesIO(response.content)\n",
"file_buffer.name = \"greatwork.html\"\n",
"\n",
"file = client.files.create(\n",
"    file=response.content,\n",
"    file=file_buffer,\n",
"    purpose='assistants'\n",
")\n",
"print(f\"✓ File created with ID: {file.id}\")\n",
"\n",
"vector_store = client.vector_stores.create(\n",
"    name=vector_store_id,\n",
"    file_ids=[file.id],\n",
")\n",
"print(f\"✓ Vector store created with ID: {vector_store.id}\")\n",
"\n",
"# Create agent\n",
"agent = Agent(\n",
"    client,\n",
"    model=model_id,\n",
@@ -320,13 +368,14 @@
"    tools=[\n",
"        {\n",
"            \"type\": \"file_search\",\n",
"            \"vector_store_ids\": [vector_store_id],\n",
"            \"vector_store_ids\": [vector_store.id],  # Use the actual ID, not the name\n",
"        }\n",
"    ],\n",
")\n",
"print(\"✓ Agent created\")\n",
"\n",
"prompt = \"How do you do great work?\"\n",
"print(\"prompt>\", prompt)\n",
"print(\"\\nprompt>\", prompt)\n",
"\n",
"response = agent.create_turn(\n",
"    messages=[{\"role\": \"user\", \"content\": prompt}],\n",
@@ -340,15 +389,283 @@
},
{
"cell_type": "markdown",
"id": "341aaadf",
"metadata": {},
"source": [
"Congratulations! You've successfully built your first RAG application using Llama Stack! 🎉🥳"
"#### Multi-turn RAG Conversation with Session Management"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/conversations \"HTTP/1.1 200 OK\"\n",
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"================================================================================\n",
"Multi-turn RAG Conversation Demo\n",
"================================================================================\n",
"Demonstrating: Session maintains context while agent searches document\n",
"================================================================================\n",
"\n",
"[Turn 1] User: What does the document say about curiosity and great work?\n",
"(Agent will search the document...)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Assistant: 🤔 \n",
"\n",
"🔧 Executing file_search (server-side)...\n",
"🤔 Curiosity is a key factor in doing great work <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. It drives people to learn and explore new ideas, which can lead to innovative solutions and discoveries <|file-e12...\n",
"\n",
"[Turn 2] User: Why is that important?\n",
"(Agent remembers 'that' refers to curiosity from Turn 1 - no need to search again)\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Assistant: 🤔 \n",
"\n",
"🔧 Executing file_search (server-side)...\n",
"🤔 Curiosity plays a crucial role in driving individuals to do great work and make meaningful discoveries <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. It is the key to all four steps in doing great work: choo...\n",
"\n",
"[Turn 3] User: What about the role of ambition?\n",
"(New topic - agent will search document again for 'ambition')\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/responses \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Assistant: 🤔 \n",
"\n",
"🔧 Executing file_search (server-side)...\n",
"🤔 To do great work it's an advantage to be optimistic even though that means you'll risk looking like a fool sometimes <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. One way to avoid intellectual dishonesty is...\n",
"\n",
"[Turn 4] User: How do curiosity and ambition work together?\n",
"(Agent combines information from Turn 1 and Turn 3 using session context)\n",
"\n",
"Assistant: 🤔 \n",
"\n",
"🔧 Executing file_search (server-side)...\n",
"🤔 Curiosity and ambition are closely related as they both drive individuals to achieve great work <|file-e1290f8be28245e681bdfa5c40a7e7c4|>. Developing curiosity is essential for doing great work, and it c...\n",
"\n"
]
}
],
"source": [
"# Create a new session for multi-turn RAG conversation\n",
"session_id = agent.create_session(\"multi_turn_rag_session\")\n",
"\n",
"print(\"\\n\" + \"=\"*80)\n",
"print(\"Multi-turn RAG Conversation Demo\")\n",
"print(\"=\"*80)\n",
"print(\"Demonstrating: Session maintains context while agent searches document\")\n",
"print(\"=\"*80)\n",
"\n",
"# Turn 1: Initial question - Agent searches document for relevant information\n",
"print(\"\\n[Turn 1] User: What does the document say about curiosity and great work?\")\n",
"print(\"(Agent will search the document...)\")\n",
"response1 = agent.create_turn(\n",
"    messages=[{\"role\": \"user\", \"content\": \"What does the document say about curiosity and great work?\"}],\n",
"    session_id=session_id,\n",
"    stream=True,  # Use streaming for reliability\n",
")\n",
"# Collect the response\n",
"response1_text = \"\"\n",
"for log in AgentEventLogger().log(response1):\n",
"    response1_text += log\n",
"print(\"\\nAssistant:\", response1_text[:250] + \"...\\n\")\n",
"\n",
"# Turn 2: Follow-up question using pronouns - Agent remembers the context from Turn 1\n",
"print(\"[Turn 2] User: Why is that important?\")\n",
"print(\"(Agent remembers 'that' refers to curiosity from Turn 1 - no need to search again)\")\n",
"response2 = agent.create_turn(\n",
"    messages=[{\"role\": \"user\", \"content\": \"Why is that important?\"}],\n",
"    session_id=session_id,\n",
"    stream=True,  # Use streaming for reliability\n",
")\n",
"response2_text = \"\"\n",
"for log in AgentEventLogger().log(response2):\n",
"    response2_text += log\n",
"print(\"\\nAssistant:\", response2_text[:250] + \"...\\n\")\n",
"\n",
"# Turn 3: New question on different topic - Agent performs new document search\n",
"print(\"[Turn 3] User: What about the role of ambition?\")\n",
"print(\"(New topic - agent will search document again for 'ambition')\")\n",
"response3 = agent.create_turn(\n",
"    messages=[{\"role\": \"user\", \"content\": \"What about the role of ambition?\"}],\n",
"    session_id=session_id,\n",
"    stream=True,  # Use streaming for reliability\n",
")\n",
"response3_text = \"\"\n",
"for log in AgentEventLogger().log(response3):\n",
"    response3_text += log\n",
"print(\"\\nAssistant:\", response3_text[:250] + \"...\\n\")\n",
"\n",
"# Turn 4: Compare previous topics - Agent uses session memory\n",
"print(\"[Turn 4] User: How do curiosity and ambition work together?\")\n",
"print(\"(Agent combines information from Turn 1 and Turn 3 using session context)\")\n",
"response4 = agent.create_turn(\n",
"    messages=[{\"role\": \"user\", \"content\": \"How do curiosity and ambition work together?\"}],\n",
"    session_id=session_id,\n",
"    stream=True,  # Use streaming for reliability\n",
")\n",
"response4_text = \"\"\n",
"for log in AgentEventLogger().log(response4):\n",
"    response4_text += log\n",
"print(\"\\nAssistant:\", response4_text[:250] + \"...\\n\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Approach 3: Chat Completions API"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/vector_stores/vs_67efaaf4-ba0d-4037-b816-f73d588e9e4d/search \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"User Query: What does paul graham say about curiosity and great work?\n",
"\n",
"Searching vector store...\n",
"Using vector store ID: vs_67efaaf4-ba0d-4037-b816-f73d588e9e4d\n",
"Extracting context from search results...\n",
"Found 3 relevant chunks\n",
"\n",
"Response (Chat Completions API):\n",
"================================================================================\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"INFO:httpx:HTTP Request: POST http://0.0.0.0:8321/v1/chat/completions \"HTTP/1.1 200 OK\"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"According to Paul Graham, curiosity is a crucial factor in doing great work. He emphasizes that curiosity is the best guide for finding something worth working on, and it plays a significant role in all four steps of doing great work: choosing a field, getting to the frontier, noticing gaps, and exploring them.\n",
"\n",
"Graham notes that curiosity is not something that can be commanded, but it can be nurtured and allowed to drive one's efforts. He suggests that curious people are more likely to find the right thing to work on in the first place, as they cast a wide net and are more likely to stumble upon something important.\n",
"\n",
"Graham also highlights the importance of curiosity in overcoming obstacles and staying motivated. He argues that when working on something that sparks genuine curiosity, the work will feel less burdensome, even if it's challenging. This is because curious people are driven by a desire to learn and understand, rather than just seeking external validation or rewards.\n",
"\n",
"Furthermore, Graham emphasizes that curiosity is a key factor in distinguishing between great work and mediocre work. He notes that people who are truly curious about their work are more likely to produce something original and innovative, whereas those who lack curiosity may simply be going through the motions.\n",
"\n",
"In fact, Graham is so convinced of the importance of curiosity that he suggests it might be the single most important factor in doing great work. He even goes so far as to say that if an oracle were to give a single-word answer to the question of how to do great work, it would be \"curiosity.\"\n",
"\n",
"Overall, Paul Graham's writings suggest that curiosity is essential for doing great work, and that it plays a central role in driving innovation, creativity, and progress. By nurturing curiosity and allowing it to guide their efforts, individuals can increase their chances of producing something truly remarkable and making a meaningful contribution to their field.\n",
"================================================================================\n"
]
}
],
"source": [
"# Step 1: Search vector store explicitly\n",
"prompt = \"What does paul graham say about curiosity and great work?\"\n",
"print(f\"User Query: {prompt}\")\n",
"print(\"\\nSearching vector store...\")\n",
"print(f\"Using vector store ID: {vector_store.id}\")\n",
"search_results = client.vector_stores.search(\n",
"    vector_store_id=vector_store.id,  # Use the actual ID, not the name\n",
"    query=prompt,\n",
"    max_num_results=3,\n",
"    rewrite_query=False\n",
")\n",
"\n",
"# Step 2: Extract context from search results\n",
"print(\"Extracting context from search results...\")\n",
"context_chunks = []\n",
"for result in search_results.data:\n",
"    if hasattr(result, \"content\") and result.content:\n",
"        for content_item in result.content:\n",
"            if hasattr(content_item, \"text\") and content_item.text:\n",
"                context_chunks.append(content_item.text)\n",
"\n",
"context = \"\\n\\n\".join(context_chunks)\n",
"print(f\"Found {len(context_chunks)} relevant chunks\\n\")\n",
"\n",
"# Step 3: Use Chat Completions with retrieved context\n",
"print(\"Response (Chat Completions API):\")\n",
"print(\"=\"*80)\n",
"\n",
"completion = client.chat.completions.create(\n",
"    model=model_id,\n",
"    messages=[\n",
"        {\n",
"            \"role\": \"system\",\n",
"            \"content\": \"You are a helpful assistant. Use the provided context to answer the user's question.\",\n",
"        },\n",
"        {\n",
"            \"role\": \"user\",\n",
"            \"content\": f\"Context:\\n{context}\\n\\nQuestion: {prompt}\\n\\nPlease provide a comprehensive answer based on the context above.\",\n",
"        },\n",
"    ],\n",
"    temperature=0.7,\n",
")\n",
"\n",
"print(completion.choices[0].message.content)\n",
"print(\"=\"*80)"
]
},
{
"cell_type": "markdown",
"id": "e88e1185",
"metadata": {},
"source": [
"## Next Steps"
@@ -356,7 +673,6 @@
},
{
"cell_type": "markdown",
"id": "bcb73600",
"metadata": {},
"source": [
"Now you're ready to dive deeper into Llama Stack!\n",

@@ -376,6 +692,9 @@
"gpuType": "T4",
"provenance": []
},
"fileHeader": "",
"fileUid": "92b9a30f-53a7-4b4f-8cfa-0fed1619256f",
"isAdHoc": false,
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",

@@ -391,9 +710,9 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.12"
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
"nbformat_minor": 4
}