Merge branch 'main' into content-extension

Francisco Arceo 2025-08-25 14:22:15 -06:00 committed by GitHub
commit 3e11e1472c
334 changed files with 22841 additions and 8940 deletions

View file

@ -1,6 +1,20 @@
# Integration Testing Guide
Integration tests verify complete workflows across different providers using Llama Stack's record-replay system.
## Quick Start
```bash
# Run all integration tests with existing recordings
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
uv run --group test \
pytest -sv tests/integration/ --stack-config=starter
```
## Configuration Options
You can see all options with:
```bash
cd tests/integration
@ -10,11 +24,11 @@ pytest --help
Here are the most important options:
- `--stack-config`: specify the stack config to use. You have four ways to point to a stack:
- **`server:<config>`** - automatically start a server with the given config (e.g., `server:starter`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
- **`server:<config>:<port>`** - same as above but with a custom port (e.g., `server:starter:8322`)
- a URL which points to a Llama Stack distribution server
- a distribution name (e.g., `starter`) or a path to a `run.yaml` file
- a comma-separated list of api=provider pairs, e.g. `inference=ollama,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.
- `--env`: set environment variables, e.g. `--env KEY=value`. This is a utility option to set environment variables required by various providers (see the example below).
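For example, a sketch of a typical invocation (the URL value is illustrative) that passes a provider setting to the `starter` config via `--env`:
```bash
pytest -s -v tests/integration/inference/ \
  --stack-config=starter \
  --env OLLAMA_URL=http://localhost:11434
```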
Model parameters can be influenced by the following options:
@ -32,83 +46,139 @@ if no model is specified.
### Testing against a Server
Run all text inference tests by auto-starting a server with the `starter` config:
```bash
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=server:starter \
--text-model=ollama/llama3.2:3b-instruct-fp16 \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
Run tests with auto-server startup on a custom port:
```bash
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/inference/ \
--stack-config=server:starter:8322 \
--text-model=ollama/llama3.2:3b-instruct-fp16 \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
### Testing with Library Client
The library client constructs the Stack "in-process" instead of using a server. This is useful during the iterative development process since you don't need to constantly start and stop servers.
You can do this by simply using `--stack-config=starter` instead of `--stack-config=server:starter`.
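For example, this mirrors the server example above but runs everything in-process (a sketch; same models, no `server:` prefix):
```bash
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/inference/test_text_inference.py \
  --stack-config=starter \
  --text-model=ollama/llama3.2:3b-instruct-fp16 \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```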
### Using ad-hoc distributions
Sometimes, you may want to make up a distribution on the fly. This is useful for testing a single provider or a single API or a small combination of providers. You can do so by specifying a comma-separated list of api=provider pairs to the `--stack-config` option, e.g. `inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference`.
```bash
TEXT_MODELS=meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct
VISION_MODELS=meta-llama/Llama-3.2-11B-Vision-Instruct
EMBEDDING_MODELS=all-MiniLM-L6-v2
pytest -s -v tests/integration/inference/ \
--stack-config=inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference \
--text-model=$TEXT_MODELS \
--vision-model=$VISION_MODELS \
--embedding-model=$EMBEDDING_MODELS
```
Another example: Running Vector IO tests for embedding models:
```bash
uv run pytest -sv --stack-config="inference=inline::sentence-transformers,vector_io=inline::sqlite-vec,files=localfs" \
tests/integration/vector_io --embedding-model \
sentence-transformers/all-MiniLM-L6-v2
```
## Recording Modes
The testing system supports three modes controlled by environment variables:
### LIVE Mode (Default)
Tests make real API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
```
### RECORD Mode
Captures API interactions for later replay:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
pytest tests/integration/inference/test_new_feature.py
```
### REPLAY Mode
Uses cached responses instead of making API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
pytest tests/integration/
```
Note that right now you must specify the recording directory. This is because different tests use different recording directories and we don't (yet) have a fool-proof way to map a test to a recording directory. We are working on this.
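For example, if a suite keeps its recordings in a dedicated subdirectory (both paths below are illustrative, not guaranteed to exist), point the variable at that directory when replaying that suite:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings/vision \
pytest -sv tests/integration/inference/test_vision_inference.py --stack-config=starter
```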
## Managing Recordings
### Viewing Recordings
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;"
# Inspect specific response
cat recordings/responses/abc123.json | jq '.'
```
### Re-recording Tests
#### Remote Re-recording (Recommended)
Use the automated workflow script for easier re-recording:
```bash
./scripts/github/schedule-record-workflow.sh --test-subdirs "inference,agents"
```
See the [main testing guide](../README.md#remote-re-recording-recommended) for full details.
#### Local Re-recording
```bash
# Re-record specific tests
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
pytest -s -v --stack-config=server:starter tests/integration/inference/test_modified.py
```
Note that when re-recording tests, you must use a Stack pointing to a server (i.e., `server:starter`). This subtlety exists because the set of tests run in server mode is a superset of the set of tests run with the library client.
## Writing Tests
### Basic Test Pattern
```python
def test_basic_completion(llama_stack_client, text_model_id):
response = llama_stack_client.inference.completion(
model_id=text_model_id,
content=CompletionMessage(role="user", content="Hello"),
)
# Test structure, not AI output quality
assert response.completion_message is not None
assert isinstance(response.completion_message.content, str)
assert len(response.completion_message.content) > 0
```
### Provider-Specific Tests
```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
pytest.skip(f"Model {embedding_model_id} doesn't support task types")
query_response = llama_stack_client.inference.embeddings(
model_id=embedding_model_id,
contents=["What is machine learning?"],
task_type="query",
)
assert query_response.embeddings is not None
```

View file

@ -133,24 +133,15 @@ def test_agent_simple(llama_stack_client, agent_config):
assert "I can't" in logs_str
@pytest.mark.skip(reason="this test was disabled for a long time, and now has turned flaky")
def test_agent_name(llama_stack_client, text_model_id):
agent_name = f"test-agent-{uuid4()}"
try:
agent = Agent(
llama_stack_client,
model=text_model_id,
instructions="You are a helpful assistant",
name=agent_name,
)
except TypeError:
agent = Agent(
llama_stack_client,
model=text_model_id,
instructions="You are a helpful assistant",
)
return
agent = Agent(
llama_stack_client,
model=text_model_id,
instructions="You are a helpful assistant",
name=agent_name,
)
session_id = agent.create_session(f"test-session-{uuid4()}")
agent.create_turn(

View file

@ -0,0 +1,5 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

View file

@ -0,0 +1,122 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""Shared pytest fixtures for batch tests."""
import json
import time
import warnings
from contextlib import contextmanager
from io import BytesIO
import pytest
from llama_stack.apis.files import OpenAIFilePurpose
class BatchHelper:
"""Helper class for creating and managing batch input files."""
def __init__(self, client):
"""Initialize with either a batch_client or openai_client."""
self.client = client
@contextmanager
def create_file(self, content: str | list[dict], filename_prefix="batch_input"):
"""Context manager for creating and cleaning up batch input files.
Args:
content: Either a list of batch request dictionaries or raw string content
filename_prefix: Prefix for the generated filename (or full filename if content is string)
Yields:
The uploaded file object
"""
if isinstance(content, str):
# Handle raw string content (e.g., malformed JSONL, empty files)
file_content = content.encode("utf-8")
else:
# Handle list of batch request dictionaries
jsonl_content = "\n".join(json.dumps(req) for req in content)
file_content = jsonl_content.encode("utf-8")
filename = filename_prefix if filename_prefix.endswith(".jsonl") else f"{filename_prefix}.jsonl"
with BytesIO(file_content) as file_buffer:
file_buffer.name = filename
uploaded_file = self.client.files.create(file=file_buffer, purpose=OpenAIFilePurpose.BATCH)
try:
yield uploaded_file
finally:
try:
self.client.files.delete(uploaded_file.id)
except Exception:
warnings.warn(
f"Failed to cleanup file {uploaded_file.id}: {uploaded_file.filename}",
stacklevel=2,
)
def wait_for(
self,
batch_id: str,
max_wait_time: int = 60,
sleep_interval: int | None = None,
expected_statuses: set[str] | None = None,
timeout_action: str = "fail",
):
"""Wait for a batch to reach a terminal status.
Args:
batch_id: The batch ID to monitor
max_wait_time: Maximum time to wait in seconds (default: 60 seconds)
sleep_interval: Time to sleep between checks in seconds (default: 1/10th of max_wait_time, min 1s, max 15s)
expected_statuses: Set of expected terminal statuses (default: {"completed"})
timeout_action: Action on timeout - "fail" (pytest.fail) or "skip" (pytest.skip)
Returns:
The final batch object
Raises:
pytest.Failed: If batch reaches an unexpected status or timeout_action is "fail"
pytest.Skipped: If timeout_action is "skip" on timeout or unexpected status
"""
if sleep_interval is None:
# Default to 1/10th of max_wait_time, with min 1s and max 15s
sleep_interval = max(1, min(15, max_wait_time // 10))
if expected_statuses is None:
expected_statuses = {"completed"}
terminal_statuses = {"completed", "failed", "cancelled", "expired"}
unexpected_statuses = terminal_statuses - expected_statuses
start_time = time.time()
while time.time() - start_time < max_wait_time:
current_batch = self.client.batches.retrieve(batch_id)
if current_batch.status in expected_statuses:
return current_batch
elif current_batch.status in unexpected_statuses:
error_msg = f"Batch reached unexpected status: {current_batch.status}"
if timeout_action == "skip":
pytest.skip(error_msg)
else:
pytest.fail(error_msg)
time.sleep(sleep_interval)
timeout_msg = f"Batch did not reach expected status {expected_statuses} within {max_wait_time} seconds"
if timeout_action == "skip":
pytest.skip(timeout_msg)
else:
pytest.fail(timeout_msg)
@pytest.fixture
def batch_helper(openai_client):
"""Fixture that provides a BatchHelper instance for OpenAI client."""
return BatchHelper(openai_client)

View file

@ -0,0 +1,270 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""
Integration tests for the Llama Stack batch processing functionality.
This module contains comprehensive integration tests for the batch processing API,
using the OpenAI-compatible client interface for consistency.
Test Categories:
1. Core Batch Operations:
- test_batch_creation_and_retrieval: Comprehensive batch creation, structure validation, and retrieval
- test_batch_listing: Basic batch listing functionality
- test_batch_immediate_cancellation: Batch cancellation workflow
# TODO: cancel during processing
2. End-to-End Processing:
- test_batch_e2e_chat_completions: Full chat completions workflow with output and error validation
Note: Error conditions and edge cases are primarily tested in test_batches_errors.py
for better organization and separation of concerns.
CLEANUP WARNING: These tests currently create batches that are not automatically
cleaned up after test completion. This may lead to resource accumulation over
multiple test runs. Only test_batch_immediate_cancellation properly cancels its batch.
The test_batch_e2e_chat_completions test does clean up its output and error files.
"""
import json
class TestBatchesIntegration:
"""Integration tests for the batches API."""
def test_batch_creation_and_retrieval(self, openai_client, batch_helper, text_model_id):
"""Test comprehensive batch creation and retrieval scenarios."""
test_metadata = {
"test_type": "comprehensive",
"purpose": "creation_and_retrieval_test",
"version": "1.0",
"tags": "test,batch",
}
batch_requests = [
{
"custom_id": "request-1",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests, "batch_creation_test") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
metadata=test_metadata,
)
assert batch.endpoint == "/v1/chat/completions"
assert batch.input_file_id == uploaded_file.id
assert batch.completion_window == "24h"
assert batch.metadata == test_metadata
retrieved_batch = openai_client.batches.retrieve(batch.id)
assert retrieved_batch.id == batch.id
assert retrieved_batch.object == batch.object
assert retrieved_batch.endpoint == batch.endpoint
assert retrieved_batch.input_file_id == batch.input_file_id
assert retrieved_batch.completion_window == batch.completion_window
assert retrieved_batch.metadata == batch.metadata
def test_batch_listing(self, openai_client, batch_helper, text_model_id):
"""
Test batch listing.
This test creates multiple batches and verifies that they can be listed.
It also deletes the input files before execution, which means the batches
will appear as failed due to missing input files. This is expected and
a good thing, because it means no inference is performed.
"""
batch_ids = []
for i in range(2):
batch_requests = [
{
"custom_id": f"request-{i}",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": f"Hello {i}"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests, f"batch_input_{i}") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
batch_ids.append(batch.id)
batch_list = openai_client.batches.list()
assert isinstance(batch_list.data, list)
listed_batch_ids = {b.id for b in batch_list.data}
for batch_id in batch_ids:
assert batch_id in listed_batch_ids
def test_batch_immediate_cancellation(self, openai_client, batch_helper, text_model_id):
"""Test immediate batch cancellation."""
batch_requests = [
{
"custom_id": "request-1",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
# hopefully cancel the batch before it completes
cancelling_batch = openai_client.batches.cancel(batch.id)
assert cancelling_batch.status in ["cancelling", "cancelled"]
assert isinstance(cancelling_batch.cancelling_at, int), (
f"cancelling_at should be int, got {type(cancelling_batch.cancelling_at)}"
)
final_batch = batch_helper.wait_for(
batch.id,
max_wait_time=3 * 60, # often takes 10-11 minutes, give it 3 min
expected_statuses={"cancelled"},
timeout_action="skip",
)
assert final_batch.status == "cancelled"
assert isinstance(final_batch.cancelled_at, int), (
f"cancelled_at should be int, got {type(final_batch.cancelled_at)}"
)
def test_batch_e2e_chat_completions(self, openai_client, batch_helper, text_model_id):
"""Test end-to-end batch processing for chat completions with both successful and failed operations."""
batch_requests = [
{
"custom_id": "success-1",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 20,
},
},
{
"custom_id": "error-1",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"rolez": "user", "contentz": "This should fail"}], # Invalid keys to trigger error
# note: ollama does not validate max_tokens values or the "role" key, so they won't trigger an error
},
},
]
with batch_helper.create_file(batch_requests) as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={"test": "e2e_success_and_errors_test"},
)
final_batch = batch_helper.wait_for(
batch.id,
max_wait_time=3 * 60, # often takes 2-3 minutes
expected_statuses={"completed"},
timeout_action="skip",
)
# Expecting a completed batch with both successful and failed requests
# Batch(id='batch_xxx',
# completion_window='24h',
# created_at=...,
# endpoint='/v1/chat/completions',
# input_file_id='file-xxx',
# object='batch',
# status='completed',
# output_file_id='file-xxx',
# error_file_id='file-xxx',
# request_counts=BatchRequestCounts(completed=1, failed=1, total=2))
assert final_batch.status == "completed"
assert final_batch.request_counts is not None
assert final_batch.request_counts.total == 2
assert final_batch.request_counts.completed == 1
assert final_batch.request_counts.failed == 1
assert final_batch.output_file_id is not None, "Output file should exist for successful requests"
output_content = openai_client.files.content(final_batch.output_file_id)
if isinstance(output_content, str):
output_text = output_content
else:
output_text = output_content.content.decode("utf-8")
output_lines = output_text.strip().split("\n")
for line in output_lines:
result = json.loads(line)
assert "id" in result
assert "custom_id" in result
assert result["custom_id"] == "success-1"
assert "response" in result
assert result["response"]["status_code"] == 200
assert "body" in result["response"]
assert "choices" in result["response"]["body"]
assert final_batch.error_file_id is not None, "Error file should exist for failed requests"
error_content = openai_client.files.content(final_batch.error_file_id)
if isinstance(error_content, str):
error_text = error_content
else:
error_text = error_content.content.decode("utf-8")
error_lines = error_text.strip().split("\n")
for line in error_lines:
result = json.loads(line)
assert "id" in result
assert "custom_id" in result
assert result["custom_id"] == "error-1"
assert "error" in result
error = result["error"]
assert error is not None
assert "code" in error or "message" in error, "Error should have code or message"
deleted_output_file = openai_client.files.delete(final_batch.output_file_id)
assert deleted_output_file.deleted, f"Output file {final_batch.output_file_id} was not deleted successfully"
deleted_error_file = openai_client.files.delete(final_batch.error_file_id)
assert deleted_error_file.deleted, f"Error file {final_batch.error_file_id} was not deleted successfully"

View file

@ -0,0 +1,693 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""
Error handling and edge case tests for the Llama Stack batch processing functionality.
This module focuses exclusively on testing error conditions, validation failures,
and edge cases for batch operations to ensure robust error handling and graceful
degradation.
Test Categories:
1. File and Input Validation:
- test_batch_nonexistent_file_id: Handling invalid file IDs
- test_batch_malformed_jsonl: Processing malformed JSONL input files
- test_file_malformed_batch_file: Handling malformed files at upload time
- test_batch_missing_required_fields: Validation of required request fields
2. API Endpoint and Model Validation:
- test_batch_invalid_endpoint: Invalid endpoint handling during creation
- test_batch_error_handling_invalid_model: Error handling with nonexistent models
- test_batch_endpoint_mismatch: Validation of endpoint/URL consistency
3. Batch Lifecycle Error Handling:
- test_batch_retrieve_nonexistent: Retrieving non-existent batches
- test_batch_cancel_nonexistent: Cancelling non-existent batches
- test_batch_cancel_completed: Attempting to cancel completed batches
4. Parameter and Configuration Validation:
- test_batch_invalid_completion_window: Invalid completion window values
- test_batch_invalid_metadata_types: Invalid metadata type validation
- test_batch_missing_required_body_fields: Validation of required fields in request body
5. Feature Restriction and Compatibility:
- test_batch_streaming_not_supported: Streaming request rejection
- test_batch_mixed_streaming_requests: Mixed streaming/non-streaming validation
Note: Core functionality and OpenAI compatibility tests are located in
test_batches_integration.py for better organization and separation of concerns.
CLEANUP WARNING: These tests create batches to test error conditions but do not
automatically clean them up after test completion. While most error tests create
batches that fail quickly, some may create valid batches that consume resources.
"""
import pytest
from openai import BadRequestError, ConflictError, NotFoundError
class TestBatchesErrorHandling:
"""Error handling and edge case tests for the batches API using OpenAI client."""
def test_batch_nonexistent_file_id(self, openai_client, batch_helper):
"""Test batch creation with nonexistent input file ID."""
batch = openai_client.batches.create(
input_file_id="file-nonexistent-xyz",
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(
# code='invalid_request',
# line=None,
# message='Cannot find file ..., or organization ... does not have access to it.',
# param='file_id')
# ], object='list'),
# failed_at=1754566971,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.code == "invalid_request"
assert "cannot find file" in error.message.lower()
def test_batch_invalid_endpoint(self, openai_client, batch_helper, text_model_id):
"""Test batch creation with invalid endpoint."""
batch_requests = [
{
"custom_id": "invalid-endpoint",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
with pytest.raises(BadRequestError) as exc_info:
openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/invalid/endpoint",
completion_window="24h",
)
# Expected -
# Error code: 400 - {
# 'error': {
# 'message': "Invalid value: '/v1/invalid/endpoint'. Supported values are: '/v1/chat/completions', '/v1/completions', '/v1/embeddings', and '/v1/responses'.",
# 'type': 'invalid_request_error',
# 'param': 'endpoint',
# 'code': 'invalid_value'
# }
# }
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 400
assert "invalid value" in error_msg
assert "/v1/invalid/endpoint" in error_msg
assert "supported values" in error_msg
assert "endpoint" in error_msg
assert "invalid_value" in error_msg
def test_batch_malformed_jsonl(self, openai_client, batch_helper):
"""
Test batch with malformed JSONL input.
The /v1/files endpoint requires valid JSONL format, so we provide a well formed line
before a malformed line to ensure we get to the /v1/batches validation stage.
"""
with batch_helper.create_file(
"""{"custom_id": "valid", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "test"}}
{invalid json here""",
"malformed_batch_input.jsonl",
) as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# ...,
# BatchError(code='invalid_json_line',
# line=2,
# message='This line is not parseable as valid JSON.',
# param=None)
# ], object='list'),
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) > 0
error = final_batch.errors.data[-1] # get last error because first may be about the "test" model
assert error.code == "invalid_json_line"
assert error.line == 2
assert "not" in error.message.lower()
assert "valid json" in error.message.lower()
@pytest.mark.xfail(reason="Not all file providers validate content")
@pytest.mark.parametrize("batch_requests", ["", "{malformed json"], ids=["empty", "malformed"])
def test_file_malformed_batch_file(self, openai_client, batch_helper, batch_requests):
"""Test file upload with malformed content."""
with pytest.raises(BadRequestError) as exc_info:
with batch_helper.create_file(batch_requests, "malformed_batch_input_file.jsonl"):
# /v1/files rejects the file, we don't get to batch creation
pass
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 400
assert "invalid file format" in error_msg
assert "jsonl" in error_msg
def test_batch_retrieve_nonexistent(self, openai_client):
"""Test retrieving nonexistent batch."""
with pytest.raises(NotFoundError) as exc_info:
openai_client.batches.retrieve("batch-nonexistent-xyz")
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 404
assert "no batch found" in error_msg or "not found" in error_msg
def test_batch_cancel_nonexistent(self, openai_client):
"""Test cancelling nonexistent batch."""
with pytest.raises(NotFoundError) as exc_info:
openai_client.batches.cancel("batch-nonexistent-xyz")
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 404
assert "no batch found" in error_msg or "not found" in error_msg
def test_batch_cancel_completed(self, openai_client, batch_helper, text_model_id):
"""Test cancelling already completed batch."""
batch_requests = [
{
"custom_id": "cancel-completed",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Quick test"}],
"max_tokens": 5,
},
}
]
with batch_helper.create_file(batch_requests, "cancel_test_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(
batch.id,
max_wait_time=3 * 60, # often take 10-11 min, give it 3 min
expected_statuses={"completed"},
timeout_action="skip",
)
deleted_file = openai_client.files.delete(final_batch.output_file_id)
assert deleted_file.deleted, f"File {final_batch.output_file_id} was not deleted successfully"
with pytest.raises(ConflictError) as exc_info:
openai_client.batches.cancel(batch.id)
# Expecting -
# Error code: 409 - {
# 'error': {
# 'message': "Cannot cancel a batch with status 'completed'.",
# 'type': 'invalid_request_error',
# 'param': None,
# 'code': None
# }
# }
#
# NOTE: Same for "failed", cancelling "cancelled" batches is allowed
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 409
assert "cannot cancel" in error_msg
def test_batch_missing_required_fields(self, openai_client, batch_helper, text_model_id):
"""Test batch with requests missing required fields."""
batch_requests = [
{
# Missing custom_id
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "No custom_id"}],
"max_tokens": 10,
},
},
{
"custom_id": "no-method",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "No method"}],
"max_tokens": 10,
},
},
{
"custom_id": "no-url",
"method": "POST",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "No URL"}],
"max_tokens": 10,
},
},
{
"custom_id": "no-body",
"method": "POST",
"url": "/v1/chat/completions",
},
]
with batch_helper.create_file(batch_requests, "missing_fields_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(
# data=[
# BatchError(
# code='missing_required_parameter',
# line=1,
# message="Missing required parameter: 'custom_id'.",
# param='custom_id'
# ),
# BatchError(
# code='missing_required_parameter',
# line=2,
# message="Missing required parameter: 'method'.",
# param='method'
# ),
# BatchError(
# code='missing_required_parameter',
# line=3,
# message="Missing required parameter: 'url'.",
# param='url'
# ),
# BatchError(
# code='missing_required_parameter',
# line=4,
# message="Missing required parameter: 'body'.",
# param='body'
# )
# ], object='list'),
# failed_at=1754566945,
# ...)
# )
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 4
no_custom_id_error = final_batch.errors.data[0]
assert no_custom_id_error.code == "missing_required_parameter"
assert no_custom_id_error.line == 1
assert "missing" in no_custom_id_error.message.lower()
assert "custom_id" in no_custom_id_error.message.lower()
no_method_error = final_batch.errors.data[1]
assert no_method_error.code == "missing_required_parameter"
assert no_method_error.line == 2
assert "missing" in no_method_error.message.lower()
assert "method" in no_method_error.message.lower()
no_url_error = final_batch.errors.data[2]
assert no_url_error.code == "missing_required_parameter"
assert no_url_error.line == 3
assert "missing" in no_url_error.message.lower()
assert "url" in no_url_error.message.lower()
no_body_error = final_batch.errors.data[3]
assert no_body_error.code == "missing_required_parameter"
assert no_body_error.line == 4
assert "missing" in no_body_error.message.lower()
assert "body" in no_body_error.message.lower()
def test_batch_invalid_completion_window(self, openai_client, batch_helper, text_model_id):
"""Test batch creation with invalid completion window."""
batch_requests = [
{
"custom_id": "invalid-completion-window",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
for window in ["1h", "48h", "invalid", ""]:
with pytest.raises(BadRequestError) as exc_info:
openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window=window,
)
assert exc_info.value.status_code == 400
error_msg = str(exc_info.value).lower()
assert "error" in error_msg
assert "completion_window" in error_msg
def test_batch_streaming_not_supported(self, openai_client, batch_helper, text_model_id):
"""Test that streaming responses are not supported in batches."""
batch_requests = [
{
"custom_id": "streaming-test",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
"stream": True, # Not supported
},
}
]
with batch_helper.create_file(batch_requests, "streaming_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(code='streaming_unsupported',
# line=1,
# message='Chat Completions: Streaming is not supported in the Batch API.',
# param='body.stream')
# ], object='list'),
# failed_at=1754566965,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.code == "streaming_unsupported"
assert error.line == 1
assert "streaming" in error.message.lower()
assert "not supported" in error.message.lower()
assert error.param == "body.stream"
assert final_batch.failed_at is not None
def test_batch_mixed_streaming_requests(self, openai_client, batch_helper, text_model_id):
"""
Test batch with mixed streaming and non-streaming requests.
This is distinct from test_batch_streaming_not_supported, which tests a single
streaming request, to ensure an otherwise valid batch fails when a single
streaming request is included.
"""
batch_requests = [
{
"custom_id": "valid-non-streaming-request",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello without streaming"}],
"max_tokens": 10,
},
},
{
"custom_id": "streaming-request",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello with streaming"}],
"max_tokens": 10,
"stream": True, # Not supported
},
},
]
with batch_helper.create_file(batch_requests, "mixed_streaming_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(
# code='streaming_unsupported',
# line=2,
# message='Chat Completions: Streaming is not supported in the Batch API.',
# param='body.stream')
# ], object='list'),
# failed_at=1754574442,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.code == "streaming_unsupported"
assert error.line == 2
assert "streaming" in error.message.lower()
assert "not supported" in error.message.lower()
assert error.param == "body.stream"
assert final_batch.failed_at is not None
def test_batch_endpoint_mismatch(self, openai_client, batch_helper, text_model_id):
"""Test batch creation with mismatched endpoint and request URL."""
batch_requests = [
{
"custom_id": "endpoint-mismatch",
"method": "POST",
"url": "/v1/embeddings", # Different from batch endpoint
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
},
}
]
with batch_helper.create_file(batch_requests, "endpoint_mismatch_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions", # Different from request URL
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(
# code='invalid_url',
# line=1,
# message='The URL provided for this request does not match the batch endpoint.',
# param='url')
# ], object='list'),
# failed_at=1754566972,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.line == 1
assert error.code == "invalid_url"
assert "does not match" in error.message.lower()
assert "endpoint" in error.message.lower()
assert final_batch.failed_at is not None
def test_batch_error_handling_invalid_model(self, openai_client, batch_helper):
"""Test batch error handling with invalid model."""
batch_requests = [
{
"custom_id": "invalid-model",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": "nonexistent-model-xyz",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(code='model_not_found',
# line=1,
# message="The provided model 'nonexistent-model-xyz' is not supported by the Batch API.",
# param='body.model')
# ], object='list'),
# failed_at=1754566978,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.line == 1
assert error.code == "model_not_found"
assert "not supported" in error.message.lower()
assert error.param == "body.model"
assert final_batch.failed_at is not None
def test_batch_missing_required_body_fields(self, openai_client, batch_helper, text_model_id):
"""Test batch with requests missing required fields in body (model and messages)."""
batch_requests = [
{
"custom_id": "missing-model",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
# Missing model field
"messages": [{"role": "user", "content": "Hello without model"}],
"max_tokens": 10,
},
},
{
"custom_id": "missing-messages",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
# Missing messages field
"max_tokens": 10,
},
},
]
with batch_helper.create_file(batch_requests, "missing_body_fields_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(
# code='invalid_request',
# line=1,
# message='Model parameter is required.',
# param='body.model'),
# BatchError(
# code='invalid_request',
# line=2,
# message='Messages parameter is required.',
# param='body.messages')
# ], object='list'),
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 2
model_error = final_batch.errors.data[0]
assert model_error.line == 1
assert "model" in model_error.message.lower()
assert model_error.param == "body.model"
messages_error = final_batch.errors.data[1]
assert messages_error.line == 2
assert "messages" in messages_error.message.lower()
assert messages_error.param == "body.messages"
assert final_batch.failed_at is not None
def test_batch_invalid_metadata_types(self, openai_client, batch_helper, text_model_id):
"""Test batch creation with invalid metadata types (like lists)."""
batch_requests = [
{
"custom_id": "invalid-metadata-type",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
with pytest.raises(Exception) as exc_info:
openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={
"tags": ["tag1", "tag2"], # Invalid type, should be a string
},
)
# Expecting -
# Error code: 400 - {'error':
# {'message': "Invalid type for 'metadata.tags': expected a string,
# but got an array instead.",
# 'type': 'invalid_request_error', 'param': 'metadata.tags',
# 'code': 'invalid_type'}}
error_msg = str(exc_info.value).lower()
assert "400" in error_msg
assert "tags" in error_msg
assert "string" in error_msg

View file

@ -0,0 +1,91 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""
Integration tests for batch idempotency functionality using the OpenAI client library.
This module tests the idempotency feature in the batches API using the OpenAI-compatible
client interface. These tests verify that the idempotency key (idempotency_key) works correctly
in a real client-server environment.
Test Categories:
1. Successful Idempotency: Same key returns same batch with identical parameters
- test_idempotent_batch_creation_successful: Verifies that requests with the same
idempotency key return identical batches, even with different metadata order
2. Conflict Detection: Same key with conflicting parameters raises HTTP 409 errors
- test_idempotency_conflict_with_different_params: Verifies that reusing an idempotency key
with truly conflicting parameters (both file ID and metadata values) raises ConflictError
"""
import time
import pytest
from openai import ConflictError
class TestBatchesIdempotencyIntegration:
"""Integration tests for batch idempotency using OpenAI client."""
def test_idempotent_batch_creation_successful(self, openai_client):
"""Test that identical requests with same idempotency key return the same batch."""
batch1 = openai_client.batches.create(
input_file_id="bogus-id",
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={
"test_type": "idempotency_success",
"purpose": "integration_test",
},
extra_body={"idempotency_key": "test-idempotency-token-1"},
)
# sleep to ensure different timestamps
time.sleep(1)
batch2 = openai_client.batches.create(
input_file_id="bogus-id",
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={
"purpose": "integration_test",
"test_type": "idempotency_success",
}, # Different order
extra_body={"idempotency_key": "test-idempotency-token-1"},
)
assert batch1.id == batch2.id
assert batch1.input_file_id == batch2.input_file_id
assert batch1.endpoint == batch2.endpoint
assert batch1.completion_window == batch2.completion_window
assert batch1.metadata == batch2.metadata
assert batch1.created_at == batch2.created_at
def test_idempotency_conflict_with_different_params(self, openai_client):
"""Test that using same idempotency key with different params raises conflict error."""
batch1 = openai_client.batches.create(
input_file_id="bogus-id-1",
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={"test_type": "conflict_test_1"},
extra_body={"idempotency_key": "conflict-token"},
)
with pytest.raises(ConflictError) as exc_info:
openai_client.batches.create(
input_file_id="bogus-id-2", # Different file ID
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={"test_type": "conflict_test_2"}, # Different metadata
extra_body={"idempotency_key": "conflict-token"}, # Same token
)
assert exc_info.value.status_code == 409
assert "conflict" in str(exc_info.value).lower()
retrieved_batch = openai_client.batches.retrieve(batch1.id)
assert retrieved_batch.id == batch1.id
assert retrieved_batch.input_file_id == "bogus-id-1"

View file

@ -8,20 +8,27 @@ from io import BytesIO
from unittest.mock import patch
import pytest
from openai import OpenAI
from llama_stack.core.datatypes import User
from llama_stack.core.library_client import LlamaStackAsLibraryClient
def test_openai_client_basic_operations(compat_client, client_with_models):
# a fixture to skip all these tests if a files provider is not available
@pytest.fixture(autouse=True)
def skip_if_no_files_provider(llama_stack_client):
if not [provider for provider in llama_stack_client.providers.list() if provider.api == "files"]:
pytest.skip("No files providers found")
def test_openai_client_basic_operations(openai_client):
"""Test basic file operations through OpenAI client."""
if isinstance(client_with_models, LlamaStackAsLibraryClient) and isinstance(compat_client, OpenAI):
pytest.skip("OpenAI files are not supported when testing with LlamaStackAsLibraryClient")
client = compat_client
from openai import NotFoundError
client = openai_client
test_content = b"files test content"
uploaded_file = None
try:
# Upload file using OpenAI client
with BytesIO(test_content) as file_buffer:
@ -31,6 +38,7 @@ def test_openai_client_basic_operations(compat_client, client_with_models):
# Verify basic response structure
assert uploaded_file.id.startswith("file-")
assert hasattr(uploaded_file, "filename")
assert uploaded_file.filename == "openai_test.txt"
# List files
files_list = client.files.list()
@ -43,37 +51,41 @@ def test_openai_client_basic_operations(compat_client, client_with_models):
# Retrieve file content - OpenAI client returns httpx Response object
content_response = client.files.content(uploaded_file.id)
# The response is an httpx Response object with .content attribute containing bytes
if isinstance(content_response, str):
# Llama Stack Client returns a str
# TODO: fix Llama Stack Client
content = bytes(content_response, "utf-8")
else:
content = content_response.content
assert content == test_content
assert content_response.content == test_content
# Delete file
delete_response = client.files.delete(uploaded_file.id)
assert delete_response.deleted is True
except Exception as e:
# Cleanup in case of failure
try:
# Retrieve file should fail
with pytest.raises(NotFoundError, match="not found"):
client.files.retrieve(uploaded_file.id)
# File should not be found in listing
files_list = client.files.list()
file_ids = [f.id for f in files_list.data]
assert uploaded_file.id not in file_ids
# Double delete should fail
with pytest.raises(NotFoundError, match="not found"):
client.files.delete(uploaded_file.id)
except Exception:
pass
raise e
finally:
# Cleanup in case of failure
if uploaded_file is not None:
try:
client.files.delete(uploaded_file.id)
except NotFoundError:
pass # ignore 404
@pytest.mark.xfail(message="User isolation broken for current providers, must be fixed.")
@patch("llama_stack.providers.utils.sqlstore.authorized_sqlstore.get_authenticated_user")
def test_files_authentication_isolation(mock_get_authenticated_user, compat_client, client_with_models):
def test_files_authentication_isolation(mock_get_authenticated_user, llama_stack_client):
"""Test that users can only access their own files."""
if isinstance(client_with_models, LlamaStackAsLibraryClient) and isinstance(compat_client, OpenAI):
pytest.skip("OpenAI files are not supported when testing with LlamaStackAsLibraryClient")
if not isinstance(client_with_models, LlamaStackAsLibraryClient):
pytest.skip("Authentication tests require LlamaStackAsLibraryClient (library mode)")
from llama_stack_client import NotFoundError
client = compat_client
client = llama_stack_client
# Create two test users
user1 = User("user1", {"roles": ["user"], "teams": ["team-a"]})
@ -117,7 +129,7 @@ def test_files_authentication_isolation(mock_get_authenticated_user, compat_clie
# User 1 cannot retrieve user2's file
mock_get_authenticated_user.return_value = user1
with pytest.raises(ValueError, match="not found"):
with pytest.raises(NotFoundError, match="not found"):
client.files.retrieve(user2_file.id)
# User 1 can access their file content
@ -131,7 +143,7 @@ def test_files_authentication_isolation(mock_get_authenticated_user, compat_clie
# User 1 cannot access user2's file content
mock_get_authenticated_user.return_value = user1
with pytest.raises(ValueError, match="not found"):
with pytest.raises(NotFoundError, match="not found"):
client.files.content(user2_file.id)
# User 1 can delete their own file
@ -141,7 +153,7 @@ def test_files_authentication_isolation(mock_get_authenticated_user, compat_clie
# User 1 cannot delete user2's file
mock_get_authenticated_user.return_value = user1
with pytest.raises(ValueError, match="not found"):
with pytest.raises(NotFoundError, match="not found"):
client.files.delete(user2_file.id)
# User 2 can still access their file after user1's file is deleted
@ -169,14 +181,9 @@ def test_files_authentication_isolation(mock_get_authenticated_user, compat_clie
@patch("llama_stack.providers.utils.sqlstore.authorized_sqlstore.get_authenticated_user")
def test_files_authentication_shared_attributes(mock_get_authenticated_user, compat_client, client_with_models):
def test_files_authentication_shared_attributes(mock_get_authenticated_user, llama_stack_client):
"""Test access control with users having identical attributes."""
if isinstance(client_with_models, LlamaStackAsLibraryClient) and isinstance(compat_client, OpenAI):
pytest.skip("OpenAI files are not supported when testing with LlamaStackAsLibraryClient")
if not isinstance(client_with_models, LlamaStackAsLibraryClient):
pytest.skip("Authentication tests require LlamaStackAsLibraryClient (library mode)")
client = compat_client
client = llama_stack_client
# Create users with identical attributes (required for default policy)
user_a = User("user-a", {"roles": ["user"], "teams": ["shared-team"]})
@ -231,14 +238,8 @@ def test_files_authentication_shared_attributes(mock_get_authenticated_user, com
@patch("llama_stack.providers.utils.sqlstore.authorized_sqlstore.get_authenticated_user")
def test_files_authentication_anonymous_access(mock_get_authenticated_user, compat_client, client_with_models):
"""Test anonymous user behavior when no authentication is present."""
if isinstance(client_with_models, LlamaStackAsLibraryClient) and isinstance(compat_client, OpenAI):
pytest.skip("OpenAI files are not supported when testing with LlamaStackAsLibraryClient")
if not isinstance(client_with_models, LlamaStackAsLibraryClient):
pytest.skip("Authentication tests require LlamaStackAsLibraryClient (library mode)")
client = compat_client
def test_files_authentication_anonymous_access(mock_get_authenticated_user, llama_stack_client):
client = llama_stack_client
# Simulate anonymous user (no authentication)
mock_get_authenticated_user.return_value = None

View file

@ -256,15 +256,25 @@ def instantiate_llama_stack_client(session):
provider_data=get_provider_data(),
skip_logger_removal=True,
)
if not client.initialize():
raise RuntimeError("Initialization failed")
return client
@pytest.fixture(scope="session")
def openai_client(client_with_models):
base_url = f"{client_with_models.base_url}/v1/openai/v1"
def require_server(llama_stack_client):
"""
Skip test if no server is running.
We use the llama_stack_client to tell if a server was started or not.
We use this with openai_client because it relies on a running server.
"""
if isinstance(llama_stack_client, LlamaStackAsLibraryClient):
pytest.skip("No server running")
@pytest.fixture(scope="session")
def openai_client(llama_stack_client, require_server):
base_url = f"{llama_stack_client.base_url}/v1/openai/v1"
return OpenAI(base_url=base_url, api_key="fake")

View file

@ -55,7 +55,7 @@
#
import pytest
from llama_stack_client import BadRequestError
from llama_stack_client import BadRequestError as LlamaStackBadRequestError
from llama_stack_client.types import EmbeddingsResponse
from llama_stack_client.types.shared.interleaved_content import (
ImageContentItem,
@ -63,6 +63,9 @@ from llama_stack_client.types.shared.interleaved_content import (
ImageContentItemImageURL,
TextContentItem,
)
from openai import BadRequestError as OpenAIBadRequestError
from llama_stack.core.library_client import LlamaStackAsLibraryClient
DUMMY_STRING = "hello"
DUMMY_STRING2 = "world"
@ -203,7 +206,14 @@ def test_embedding_truncation_error(
):
if inference_provider_type not in SUPPORTED_PROVIDERS:
pytest.xfail(f"{inference_provider_type} doesn't support embedding model yet")
with pytest.raises(BadRequestError):
# Using LlamaStackClient from llama_stack_client will raise llama_stack_client.BadRequestError
# While using LlamaStackAsLibraryClient from llama_stack.distribution.library_client will raise the error that the backend raises
error_type = (
OpenAIBadRequestError
if isinstance(llama_stack_client, LlamaStackAsLibraryClient)
else LlamaStackBadRequestError
)
with pytest.raises(error_type):
llama_stack_client.inference.embeddings(
model_id=embedding_model_id,
contents=[DUMMY_LONG_TEXT],
@ -283,7 +293,8 @@ def test_embedding_text_truncation_error(
):
if inference_provider_type not in SUPPORTED_PROVIDERS:
pytest.xfail(f"{inference_provider_type} doesn't support embedding model yet")
with pytest.raises(BadRequestError):
error_type = ValueError if isinstance(llama_stack_client, LlamaStackAsLibraryClient) else LlamaStackBadRequestError
with pytest.raises(error_type):
llama_stack_client.inference.embeddings(
model_id=embedding_model_id,
contents=[DUMMY_STRING],

View file

@ -5,7 +5,6 @@
# the root directory of this source tree.
import os
import re
from pathlib import Path
import pytest
@ -48,19 +47,6 @@ def _load_all_verification_configs():
return {"providers": all_provider_configs}
def case_id_generator(case):
"""Generate a test ID from the case's 'case_id' field, or use a default."""
case_id = case.get("case_id")
if isinstance(case_id, str | int):
        return re.sub(r"\W|^(?=\d)", "_", str(case_id))
return None
# Helper to get the base test name from the request object
def get_base_test_name(request):
return request.node.originalname
# --- End Helper Functions ---
@ -127,8 +113,6 @@ def openai_client(base_url, api_key, provider):
raise ValueError(f"Invalid config for Llama Stack: {provider}, it must be of the form 'stack:<config>'")
config = parts[1]
client = LlamaStackAsLibraryClient(config, skip_logger_removal=True)
if not client.initialize():
raise RuntimeError("Initialization failed")
return client
return OpenAI(

View file

@ -1,16 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from pathlib import Path
import yaml
def load_test_cases(name: str):
fixture_dir = Path(__file__).parent / "test_cases"
yaml_path = fixture_dir / f"{name}.yaml"
with open(yaml_path) as f:
return yaml.safe_load(f)

View file

@ -0,0 +1,262 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from typing import Any
import pytest
from pydantic import BaseModel
class ResponsesTestCase(BaseModel):
# Input can be a simple string or complex message structure
input: str | list[dict[str, Any]]
expected: str
# Tools as flexible dict structure (gets validated at runtime by the API)
tools: list[dict[str, Any]] | None = None
# Multi-turn conversations with input/output pairs
turns: list[tuple[str | list[dict[str, Any]], str]] | None = None
# File search specific fields
file_content: str | None = None
file_path: str | None = None
# Streaming flag
stream: bool | None = None
# Basic response test cases
basic_test_cases = [
pytest.param(
ResponsesTestCase(
input="Which planet do humans live on?",
expected="earth",
),
id="earth",
),
pytest.param(
ResponsesTestCase(
input="Which planet has rings around it with a name starting with letter S?",
expected="saturn",
),
id="saturn",
),
pytest.param(
ResponsesTestCase(
input=[
{
"role": "user",
"content": [
{
"type": "input_text",
"text": "what teams are playing in this image?",
}
],
},
{
"role": "user",
"content": [
{
"type": "input_image",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/3/3b/LeBron_James_Layup_%28Cleveland_vs_Brooklyn_2018%29.jpg",
}
],
},
],
expected="brooklyn nets",
),
id="image_input",
),
]
# Multi-turn test cases
multi_turn_test_cases = [
pytest.param(
ResponsesTestCase(
input="", # Not used for multi-turn
expected="", # Not used for multi-turn
turns=[
("Which planet do humans live on?", "earth"),
("What is the name of the planet from your previous response?", "earth"),
],
),
id="earth",
),
]
# Web search test cases
web_search_test_cases = [
pytest.param(
ResponsesTestCase(
input="How many experts does the Llama 4 Maverick model have?",
tools=[{"type": "web_search", "search_context_size": "low"}],
expected="128",
),
id="llama_experts",
),
]
# File search test cases
file_search_test_cases = [
pytest.param(
ResponsesTestCase(
input="How many experts does the Llama 4 Maverick model have?",
tools=[{"type": "file_search"}],
expected="128",
file_content="Llama 4 Maverick has 128 experts",
),
id="llama_experts",
),
pytest.param(
ResponsesTestCase(
input="How many experts does the Llama 4 Maverick model have?",
tools=[{"type": "file_search"}],
expected="128",
file_path="pdfs/llama_stack_and_models.pdf",
),
id="llama_experts_pdf",
),
]
# MCP tool test cases
mcp_tool_test_cases = [
pytest.param(
ResponsesTestCase(
input="What is the boiling point of myawesomeliquid in Celsius?",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="Hello, world!",
),
id="boiling_point_tool",
),
]
# Custom tool test cases
custom_tool_test_cases = [
pytest.param(
ResponsesTestCase(
input="What's the weather like in San Francisco?",
tools=[
{
"type": "function",
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"additionalProperties": False,
"properties": {
"location": {
"description": "City and country e.g. Bogotá, Colombia",
"type": "string",
}
},
"required": ["location"],
"type": "object",
},
}
],
expected="", # No specific expected output for custom tools
),
id="sf_weather",
),
]
# Image test cases
image_test_cases = [
pytest.param(
ResponsesTestCase(
input=[
{
"role": "user",
"content": [
{
"type": "input_text",
"text": "Identify the type of animal in this image.",
},
{
"type": "input_image",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg",
},
],
},
],
expected="llama",
),
id="llama_image",
),
]
# Multi-turn image test cases
multi_turn_image_test_cases = [
pytest.param(
ResponsesTestCase(
input="", # Not used for multi-turn
expected="", # Not used for multi-turn
turns=[
(
[
{
"role": "user",
"content": [
{
"type": "input_text",
"text": "What type of animal is in this image? Please respond with a single word that starts with the letter 'L'.",
},
{
"type": "input_image",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg",
},
],
},
],
"llama",
),
(
"What country do you find this animal primarily in? What continent?",
"peru",
),
],
),
id="llama_image_understanding",
),
]
# Multi-turn tool execution test cases
multi_turn_tool_execution_test_cases = [
pytest.param(
ResponsesTestCase(
input="I need to check if user 'alice' can access the file 'document.txt'. First, get alice's user ID, then check if that user ID can access the file 'document.txt'. Do this as a series of steps, where each step is a separate message. Return only one tool call per step. Summarize the final result with a single 'yes' or 'no' response.",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="yes",
),
id="user_file_access_check",
),
pytest.param(
ResponsesTestCase(
input="I need to get the results for the 'boiling_point' experiment. First, get the experiment ID for 'boiling_point', then use that ID to get the experiment results. Tell me the boiling point in Celsius.",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="100°C",
),
id="experiment_results_lookup",
),
]
# Multi-turn tool execution streaming test cases
multi_turn_tool_execution_streaming_test_cases = [
pytest.param(
ResponsesTestCase(
input="Help me with this security check: First, get the user ID for 'charlie', then get the permissions for that user ID, and finally check if that user can access 'secret_file.txt'. Stream your progress as you work through each step. Return only one tool call per step. Summarize the final result with a single 'yes' or 'no' response.",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="no",
stream=True,
),
id="user_permissions_workflow",
),
pytest.param(
ResponsesTestCase(
input="I need a complete analysis: First, get the experiment ID for 'chemical_reaction', then get the results for that experiment, and tell me if the yield was above 80%. Return only one tool call per step. Please stream your analysis process.",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="85%",
stream=True,
),
id="experiment_analysis_streaming",
),
]
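These case lists feed `pytest.mark.parametrize` in the responses tests further down in this diff. A minimal sketch of that pattern, assuming it sits alongside the other responses tests so the `compat_client` and `text_model_id` fixtures resolve from conftest:

```python
import pytest

from .fixtures.test_cases import basic_test_cases


@pytest.mark.parametrize("case", basic_test_cases)
def test_basic_case_sketch(compat_client, text_model_id, case):
    # Each ResponsesTestCase bundles an input and a substring expected in the model's output.
    response = compat_client.responses.create(model=text_model_id, input=case.input, stream=False)
    assert case.expected.lower() in response.output_text.lower().strip()
```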

View file

@ -1,397 +0,0 @@
test_chat_basic:
test_name: test_chat_basic
test_params:
case:
- case_id: "earth"
input:
messages:
- content: Which planet do humans live on?
role: user
output: Earth
- case_id: "saturn"
input:
messages:
- content: Which planet has rings around it with a name starting with letter
S?
role: user
output: Saturn
test_chat_input_validation:
test_name: test_chat_input_validation
test_params:
case:
- case_id: "messages_missing"
input:
messages: []
output:
error:
status_code: 400
- case_id: "messages_role_invalid"
input:
messages:
- content: Which planet do humans live on?
role: fake_role
output:
error:
status_code: 400
- case_id: "tool_choice_invalid"
input:
messages:
- content: Which planet do humans live on?
role: user
tool_choice: invalid
output:
error:
status_code: 400
- case_id: "tool_choice_no_tools"
input:
messages:
- content: Which planet do humans live on?
role: user
tool_choice: required
output:
error:
status_code: 400
- case_id: "tools_type_invalid"
input:
messages:
- content: Which planet do humans live on?
role: user
tools:
- type: invalid
output:
error:
status_code: 400
test_chat_image:
test_name: test_chat_image
test_params:
case:
- input:
messages:
- content:
- text: What is in this image?
type: text
- image_url:
url: https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg
type: image_url
role: user
output: llama
test_chat_structured_output:
test_name: test_chat_structured_output
test_params:
case:
- case_id: "calendar"
input:
messages:
- content: Extract the event information.
role: system
- content: Alice and Bob are going to a science fair on Friday.
role: user
response_format:
json_schema:
name: calendar_event
schema:
properties:
date:
title: Date
type: string
name:
title: Name
type: string
participants:
items:
type: string
title: Participants
type: array
required:
- name
- date
- participants
title: CalendarEvent
type: object
type: json_schema
output: valid_calendar_event
- case_id: "math"
input:
messages:
- content: You are a helpful math tutor. Guide the user through the solution
step by step.
role: system
- content: how can I solve 8x + 7 = -23
role: user
response_format:
json_schema:
name: math_reasoning
schema:
$defs:
Step:
properties:
explanation:
title: Explanation
type: string
output:
title: Output
type: string
required:
- explanation
- output
title: Step
type: object
properties:
final_answer:
title: Final Answer
type: string
steps:
items:
$ref: '#/$defs/Step'
title: Steps
type: array
required:
- steps
- final_answer
title: MathReasoning
type: object
type: json_schema
output: valid_math_reasoning
test_tool_calling:
test_name: test_tool_calling
test_params:
case:
- input:
messages:
- content: You are a helpful assistant that can use tools to get information.
role: system
- content: What's the weather like in San Francisco?
role: user
tools:
- function:
description: Get current temperature for a given location.
name: get_weather
parameters:
additionalProperties: false
properties:
location:
description: "City and country e.g. Bogot\xE1, Colombia"
type: string
required:
- location
type: object
type: function
output: get_weather_tool_call
test_chat_multi_turn_tool_calling:
test_name: test_chat_multi_turn_tool_calling
test_params:
case:
- case_id: "text_then_weather_tool"
input:
messages:
- - role: user
content: "What's the name of the Sun in latin?"
- - role: user
content: "What's the weather like in San Francisco?"
tools:
- function:
description: Get the current weather
name: get_weather
parameters:
type: object
properties:
location:
description: "The city and state (both required), e.g. San Francisco, CA."
type: string
required: ["location"]
type: function
tool_responses:
- response: "{'response': '70 degrees and foggy'}"
expected:
- num_tool_calls: 0
answer: ["sol"]
- num_tool_calls: 1
tool_name: get_weather
tool_arguments:
location: "San Francisco, CA"
- num_tool_calls: 0
answer: ["foggy", "70 degrees"]
- case_id: "weather_tool_then_text"
input:
messages:
- - role: user
content: "What's the weather like in San Francisco?"
tools:
- function:
description: Get the current weather
name: get_weather
parameters:
type: object
properties:
location:
description: "The city and state (both required), e.g. San Francisco, CA."
type: string
required: ["location"]
type: function
tool_responses:
- response: "{'response': '70 degrees and foggy'}"
expected:
- num_tool_calls: 1
tool_name: get_weather
tool_arguments:
location: "San Francisco, CA"
- num_tool_calls: 0
answer: ["foggy", "70 degrees"]
- case_id: "add_product_tool"
input:
messages:
- - role: user
content: "Please add a new product with name 'Widget', price 19.99, in stock, and tags ['new', 'sale'] and give me the product id."
tools:
- function:
description: Add a new product
name: addProduct
parameters:
type: object
properties:
name:
description: "Name of the product"
type: string
price:
description: "Price of the product"
type: number
inStock:
description: "Availability status of the product."
type: boolean
tags:
description: "List of product tags"
type: array
items:
type: string
required: ["name", "price", "inStock"]
type: function
tool_responses:
- response: "{'response': 'Successfully added product with id: 123'}"
expected:
- num_tool_calls: 1
tool_name: addProduct
tool_arguments:
name: "Widget"
price: 19.99
inStock: true
tags:
- "new"
- "sale"
- num_tool_calls: 0
answer: ["123", "product id: 123"]
- case_id: "get_then_create_event_tool"
input:
messages:
- - role: system
content: "Todays date is 2025-03-01."
- role: user
content: "Do i have any meetings on March 3rd at 10 am? Yes or no?"
- - role: user
content: "Alright then, Create an event named 'Team Building', scheduled for that time same time, in the 'Main Conference Room' and add Alice, Bob, Charlie to it. Give me the created event id."
tools:
- function:
description: Create a new event
name: create_event
parameters:
type: object
properties:
name:
description: "Name of the event"
type: string
date:
description: "Date of the event in ISO format"
type: string
time:
description: "Event Time (HH:MM)"
type: string
location:
description: "Location of the event"
type: string
participants:
description: "List of participant names"
type: array
items:
type: string
required: ["name", "date", "time", "location", "participants"]
type: function
- function:
description: Get an event by date and time
name: get_event
parameters:
type: object
properties:
date:
description: "Date of the event in ISO format"
type: string
time:
description: "Event Time (HH:MM)"
type: string
required: ["date", "time"]
type: function
tool_responses:
- response: "{'response': 'No events found for 2025-03-03 at 10:00'}"
- response: "{'response': 'Successfully created new event with id: e_123'}"
expected:
- num_tool_calls: 1
tool_name: get_event
tool_arguments:
date: "2025-03-03"
time: "10:00"
- num_tool_calls: 0
answer: ["no", "no events found", "no meetings"]
- num_tool_calls: 1
tool_name: create_event
tool_arguments:
name: "Team Building"
date: "2025-03-03"
time: "10:00"
location: "Main Conference Room"
participants:
- "Alice"
- "Bob"
- "Charlie"
- num_tool_calls: 0
answer: ["e_123", "event id: e_123"]
- case_id: "compare_monthly_expense_tool"
input:
messages:
- - role: system
content: "Todays date is 2025-03-01."
- role: user
content: "what was my monthly expense in Jan of this year?"
- - role: user
content: "Was it less than Feb of last year? Only answer with yes or no."
tools:
- function:
description: Get monthly expense summary
name: getMonthlyExpenseSummary
parameters:
type: object
properties:
month:
description: "Month of the year (1-12)"
type: integer
year:
description: "Year"
type: integer
required: ["month", "year"]
type: function
tool_responses:
- response: "{'response': 'Total expenses for January 2025: $1000'}"
- response: "{'response': 'Total expenses for February 2024: $2000'}"
expected:
- num_tool_calls: 1
tool_name: getMonthlyExpenseSummary
tool_arguments:
month: 1
year: 2025
- num_tool_calls: 0
answer: ["1000", "$1,000", "1,000"]
- num_tool_calls: 1
tool_name: getMonthlyExpenseSummary
tool_arguments:
month: 2
year: 2024
- num_tool_calls: 0
answer: ["yes"]

View file

@ -1,166 +0,0 @@
test_response_basic:
test_name: test_response_basic
test_params:
case:
- case_id: "earth"
input: "Which planet do humans live on?"
output: "earth"
- case_id: "saturn"
input: "Which planet has rings around it with a name starting with letter S?"
output: "saturn"
- case_id: "image_input"
input:
- role: user
content:
- type: input_text
text: "what teams are playing in this image?"
- role: user
content:
- type: input_image
image_url: "https://upload.wikimedia.org/wikipedia/commons/3/3b/LeBron_James_Layup_%28Cleveland_vs_Brooklyn_2018%29.jpg"
output: "brooklyn nets"
test_response_multi_turn:
test_name: test_response_multi_turn
test_params:
case:
- case_id: "earth"
turns:
- input: "Which planet do humans live on?"
output: "earth"
- input: "What is the name of the planet from your previous response?"
output: "earth"
test_response_web_search:
test_name: test_response_web_search
test_params:
case:
- case_id: "llama_experts"
input: "How many experts does the Llama 4 Maverick model have?"
tools:
- type: web_search
search_context_size: "low"
output: "128"
test_response_file_search:
test_name: test_response_file_search
test_params:
case:
- case_id: "llama_experts"
input: "How many experts does the Llama 4 Maverick model have?"
tools:
- type: file_search
# vector_store_ids param for file_search tool gets added by the test runner
file_content: "Llama 4 Maverick has 128 experts"
output: "128"
- case_id: "llama_experts_pdf"
input: "How many experts does the Llama 4 Maverick model have?"
tools:
- type: file_search
# vector_store_ids param for file_search tool gets added by the test runner
file_path: "pdfs/llama_stack_and_models.pdf"
output: "128"
test_response_mcp_tool:
test_name: test_response_mcp_tool
test_params:
case:
- case_id: "boiling_point_tool"
input: "What is the boiling point of myawesomeliquid in Celsius?"
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
output: "Hello, world!"
test_response_custom_tool:
test_name: test_response_custom_tool
test_params:
case:
- case_id: "sf_weather"
input: "What's the weather like in San Francisco?"
tools:
- type: function
name: get_weather
description: Get current temperature for a given location.
parameters:
additionalProperties: false
properties:
location:
description: "City and country e.g. Bogot\xE1, Colombia"
type: string
required:
- location
type: object
test_response_image:
test_name: test_response_image
test_params:
case:
- case_id: "llama_image"
input:
- role: user
content:
- type: input_text
text: "Identify the type of animal in this image."
- type: input_image
image_url: "https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg"
output: "llama"
# the models are really poor at tool calling after seeing images :/
test_response_multi_turn_image:
test_name: test_response_multi_turn_image
test_params:
case:
- case_id: "llama_image_understanding"
turns:
- input:
- role: user
content:
- type: input_text
text: "What type of animal is in this image? Please respond with a single word that starts with the letter 'L'."
- type: input_image
image_url: "https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg"
output: "llama"
- input: "What country do you find this animal primarily in? What continent?"
output: "peru"
test_response_multi_turn_tool_execution:
test_name: test_response_multi_turn_tool_execution
test_params:
case:
- case_id: "user_file_access_check"
input: "I need to check if user 'alice' can access the file 'document.txt'. First, get alice's user ID, then check if that user ID can access the file 'document.txt'. Do this as a series of steps, where each step is a separate message. Return only one tool call per step. Summarize the final result with a single 'yes' or 'no' response."
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
output: "yes"
- case_id: "experiment_results_lookup"
input: "I need to get the results for the 'boiling_point' experiment. First, get the experiment ID for 'boiling_point', then use that ID to get the experiment results. Tell me the boiling point in Celsius."
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
output: "100°C"
test_response_multi_turn_tool_execution_streaming:
test_name: test_response_multi_turn_tool_execution_streaming
test_params:
case:
- case_id: "user_permissions_workflow"
input: "Help me with this security check: First, get the user ID for 'charlie', then get the permissions for that user ID, and finally check if that user can access 'secret_file.txt'. Stream your progress as you work through each step. Return only one tool call per step. Summarize the final result with a single 'yes' or 'no' response."
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
stream: true
output: "no"
- case_id: "experiment_analysis_streaming"
input: "I need a complete analysis: First, get the experiment ID for 'chemical_reaction', then get the results for that experiment, and tell me if the yield was above 80%. Return only one tool call per step. Please stream your analysis process."
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
stream: true
output: "85%"

View file

@ -0,0 +1,64 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import time
def new_vector_store(openai_client, name):
"""Create a new vector store, cleaning up any existing one with the same name."""
# Ensure we don't reuse an existing vector store
vector_stores = openai_client.vector_stores.list()
for vector_store in vector_stores:
if vector_store.name == name:
openai_client.vector_stores.delete(vector_store_id=vector_store.id)
# Create a new vector store
vector_store = openai_client.vector_stores.create(name=name)
return vector_store
def upload_file(openai_client, name, file_path):
"""Upload a file, cleaning up any existing file with the same name."""
# Ensure we don't reuse an existing file
files = openai_client.files.list()
for file in files:
if file.filename == name:
openai_client.files.delete(file_id=file.id)
# Upload a text file with our document content
return openai_client.files.create(file=open(file_path, "rb"), purpose="assistants")
def wait_for_file_attachment(compat_client, vector_store_id, file_id):
"""Wait for a file to be attached to a vector store."""
file_attach_response = compat_client.vector_stores.files.retrieve(
vector_store_id=vector_store_id,
file_id=file_id,
)
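# Poll until the attachment leaves 'in_progress'; this assumes the server eventually reports a terminal status (there is no explicit timeout here)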
while file_attach_response.status == "in_progress":
time.sleep(0.1)
file_attach_response = compat_client.vector_stores.files.retrieve(
vector_store_id=vector_store_id,
file_id=file_id,
)
assert file_attach_response.status == "completed", f"Expected file to be attached, got {file_attach_response}"
assert not file_attach_response.last_error
return file_attach_response
def setup_mcp_tools(tools, mcp_server_info):
"""Replace placeholder MCP server URLs with actual server info."""
# Create a deep copy to avoid modifying the original test case
import copy
tools_copy = copy.deepcopy(tools)
for tool in tools_copy:
if tool["type"] == "mcp" and tool["server_url"] == "<FILLED_BY_TEST_RUNNER>":
tool["server_url"] = mcp_server_info["server_url"]
return tools_copy
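A rough sketch of how these helpers combine in a file-search test, mirroring the flow used in the tool tests below. The function name, store name, and file name here are illustrative, and `compat_client` is assumed to come from the suite's fixtures:

```python
from .helpers import new_vector_store, upload_file, wait_for_file_attachment


def attach_example_file(compat_client, file_path: str) -> str:
    # Create (or recreate) a store, upload the file, attach it, then wait for indexing to finish.
    vector_store = new_vector_store(compat_client, "example_store")
    file_response = upload_file(compat_client, "example.txt", file_path)
    compat_client.vector_stores.files.create(vector_store_id=vector_store.id, file_id=file_response.id)
    wait_for_file_attachment(compat_client, vector_store.id, file_response.id)
    return vector_store.id
```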

View file

@ -0,0 +1,145 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from typing import Any
class StreamingValidator:
"""Helper class for validating streaming response events."""
def __init__(self, chunks: list[Any]):
self.chunks = chunks
self.event_types = [chunk.type for chunk in chunks]
def assert_basic_event_sequence(self):
"""Verify basic created -> completed event sequence."""
assert len(self.chunks) >= 2, f"Expected at least 2 chunks (created + completed), got {len(self.chunks)}"
assert self.chunks[0].type == "response.created", (
f"First chunk should be response.created, got {self.chunks[0].type}"
)
assert self.chunks[-1].type == "response.completed", (
f"Last chunk should be response.completed, got {self.chunks[-1].type}"
)
# Verify event order
created_index = self.event_types.index("response.created")
completed_index = self.event_types.index("response.completed")
assert created_index < completed_index, "response.created should come before response.completed"
def assert_response_consistency(self):
"""Verify response ID consistency across events."""
response_ids = set()
for chunk in self.chunks:
if hasattr(chunk, "response_id"):
response_ids.add(chunk.response_id)
elif hasattr(chunk, "response") and hasattr(chunk.response, "id"):
response_ids.add(chunk.response.id)
assert len(response_ids) == 1, f"All events should reference the same response_id, found: {response_ids}"
def assert_has_incremental_content(self):
"""Verify that content is delivered incrementally via delta events."""
delta_events = [
i for i, event_type in enumerate(self.event_types) if event_type == "response.output_text.delta"
]
assert len(delta_events) > 0, "Expected delta events for true incremental streaming, but found none"
# Verify delta events have content
non_empty_deltas = 0
delta_content_total = ""
for delta_idx in delta_events:
chunk = self.chunks[delta_idx]
if hasattr(chunk, "delta") and chunk.delta:
delta_content_total += chunk.delta
non_empty_deltas += 1
assert non_empty_deltas > 0, "Delta events found but none contain content"
assert len(delta_content_total) > 0, "Delta events found but total delta content is empty"
return delta_content_total
def assert_content_quality(self, expected_content: str):
"""Verify the final response contains expected content."""
final_chunk = self.chunks[-1]
if hasattr(final_chunk, "response"):
output_text = final_chunk.response.output_text.lower().strip()
assert len(output_text) > 0, "Response should have content"
assert expected_content.lower() in output_text, f"Expected '{expected_content}' in response"
def assert_has_tool_calls(self):
"""Verify tool call streaming events are present."""
# Check for tool call events
delta_events = [
chunk
for chunk in self.chunks
if chunk.type in ["response.function_call_arguments.delta", "response.mcp_call.arguments.delta"]
]
done_events = [
chunk
for chunk in self.chunks
if chunk.type in ["response.function_call_arguments.done", "response.mcp_call.arguments.done"]
]
assert len(delta_events) > 0, f"Expected tool call delta events, got chunk types: {self.event_types}"
assert len(done_events) > 0, f"Expected tool call done events, got chunk types: {self.event_types}"
# Verify output item events
item_added_events = [chunk for chunk in self.chunks if chunk.type == "response.output_item.added"]
item_done_events = [chunk for chunk in self.chunks if chunk.type == "response.output_item.done"]
assert len(item_added_events) > 0, (
f"Expected response.output_item.added events, got chunk types: {self.event_types}"
)
assert len(item_done_events) > 0, (
f"Expected response.output_item.done events, got chunk types: {self.event_types}"
)
def assert_has_mcp_events(self):
"""Verify MCP-specific streaming events are present."""
# Tool execution progress events
mcp_in_progress_events = [chunk for chunk in self.chunks if chunk.type == "response.mcp_call.in_progress"]
mcp_completed_events = [chunk for chunk in self.chunks if chunk.type == "response.mcp_call.completed"]
assert len(mcp_in_progress_events) > 0, (
f"Expected response.mcp_call.in_progress events, got chunk types: {self.event_types}"
)
assert len(mcp_completed_events) > 0, (
f"Expected response.mcp_call.completed events, got chunk types: {self.event_types}"
)
# MCP list tools events
mcp_list_tools_in_progress_events = [
chunk for chunk in self.chunks if chunk.type == "response.mcp_list_tools.in_progress"
]
mcp_list_tools_completed_events = [
chunk for chunk in self.chunks if chunk.type == "response.mcp_list_tools.completed"
]
assert len(mcp_list_tools_in_progress_events) > 0, (
f"Expected response.mcp_list_tools.in_progress events, got chunk types: {self.event_types}"
)
assert len(mcp_list_tools_completed_events) > 0, (
f"Expected response.mcp_list_tools.completed events, got chunk types: {self.event_types}"
)
def assert_rich_streaming(self, min_chunks: int = 10):
"""Verify we have substantial streaming activity."""
assert len(self.chunks) > min_chunks, (
f"Expected rich streaming with many events, got only {len(self.chunks)} chunks"
)
def validate_event_structure(self):
"""Validate the structure of various event types."""
for chunk in self.chunks:
if chunk.type == "response.created":
assert chunk.response.status == "in_progress"
elif chunk.type == "response.completed":
assert chunk.response.status == "completed"
elif hasattr(chunk, "item_id"):
assert chunk.item_id, "Events with item_id should have non-empty item_id"
elif hasattr(chunk, "sequence_number"):
assert isinstance(chunk.sequence_number, int), "sequence_number should be an integer"
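A typical call pattern for the validator, matching how the streaming tests later in this diff use it (the `compat_client` and `text_model_id` fixtures are assumed to be provided by conftest):

```python
from .streaming_assertions import StreamingValidator


def test_streaming_sketch(compat_client, text_model_id):
    # Collect every streamed chunk, then run the shared assertions over the full sequence.
    chunks = list(
        compat_client.responses.create(model=text_model_id, input="Which planet do humans live on?", stream=True)
    )
    validator = StreamingValidator(chunks)
    validator.assert_basic_event_sequence()
    validator.assert_response_consistency()
    validator.assert_has_incremental_content()
```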

View file

@ -0,0 +1,189 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import time
import pytest
from .fixtures.test_cases import basic_test_cases, image_test_cases, multi_turn_image_test_cases, multi_turn_test_cases
from .streaming_assertions import StreamingValidator
@pytest.mark.parametrize("case", basic_test_cases)
def test_response_non_streaming_basic(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
stream=False,
)
output_text = response.output_text.lower().strip()
assert len(output_text) > 0
assert case.expected.lower() in output_text
retrieved_response = compat_client.responses.retrieve(response_id=response.id)
assert retrieved_response.output_text == response.output_text
next_response = compat_client.responses.create(
model=text_model_id,
input="Repeat your previous response in all caps.",
previous_response_id=response.id,
)
next_output_text = next_response.output_text.strip()
assert case.expected.upper() in next_output_text
@pytest.mark.parametrize("case", basic_test_cases)
def test_response_streaming_basic(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
stream=True,
)
# Track events and timing to verify proper streaming
events = []
event_times = []
response_id = ""
start_time = time.time()
for chunk in response:
current_time = time.time()
event_times.append(current_time - start_time)
events.append(chunk)
if chunk.type == "response.created":
# Verify response.created is emitted first and immediately
assert len(events) == 1, "response.created should be the first event"
assert event_times[0] < 0.1, "response.created should be emitted immediately"
assert chunk.response.status == "in_progress"
response_id = chunk.response.id
elif chunk.type == "response.completed":
# Verify response.completed comes after response.created
assert len(events) >= 2, "response.completed should come after response.created"
assert chunk.response.status == "completed"
assert chunk.response.id == response_id, "Response ID should be consistent"
# Verify content quality
output_text = chunk.response.output_text.lower().strip()
assert len(output_text) > 0, "Response should have content"
assert case.expected.lower() in output_text, f"Expected '{case.expected}' in response"
# Use validator for common checks
validator = StreamingValidator(events)
validator.assert_basic_event_sequence()
validator.assert_response_consistency()
# Verify stored response matches streamed response
retrieved_response = compat_client.responses.retrieve(response_id=response_id)
final_event = events[-1]
assert retrieved_response.output_text == final_event.response.output_text
@pytest.mark.parametrize("case", basic_test_cases)
def test_response_streaming_incremental_content(compat_client, text_model_id, case):
"""Test that streaming actually delivers content incrementally, not just at the end."""
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
stream=True,
)
# Track all events and their content to verify incremental streaming
events = []
content_snapshots = []
event_times = []
start_time = time.time()
for chunk in response:
current_time = time.time()
event_times.append(current_time - start_time)
events.append(chunk)
# Track content at each event based on event type
if chunk.type == "response.output_text.delta":
# For delta events, track the delta content
content_snapshots.append(chunk.delta)
elif hasattr(chunk, "response") and hasattr(chunk.response, "output_text"):
# For response.created/completed events, track the full output_text
content_snapshots.append(chunk.response.output_text)
else:
content_snapshots.append("")
validator = StreamingValidator(events)
validator.assert_basic_event_sequence()
# Check if we have incremental content updates
event_types = [event.type for event in events]
created_index = event_types.index("response.created")
completed_index = event_types.index("response.completed")
# The key test: verify content progression
created_content = content_snapshots[created_index]
completed_content = content_snapshots[completed_index]
# Verify that response.created has empty or minimal content
assert len(created_content) == 0, f"response.created should have empty content, got: {repr(created_content[:100])}"
# Verify that response.completed has the full content
assert len(completed_content) > 0, "response.completed should have content"
assert case.expected.lower() in completed_content.lower(), f"Expected '{case.expected}' in final content"
# Use validator for incremental content checks
delta_content_total = validator.assert_has_incremental_content()
# Verify that the accumulated delta content matches the final content
assert delta_content_total.strip() == completed_content.strip(), (
f"Delta content '{delta_content_total}' should match final content '{completed_content}'"
)
# Verify timing: delta events should come between created and completed
delta_events = [i for i, event_type in enumerate(event_types) if event_type == "response.output_text.delta"]
for delta_idx in delta_events:
assert created_index < delta_idx < completed_index, (
f"Delta event at index {delta_idx} should be between created ({created_index}) and completed ({completed_index})"
)
@pytest.mark.parametrize("case", multi_turn_test_cases)
def test_response_non_streaming_multi_turn(compat_client, text_model_id, case):
previous_response_id = None
for turn_input, turn_expected in case.turns:
response = compat_client.responses.create(
model=text_model_id,
input=turn_input,
previous_response_id=previous_response_id,
)
previous_response_id = response.id
output_text = response.output_text.lower()
assert turn_expected.lower() in output_text
@pytest.mark.parametrize("case", image_test_cases)
def test_response_non_streaming_image(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
stream=False,
)
output_text = response.output_text.lower()
assert case.expected.lower() in output_text
@pytest.mark.parametrize("case", multi_turn_image_test_cases)
def test_response_non_streaming_multi_turn_image(compat_client, text_model_id, case):
previous_response_id = None
for turn_input, turn_expected in case.turns:
response = compat_client.responses.create(
model=text_model_id,
input=turn_input,
previous_response_id=previous_response_id,
)
previous_response_id = response.id
output_text = response.output_text.lower()
assert turn_expected.lower() in output_text

View file

@ -0,0 +1,318 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import json
import time
import pytest
from llama_stack import LlamaStackAsLibraryClient
from .helpers import new_vector_store, upload_file
@pytest.mark.parametrize(
"text_format",
# Not testing json_object because most providers don't actually support it.
[
{"type": "text"},
{
"type": "json_schema",
"name": "capitals",
"description": "A schema for the capital of each country",
"schema": {"type": "object", "properties": {"capital": {"type": "string"}}},
"strict": True,
},
],
)
def test_response_text_format(compat_client, text_model_id, text_format):
if isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("Responses API text format is not yet supported in library client.")
stream = False
response = compat_client.responses.create(
model=text_model_id,
input="What is the capital of France?",
stream=stream,
text={"format": text_format},
)
# by_alias=True is needed because otherwise Pydantic renames our "schema" field
assert response.text.format.model_dump(exclude_none=True, by_alias=True) == text_format
assert "paris" in response.output_text.lower()
if text_format["type"] == "json_schema":
assert "paris" in json.loads(response.output_text)["capital"].lower()
@pytest.fixture
def vector_store_with_filtered_files(compat_client, text_model_id, tmp_path_factory):
"""Create a vector store with multiple files that have different attributes for filtering tests."""
if isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("Responses API file search is not yet supported in library client.")
vector_store = new_vector_store(compat_client, "test_vector_store_with_filters")
tmp_path = tmp_path_factory.mktemp("filter_test_files")
# Create multiple files with different attributes
files_data = [
{
"name": "us_marketing_q1.txt",
"content": "US promotional campaigns for Q1 2023. Revenue increased by 15% in the US region.",
"attributes": {
"region": "us",
"category": "marketing",
"date": 1672531200, # Jan 1, 2023
},
},
{
"name": "us_engineering_q2.txt",
"content": "US technical updates for Q2 2023. New features deployed in the US region.",
"attributes": {
"region": "us",
"category": "engineering",
"date": 1680307200, # Apr 1, 2023
},
},
{
"name": "eu_marketing_q1.txt",
"content": "European advertising campaign results for Q1 2023. Strong growth in EU markets.",
"attributes": {
"region": "eu",
"category": "marketing",
"date": 1672531200, # Jan 1, 2023
},
},
{
"name": "asia_sales_q3.txt",
"content": "Asia Pacific revenue figures for Q3 2023. Record breaking quarter in Asia.",
"attributes": {
"region": "asia",
"category": "sales",
"date": 1688169600, # Jul 1, 2023
},
},
]
file_ids = []
for file_data in files_data:
# Create file
file_path = tmp_path / file_data["name"]
file_path.write_text(file_data["content"])
# Upload file
file_response = upload_file(compat_client, file_data["name"], str(file_path))
file_ids.append(file_response.id)
# Attach file to vector store with attributes
file_attach_response = compat_client.vector_stores.files.create(
vector_store_id=vector_store.id,
file_id=file_response.id,
attributes=file_data["attributes"],
)
# Wait for attachment
while file_attach_response.status == "in_progress":
time.sleep(0.1)
file_attach_response = compat_client.vector_stores.files.retrieve(
vector_store_id=vector_store.id,
file_id=file_response.id,
)
assert file_attach_response.status == "completed"
yield vector_store
# Cleanup: delete vector store and files
try:
compat_client.vector_stores.delete(vector_store_id=vector_store.id)
for file_id in file_ids:
try:
compat_client.files.delete(file_id=file_id)
except Exception:
pass # File might already be deleted
except Exception:
pass # Best effort cleanup
def test_response_file_search_filter_by_region(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with region equality filter."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {"type": "eq", "key": "region", "value": "us"},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="What are the updates from the US region?",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
# Verify file search was called with US filter
assert len(response.output) > 1
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should only return US files (not EU or Asia files)
for result in response.output[0].results:
assert "us" in result.text.lower() or "US" in result.text
# Ensure non-US regions are NOT returned
assert "european" not in result.text.lower()
assert "asia" not in result.text.lower()
def test_response_file_search_filter_by_category(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with category equality filter."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {"type": "eq", "key": "category", "value": "marketing"},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="Show me all marketing reports",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should only return marketing files (not engineering or sales)
for result in response.output[0].results:
# Marketing files should have promotional/advertising content
assert "promotional" in result.text.lower() or "advertising" in result.text.lower()
# Ensure non-marketing categories are NOT returned
assert "technical" not in result.text.lower()
assert "revenue figures" not in result.text.lower()
def test_response_file_search_filter_by_date_range(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with date range filter using compound AND."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {
"type": "and",
"filters": [
{
"type": "gte",
"key": "date",
"value": 1672531200, # Jan 1, 2023
},
{
"type": "lt",
"key": "date",
"value": 1680307200, # Apr 1, 2023
},
],
},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="What happened in Q1 2023?",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should only return Q1 files (not Q2 or Q3)
for result in response.output[0].results:
assert "q1" in result.text.lower()
# Ensure non-Q1 quarters are NOT returned
assert "q2" not in result.text.lower()
assert "q3" not in result.text.lower()
def test_response_file_search_filter_compound_and(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with compound AND filter (region AND category)."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {
"type": "and",
"filters": [
{"type": "eq", "key": "region", "value": "us"},
{"type": "eq", "key": "category", "value": "engineering"},
],
},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="What are the engineering updates from the US?",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should only return US engineering files
assert len(response.output[0].results) >= 1
for result in response.output[0].results:
assert "us" in result.text.lower() and "technical" in result.text.lower()
# Ensure it's not from other regions or categories
assert "european" not in result.text.lower() and "asia" not in result.text.lower()
assert "promotional" not in result.text.lower() and "revenue" not in result.text.lower()
def test_response_file_search_filter_compound_or(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with compound OR filter (marketing OR sales)."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {
"type": "or",
"filters": [
{"type": "eq", "key": "category", "value": "marketing"},
{"type": "eq", "key": "category", "value": "sales"},
],
},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="Show me marketing and sales documents",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should return marketing and sales files, but NOT engineering
categories_found = set()
for result in response.output[0].results:
text_lower = result.text.lower()
if "promotional" in text_lower or "advertising" in text_lower:
categories_found.add("marketing")
if "revenue figures" in text_lower:
categories_found.add("sales")
# Ensure engineering files are NOT returned
assert "technical" not in text_lower, f"Engineering file should not be returned, but got: {result.text}"
# Verify we got at least one of the expected categories
assert len(categories_found) > 0, "Should have found at least one marketing or sales file"
assert categories_found.issubset({"marketing", "sales"}), f"Found unexpected categories: {categories_found}"

File diff suppressed because it is too large

View file

@ -0,0 +1,474 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import json
import os
import httpx
import openai
import pytest
from llama_stack import LlamaStackAsLibraryClient
from llama_stack.core.datatypes import AuthenticationRequiredError
from tests.common.mcp import dependency_tools, make_mcp_server
from .fixtures.test_cases import (
custom_tool_test_cases,
file_search_test_cases,
mcp_tool_test_cases,
multi_turn_tool_execution_streaming_test_cases,
multi_turn_tool_execution_test_cases,
web_search_test_cases,
)
from .helpers import new_vector_store, setup_mcp_tools, upload_file, wait_for_file_attachment
from .streaming_assertions import StreamingValidator
@pytest.mark.parametrize("case", web_search_test_cases)
def test_response_non_streaming_web_search(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=case.tools,
stream=False,
)
assert len(response.output) > 1
assert response.output[0].type == "web_search_call"
assert response.output[0].status == "completed"
assert response.output[1].type == "message"
assert response.output[1].status == "completed"
assert response.output[1].role == "assistant"
assert len(response.output[1].content) > 0
assert case.expected.lower() in response.output_text.lower().strip()
@pytest.mark.parametrize("case", file_search_test_cases)
def test_response_non_streaming_file_search(compat_client, text_model_id, tmp_path, case):
if isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("Responses API file search is not yet supported in library client.")
vector_store = new_vector_store(compat_client, "test_vector_store")
if case.file_content:
file_name = "test_response_non_streaming_file_search.txt"
file_path = tmp_path / file_name
file_path.write_text(case.file_content)
elif case.file_path:
file_path = os.path.join(os.path.dirname(__file__), "fixtures", case.file_path)
file_name = os.path.basename(file_path)
else:
raise ValueError("No file content or path provided for case")
file_response = upload_file(compat_client, file_name, file_path)
# Attach our file to the vector store
compat_client.vector_stores.files.create(
vector_store_id=vector_store.id,
file_id=file_response.id,
)
# Wait for the file to be attached
wait_for_file_attachment(compat_client, vector_store.id, file_response.id)
# Update our tools with the right vector store id
tools = case.tools
for tool in tools:
if tool["type"] == "file_search":
tool["vector_store_ids"] = [vector_store.id]
# Create the response request, which should query our vector store
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
include=["file_search_call.results"],
)
# Verify the file_search_tool was called
assert len(response.output) > 1
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].queries # ensure it's some non-empty list
assert response.output[0].results
assert case.expected.lower() in response.output[0].results[0].text.lower()
assert response.output[0].results[0].score > 0
# Verify the output_text generated by the response
assert case.expected.lower() in response.output_text.lower().strip()
def test_response_non_streaming_file_search_empty_vector_store(compat_client, text_model_id):
if isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("Responses API file search is not yet supported in library client.")
vector_store = new_vector_store(compat_client, "test_vector_store")
# Create the response request, which should query our vector store
response = compat_client.responses.create(
model=text_model_id,
input="How many experts does the Llama 4 Maverick model have?",
tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
stream=False,
include=["file_search_call.results"],
)
# Verify the file_search_tool was called
assert len(response.output) > 1
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].queries # ensure it's some non-empty list
assert not response.output[0].results # ensure we don't get any results
# Verify some output_text was generated by the response
assert response.output_text
@pytest.mark.parametrize("case", mcp_tool_test_cases)
def test_response_non_streaming_mcp_tool(compat_client, text_model_id, case):
if not isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("in-process MCP server is only supported in library client")
with make_mcp_server() as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
)
assert len(response.output) >= 3
list_tools = response.output[0]
assert list_tools.type == "mcp_list_tools"
assert list_tools.server_label == "localmcp"
assert len(list_tools.tools) == 2
assert {t.name for t in list_tools.tools} == {
"get_boiling_point",
"greet_everyone",
}
call = response.output[1]
assert call.type == "mcp_call"
assert call.name == "get_boiling_point"
assert json.loads(call.arguments) == {
"liquid_name": "myawesomeliquid",
"celsius": True,
}
assert call.error is None
assert "-100" in call.output
# sometimes the model will call the tool again, so we need to get the last message
message = response.output[-1]
text_content = message.content[0].text
assert "boiling point" in text_content.lower()
with make_mcp_server(required_auth_token="test-token") as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
exc_type = (
AuthenticationRequiredError
if isinstance(compat_client, LlamaStackAsLibraryClient)
else (httpx.HTTPStatusError, openai.AuthenticationError)
)
with pytest.raises(exc_type):
compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
)
for tool in tools:
if tool["type"] == "mcp":
tool["headers"] = {"Authorization": "Bearer test-token"}
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
)
assert len(response.output) >= 3
@pytest.mark.parametrize("case", mcp_tool_test_cases)
def test_response_sequential_mcp_tool(compat_client, text_model_id, case):
if not isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("in-process MCP server is only supported in library client")
with make_mcp_server() as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
)
assert len(response.output) >= 3
list_tools = response.output[0]
assert list_tools.type == "mcp_list_tools"
assert list_tools.server_label == "localmcp"
assert len(list_tools.tools) == 2
assert {t.name for t in list_tools.tools} == {
"get_boiling_point",
"greet_everyone",
}
call = response.output[1]
assert call.type == "mcp_call"
assert call.name == "get_boiling_point"
assert json.loads(call.arguments) == {
"liquid_name": "myawesomeliquid",
"celsius": True,
}
assert call.error is None
assert "-100" in call.output
# sometimes the model will call the tool again, so we need to get the last message
message = response.output[-1]
text_content = message.content[0].text
assert "boiling point" in text_content.lower()
response2 = compat_client.responses.create(
model=text_model_id, input=case.input, tools=tools, stream=False, previous_response_id=response.id
)
assert len(response2.output) >= 1
message = response2.output[-1]
text_content = message.content[0].text
assert "boiling point" in text_content.lower()
@pytest.mark.parametrize("case", custom_tool_test_cases)
def test_response_non_streaming_custom_tool(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=case.tools,
stream=False,
)
assert len(response.output) == 1
assert response.output[0].type == "function_call"
assert response.output[0].status == "completed"
assert response.output[0].name == "get_weather"
@pytest.mark.parametrize("case", custom_tool_test_cases)
def test_response_function_call_ordering_1(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=case.tools,
stream=False,
)
assert len(response.output) == 1
assert response.output[0].type == "function_call"
assert response.output[0].status == "completed"
assert response.output[0].name == "get_weather"
inputs = []
inputs.append(
{
"role": "user",
"content": case.input,
}
)
inputs.append(
{
"type": "function_call_output",
"output": "It is raining.",
"call_id": response.output[0].call_id,
}
)
response = compat_client.responses.create(
model=text_model_id, input=inputs, tools=case.tools, stream=False, previous_response_id=response.id
)
assert len(response.output) == 1
def test_response_function_call_ordering_2(compat_client, text_model_id):
tools = [
{
"type": "function",
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"additionalProperties": False,
"properties": {
"location": {
"description": "City and country e.g. Bogotá, Colombia",
"type": "string",
}
},
"required": ["location"],
"type": "object",
},
}
]
inputs = [
{
"role": "user",
"content": "Is the weather better in San Francisco or Los Angeles?",
}
]
response = compat_client.responses.create(
model=text_model_id,
input=inputs,
tools=tools,
stream=False,
)
for output in response.output:
if output.type == "function_call" and output.status == "completed" and output.name == "get_weather":
inputs.append(output)
for output in response.output:
if output.type == "function_call" and output.status == "completed" and output.name == "get_weather":
weather = "It is raining."
if "Los Angeles" in output.arguments:
weather = "It is cloudy."
inputs.append(
{
"type": "function_call_output",
"output": weather,
"call_id": output.call_id,
}
)
response = compat_client.responses.create(
model=text_model_id,
input=inputs,
tools=tools,
stream=False,
)
assert len(response.output) == 1
assert "Los Angeles" in response.output_text
@pytest.mark.parametrize("case", multi_turn_tool_execution_test_cases)
def test_response_non_streaming_multi_turn_tool_execution(compat_client, text_model_id, case):
"""Test multi-turn tool execution where multiple MCP tool calls are performed in sequence."""
if not isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("in-process MCP server is only supported in library client")
with make_mcp_server(tools=dependency_tools()) as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
response = compat_client.responses.create(
input=case.input,
model=text_model_id,
tools=tools,
)
# Verify we have MCP tool calls in the output
mcp_list_tools = [output for output in response.output if output.type == "mcp_list_tools"]
mcp_calls = [output for output in response.output if output.type == "mcp_call"]
message_outputs = [output for output in response.output if output.type == "message"]
# Should have exactly 1 MCP list tools message (at the beginning)
assert len(mcp_list_tools) == 1, f"Expected exactly 1 mcp_list_tools, got {len(mcp_list_tools)}"
assert mcp_list_tools[0].server_label == "localmcp"
assert len(mcp_list_tools[0].tools) == 5 # Updated for dependency tools
expected_tool_names = {
"get_user_id",
"get_user_permissions",
"check_file_access",
"get_experiment_id",
"get_experiment_results",
}
assert {t.name for t in mcp_list_tools[0].tools} == expected_tool_names
assert len(mcp_calls) >= 1, f"Expected at least 1 mcp_call, got {len(mcp_calls)}"
for mcp_call in mcp_calls:
assert mcp_call.error is None, f"MCP call should not have errors, got: {mcp_call.error}"
assert len(message_outputs) >= 1, f"Expected at least 1 message output, got {len(message_outputs)}"
final_message = message_outputs[-1]
assert final_message.role == "assistant", f"Final message should be from assistant, got {final_message.role}"
assert final_message.status == "completed", f"Final message should be completed, got {final_message.status}"
assert len(final_message.content) > 0, "Final message should have content"
expected_output = case.expected
assert expected_output.lower() in response.output_text.lower(), (
f"Expected '{expected_output}' to appear in response: {response.output_text}"
)
@pytest.mark.parametrize("case", multi_turn_tool_execution_streaming_test_cases)
def test_response_streaming_multi_turn_tool_execution(compat_client, text_model_id, case):
"""Test streaming multi-turn tool execution where multiple MCP tool calls are performed in sequence."""
if not isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("in-process MCP server is only supported in library client")
with make_mcp_server(tools=dependency_tools()) as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
stream = compat_client.responses.create(
input=case.input,
model=text_model_id,
tools=tools,
stream=True,
)
chunks = []
for chunk in stream:
chunks.append(chunk)
# Use validator for common streaming checks
validator = StreamingValidator(chunks)
validator.assert_basic_event_sequence()
validator.assert_response_consistency()
validator.assert_has_tool_calls()
validator.assert_has_mcp_events()
validator.assert_rich_streaming()
# Get the final response from the last chunk
final_chunk = chunks[-1]
if hasattr(final_chunk, "response"):
final_response = final_chunk.response
# Verify multi-turn MCP tool execution results
mcp_list_tools = [output for output in final_response.output if output.type == "mcp_list_tools"]
mcp_calls = [output for output in final_response.output if output.type == "mcp_call"]
message_outputs = [output for output in final_response.output if output.type == "message"]
# Should have exactly 1 MCP list tools message (at the beginning)
assert len(mcp_list_tools) == 1, f"Expected exactly 1 mcp_list_tools, got {len(mcp_list_tools)}"
assert mcp_list_tools[0].server_label == "localmcp"
assert len(mcp_list_tools[0].tools) == 5  # dependency_tools() exposes exactly five tools
expected_tool_names = {
"get_user_id",
"get_user_permissions",
"check_file_access",
"get_experiment_id",
"get_experiment_results",
}
assert {t.name for t in mcp_list_tools[0].tools} == expected_tool_names
# Should have at least 1 MCP call (the model should call at least one tool)
assert len(mcp_calls) >= 1, f"Expected at least 1 mcp_call, got {len(mcp_calls)}"
# All MCP calls should be completed (verifies our tool execution works)
for mcp_call in mcp_calls:
assert mcp_call.error is None, f"MCP call should not have errors, got: {mcp_call.error}"
# Should have at least one final message response
assert len(message_outputs) >= 1, f"Expected at least 1 message output, got {len(message_outputs)}"
# Final message should be from assistant and completed
final_message = message_outputs[-1]
assert final_message.role == "assistant", (
f"Final message should be from assistant, got {final_message.role}"
)
assert final_message.status == "completed", f"Final message should be completed, got {final_message.status}"
assert len(final_message.content) > 0, "Final message should have content"
# Check that the expected output appears in the response
expected_output = case.expected
assert expected_output.lower() in final_response.output_text.lower(), (
f"Expected '{expected_output}' to appear in response: {final_response.output_text}"
)

View file

@@ -4,7 +4,6 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import logging
import sys
import time
import uuid
@@ -19,10 +18,10 @@ from llama_stack.apis.post_training import (
LoraFinetuningConfig,
TrainingConfig,
)
from llama_stack.log import get_logger
# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s", force=True)
logger = logging.getLogger(__name__)
logger = get_logger(name=__name__, category="post_training")
skip_because_resource_intensive = pytest.mark.skip(

View file

@@ -0,0 +1,39 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/api/generate",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"raw": true,
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"options": {
"temperature": 0.0
},
"stream": false
},
"endpoint": "/api/generate",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": {
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-11T15:51:18.170868Z",
"done": true,
"done_reason": "stop",
"total_duration": 5240614083,
"load_duration": 9823416,
"prompt_eval_count": 21,
"prompt_eval_duration": 21000000,
"eval_count": 310,
"eval_duration": 5209000000,
"response": "This is the start of a test. I'll provide some sample data and you can try to generate metrics based on it.\n\n**Data:**\n\nLet's say we have a dataset of user interactions with an e-commerce website. The data includes:\n\n| User ID | Product Name | Purchase Date | Quantity | Price |\n| --- | --- | --- | --- | --- |\n| 1 | iPhone 13 | 2022-01-01 | 2 | 999.99 |\n| 1 | MacBook Air | 2022-01-05 | 1 | 1299.99 |\n| 2 | Samsung TV | 2022-01-10 | 3 | 899.99 |\n| 3 | iPhone 13 | 2022-01-15 | 1 | 999.99 |\n| 4 | MacBook Pro | 2022-01-20 | 2 | 1799.99 |\n\n**Task:**\n\nYour task is to generate the following metrics based on this data:\n\n1. Average order value (AOV)\n2. Conversion rate\n3. Average revenue per user (ARPU)\n4. Customer lifetime value (CLV)\n\nPlease provide your answers in a format like this:\n\n| Metric | Value |\n| --- | --- |\n| AOV | 1234.56 |\n| Conversion Rate | 0.25 |\n| ARPU | 1000.00 |\n| CLV | 5000.00 |\n\nGo ahead and generate the metrics!",
"thinking": null,
"context": null
}
},
"is_streaming": false
}
}
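
The fixture above shows the general shape of a recorded interaction: a `request` block (method, URL, headers, body, endpoint, model) paired with a serialized `response` body and an `is_streaming` flag. As a small illustration only (the file path below is hypothetical, not a path from this commit), a recording shaped like this can be inspected with nothing more than the standard library:

```python
import json

# Hypothetical path; any recording shaped like the fixture above will work.
RECORDING_PATH = "recordings/example_recording.json"

with open(RECORDING_PATH) as f:
    recording = json.load(f)

request = recording["request"]
response = recording["response"]

# The endpoint and model identify which provider call was captured.
print(request["endpoint"], request["model"])

# Streamed recordings store a list of chunks; non-streamed ones store a single body.
if response["is_streaming"]:
    print(f"{len(response['body'])} streamed chunks recorded")
else:
    print(response["body"]["__type__"])
```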

View file

@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://0.0.0.0:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"messages": [
{
"role": "user",
"content": "Quick test"
}
],
"max_tokens": 5
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-651",
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"message": {
"content": "I'm ready to help",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755294941,
"model": "llama3.2:3b-instruct-fp16",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 5,
"prompt_tokens": 27,
"total_tokens": 32,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}

View file

@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://0.0.0.0:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"messages": [
{
"role": "user",
"content": "Say hello"
}
],
"max_tokens": 20
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-987",
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"message": {
"content": "Hello! It's nice to meet you. Is there something I can help you with or would you",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755294921,
"model": "llama3.2:3b-instruct-fp16",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 20,
"prompt_tokens": 27,
"total_tokens": 47,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}

View file

@@ -14,7 +14,7 @@
"models": [
{
"model": "nomic-embed-text:latest",
"modified_at": "2025-08-05T14:04:07.946926-07:00",
"modified_at": "2025-08-18T12:47:56.732989-07:00",
"digest": "0a109f422b47e3a30ba2b10eca18548e944e8a23073ee3f3e947efcf3c45e59f",
"size": 274302450,
"details": {

View file

@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "OpenAI test 0"
}
],
"stream": false
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-843",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "I don't have any information about an \"OpenAI test 0\". It's possible that you may be referring to a specific experiment or task being performed by OpenAI, but without more context, I can only speculate.\n\nHowever, I can tell you that OpenAI is a research organization that has been involved in various projects and tests related to artificial intelligence. If you could provide more context or clarify what you're referring to, I may be able to help further.\n\nIf you're looking for general information about OpenAI, I can try to provide some background on the organization:\n\nOpenAI is a non-profit research organization that was founded in 2015 with the goal of developing and applying advanced artificial intelligence to benefit humanity. The organization has made significant contributions to the field of AI, including the development of the popular language model, ChatGPT.\n\nIf you could provide more context or clarify what you're looking for, I'll do my best to assist you.",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755891518,
"model": "llama3.2:3b",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 194,
"prompt_tokens": 30,
"total_tokens": 224,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}

View file

@@ -21,7 +21,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.141947Z",
"created_at": "2025-08-15T20:24:49.18651486Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -39,7 +39,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.194979Z",
"created_at": "2025-08-15T20:24:49.370611348Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -57,7 +57,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.248312Z",
"created_at": "2025-08-15T20:24:49.557000029Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -75,7 +75,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.301911Z",
"created_at": "2025-08-15T20:24:49.746777116Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -93,7 +93,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.354437Z",
"created_at": "2025-08-15T20:24:49.942233333Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -111,7 +111,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.406821Z",
"created_at": "2025-08-15T20:24:50.126788846Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -129,7 +129,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.457633Z",
"created_at": "2025-08-15T20:24:50.311346131Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -147,7 +147,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.507857Z",
"created_at": "2025-08-15T20:24:50.501507173Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -165,7 +165,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.558847Z",
"created_at": "2025-08-15T20:24:50.692296777Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -183,7 +183,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.609969Z",
"created_at": "2025-08-15T20:24:50.878846539Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -201,15 +201,15 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.660997Z",
"created_at": "2025-08-15T20:24:51.063200561Z",
"done": true,
"done_reason": "stop",
"total_duration": 715356542,
"load_duration": 59747500,
"total_duration": 33982453650,
"load_duration": 2909001805,
"prompt_eval_count": 341,
"prompt_eval_duration": 128000000,
"prompt_eval_duration": 29194357307,
"eval_count": 11,
"eval_duration": 526000000,
"eval_duration": 1878247732,
"response": "",
"thinking": null,
"context": null

View file

@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "OpenAI test 1"
}
],
"stream": false
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-726",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "I'm ready to help with the test. What language would you like to use? Would you like to have a conversation, ask questions, or take a specific type of task?",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755891519,
"model": "llama3.2:3b",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 37,
"prompt_tokens": 30,
"total_tokens": 67,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}

View file

@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "OpenAI test 4"
}
],
"stream": false
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-581",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "I'm ready to help. What would you like to test? We could try a variety of things, such as:\n\n1. Conversational dialogue\n2. Language understanding\n3. Common sense reasoning\n4. Joke or pun generation\n5. Trivia or knowledge-based questions\n6. Creative writing or storytelling\n7. Summarization or paraphrasing\n\nLet me know which area you'd like to test, or suggest something else that's on your mind!",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755891527,
"model": "llama3.2:3b",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 96,
"prompt_tokens": 30,
"total_tokens": 126,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}

File diff suppressed because it is too large

View file

@@ -0,0 +1,203 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/api/generate",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"raw": true,
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGive me a sentence that contains the word: hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"options": {
"temperature": 0.0
},
"stream": true
},
"endpoint": "/api/generate",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": [
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.267146Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": "Hello",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.309006Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": ",",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.351179Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " how",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.393262Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " can",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.436079Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " I",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.478393Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " assist",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.520608Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " you",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.562885Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " today",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.604683Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": "?",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.646586Z",
"done": true,
"done_reason": "stop",
"total_duration": 1011323917,
"load_duration": 76575458,
"prompt_eval_count": 31,
"prompt_eval_duration": 553259250,
"eval_count": 10,
"eval_duration": 380302792,
"response": "",
"thinking": null,
"context": null
}
}
],
"is_streaming": true
}
}

View file

@@ -0,0 +1,39 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/api/generate",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"raw": true,
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 0<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"options": {
"temperature": 0.0
},
"stream": false
},
"endpoint": "/api/generate",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": {
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-11T15:51:12.918723Z",
"done": true,
"done_reason": "stop",
"total_duration": 8868987792,
"load_duration": 2793275292,
"prompt_eval_count": 21,
"prompt_eval_duration": 250000000,
"eval_count": 344,
"eval_duration": 5823000000,
"response": "Here are some common test metrics used to evaluate the performance of a system:\n\n1. **Accuracy**: The proportion of correct predictions or classifications out of total predictions made.\n2. **Precision**: The ratio of true positives (correctly predicted instances) to the sum of true positives and false positives (incorrectly predicted instances).\n3. **Recall**: The ratio of true positives to the sum of true positives and false negatives (missed instances).\n4. **F1-score**: The harmonic mean of precision and recall, providing a balanced measure of both.\n5. **Mean Squared Error (MSE)**: The average squared difference between predicted and actual values.\n6. **Mean Absolute Error (MAE)**: The average absolute difference between predicted and actual values.\n7. **Root Mean Squared Percentage Error (RMSPE)**: The square root of the mean of the squared percentage differences between predicted and actual values.\n8. **Coefficient of Determination (R-squared, R2)**: Measures how well a model fits the data, with higher values indicating better fit.\n9. **Mean Absolute Percentage Error (MAPE)**: The average absolute percentage difference between predicted and actual values.\n10. **Normalized Mean Squared Error (NMSE)**: Similar to MSE, but normalized by the mean of the actual values.\n\nThese metrics can be used for various types of data, including:\n\n* Regression problems (e.g., predicting continuous values)\n* Classification problems (e.g., predicting categorical labels)\n* Time series forecasting\n* Clustering and dimensionality reduction\n\nWhen choosing a metric, consider the specific problem you're trying to solve, the type of data, and the desired level of precision.",
"thinking": null,
"context": null
}
},
"is_streaming": false
}
}

View file

@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "OpenAI test 3"
}
],
"stream": false
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-48",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "I'm happy to help, but it seems you want me to engage in a basic conversation as OpenAI's new chat model, right? I can do that!\n\nHere's my response:\n\nHello! How are you today? Is there something specific on your mind that you'd like to talk about or any particular topic you'd like to explore together?\n\nWhat is it that you're curious about?",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755891524,
"model": "llama3.2:3b",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 80,
"prompt_tokens": 30,
"total_tokens": 110,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}

View file

@@ -0,0 +1,39 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/api/generate",
"headers": {},
"body": {
"model": "llama3.2:3b",
"raw": true,
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 0<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"options": {
"temperature": 0.0
},
"stream": false
},
"endpoint": "/api/generate",
"model": "llama3.2:3b"
},
"response": {
"body": {
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b",
"created_at": "2025-08-11T15:56:06.703788Z",
"done": true,
"done_reason": "stop",
"total_duration": 2722294000,
"load_duration": 9736083,
"prompt_eval_count": 21,
"prompt_eval_duration": 113000000,
"eval_count": 324,
"eval_duration": 2598000000,
"response": "Here are some test metrics that can be used to evaluate the performance of a system:\n\n1. **Accuracy**: The proportion of correct predictions made by the model.\n2. **Precision**: The ratio of true positives (correctly predicted instances) to total positive predictions.\n3. **Recall**: The ratio of true positives to the sum of true positives and false negatives (missed instances).\n4. **F1-score**: The harmonic mean of precision and recall, providing a balanced measure of both.\n5. **Mean Squared Error (MSE)**: The average squared difference between predicted and actual values.\n6. **Mean Absolute Error (MAE)**: The average absolute difference between predicted and actual values.\n7. **Root Mean Squared Percentage Error (RMSPE)**: A variation of MSE that expresses the error as a percentage.\n8. **Coefficient of Determination (R-squared, R2)**: Measures how well the model explains the variance in the data.\n9. **Mean Absolute Percentage Error (MAPE)**: The average absolute percentage difference between predicted and actual values.\n10. **Mean Squared Logarithmic Error (MSLE)**: A variation of MSE that is more suitable for skewed distributions.\n\nThese metrics can be used to evaluate different aspects of a system's performance, such as:\n\n* Classification models: accuracy, precision, recall, F1-score\n* Regression models: MSE, MAE, RMSPE, R2, MSLE\n* Time series forecasting: MAPE, RMSPE\n\nNote that the choice of metric depends on the specific problem and data.",
"thinking": null,
"context": null
}
},
"is_streaming": false
}
}

View file

@@ -13,12 +13,12 @@
"__data__": {
"models": [
{
"model": "llama3.2:3b",
"name": "llama3.2:3b",
"digest": "a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72",
"expires_at": "2025-08-06T15:57:21.573326-04:00",
"size": 4030033920,
"size_vram": 4030033920,
"model": "llama3.2:3b-instruct-fp16",
"name": "llama3.2:3b-instruct-fp16",
"digest": "195a8c01d91ec3cb1e0aad4624a51f2602c51fa7d96110f8ab5a20c84081804d",
"expires_at": "2025-08-18T13:47:44.262256-07:00",
"size": 7919570944,
"size_vram": 7919570944,
"details": {
"parent_model": "",
"format": "gguf",
@@ -27,7 +27,7 @@
"llama"
],
"parameter_size": "3.2B",
"quantization_level": "Q4_K_M"
"quantization_level": "F16"
}
}
]

View file

@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b",
"messages": [
{
"role": "user",
"content": "OpenAI test 2"
}
],
"stream": false
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-516",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "I'm happy to help with your question or task. Please go ahead and ask me anything, and I'll do my best to assist you.\n\nNote: I'll be using the latest version of my knowledge cutoff, which is December 2023.\n\nAlso, please keep in mind that I'm a large language model, I can provide information on a broad range of topics, including science, history, technology, culture, and more. However, my ability to understand and respond to specific questions or requests may be limited by the data I've been trained on.",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755891522,
"model": "llama3.2:3b",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 113,
"prompt_tokens": 30,
"total_tokens": 143,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}

View file

@@ -0,0 +1,109 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"messages": [
{
"role": "user",
"content": "What's the weather in Tokyo? YOU MUST USE THE get_weather function to get the weather."
}
],
"response_format": {
"type": "text"
},
"stream": true,
"tools": [
{
"type": "function",
"function": {
"type": "function",
"name": "get_weather",
"description": "Get the weather in a given city",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city to get the weather for"
}
}
},
"strict": null
}
}
]
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": [
{
"__type__": "openai.types.chat.chat_completion_chunk.ChatCompletionChunk",
"__data__": {
"id": "chatcmpl-620",
"choices": [
{
"delta": {
"content": "",
"function_call": null,
"refusal": null,
"role": "assistant",
"tool_calls": [
{
"index": 0,
"id": "call_490d5ur7",
"function": {
"arguments": "{\"city\":\"Tokyo\"}",
"name": "get_weather"
},
"type": "function"
}
]
},
"finish_reason": null,
"index": 0,
"logprobs": null
}
],
"created": 1755228972,
"model": "llama3.2:3b-instruct-fp16",
"object": "chat.completion.chunk",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": null
}
},
{
"__type__": "openai.types.chat.chat_completion_chunk.ChatCompletionChunk",
"__data__": {
"id": "chatcmpl-620",
"choices": [
{
"delta": {
"content": "",
"function_call": null,
"refusal": null,
"role": "assistant",
"tool_calls": null
},
"finish_reason": "tool_calls",
"index": 0,
"logprobs": null
}
],
"created": 1755228972,
"model": "llama3.2:3b-instruct-fp16",
"object": "chat.completion.chunk",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": null
}
}
],
"is_streaming": true
}
}
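
In the streamed recording above, the tool call arrives as a delta on the first chunk, and the stream is closed by a second chunk whose `finish_reason` is `"tool_calls"`. As a hedged sketch (a simplification, not the client library's own accumulation logic), a consumer of chunks shaped like these could stitch the deltas back into complete calls:

```python
import json

def collect_tool_calls(chunks):
    """Accumulate streamed tool-call deltas into complete calls.

    `chunks` is assumed to be an iterable of dicts shaped like the
    recorded ChatCompletionChunk payloads above.
    """
    calls = {}  # index -> {"id": ..., "name": ..., "arguments": ...}
    for chunk in chunks:
        for choice in chunk["choices"]:
            for tc in choice["delta"].get("tool_calls") or []:
                entry = calls.setdefault(tc["index"], {"id": None, "name": "", "arguments": ""})
                entry["id"] = tc.get("id") or entry["id"]
                fn = tc.get("function") or {}
                entry["name"] = fn.get("name") or entry["name"]
                entry["arguments"] += fn.get("arguments") or ""
    # Once the stream ends, the accumulated argument fragments form a JSON document.
    return {i: {**c, "arguments": json.loads(c["arguments"])} for i, c in calls.items()}
```

For the recording above this yields a single `get_weather` call with arguments `{"city": "Tokyo"}`.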

View file

@@ -0,0 +1,39 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/api/generate",
"headers": {},
"body": {
"model": "llama3.2:3b",
"raw": true,
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 2<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"options": {
"temperature": 0.0
},
"stream": false
},
"endpoint": "/api/generate",
"model": "llama3.2:3b"
},
"response": {
"body": {
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b",
"created_at": "2025-08-11T15:56:13.082679Z",
"done": true,
"done_reason": "stop",
"total_duration": 2606245291,
"load_duration": 9979708,
"prompt_eval_count": 21,
"prompt_eval_duration": 23000000,
"eval_count": 321,
"eval_duration": 2572000000,
"response": "Here are some test metrics that can be used to evaluate the performance of a system:\n\n1. **Accuracy**: Measures how close the predicted values are to the actual values.\n2. **Precision**: Measures the proportion of true positives among all positive predictions made by the model.\n3. **Recall**: Measures the proportion of true positives among all actual positive instances.\n4. **F1-score**: The harmonic mean of precision and recall, providing a balanced measure of both.\n5. **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values.\n6. **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual values.\n7. **Root Mean Squared Percentage Error (RMSPE)**: A variation of MSE that expresses errors as a percentage of the actual value.\n8. **Coefficient of Determination (R-squared, R2)**: Measures how well the model explains the variance in the data.\n9. **Mean Absolute Percentage Error (MAPE)**: Measures the average absolute percentage difference between predicted and actual values.\n10. **Mean Squared Logarithmic Error (MSLE)**: A variation of MSE that is more suitable for skewed distributions.\n\nThese metrics can be used to evaluate different aspects of a system's performance, such as:\n\n* Classification models: accuracy, precision, recall, F1-score\n* Regression models: MSE, MAE, RMSPE, R2\n* Time series forecasting: MAPE, MSLE\n\nNote that the choice of metric depends on the specific problem and data.",
"thinking": null,
"context": null
}
},
"is_streaming": false
}
}

View file

@@ -0,0 +1,39 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/api/generate",
"headers": {},
"body": {
"model": "llama3.2:3b",
"raw": true,
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nTest metrics generation 1<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"options": {
"temperature": 0.0
},
"stream": false
},
"endpoint": "/api/generate",
"model": "llama3.2:3b"
},
"response": {
"body": {
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b",
"created_at": "2025-08-11T15:56:10.465932Z",
"done": true,
"done_reason": "stop",
"total_duration": 3745686709,
"load_duration": 9734584,
"prompt_eval_count": 21,
"prompt_eval_duration": 23000000,
"eval_count": 457,
"eval_duration": 3712000000,
"response": "Here are some test metrics that can be used to evaluate the performance of a system:\n\n**Primary Metrics**\n\n1. **Response Time**: The time it takes for the system to respond to a request.\n2. **Throughput**: The number of requests processed by the system per unit time (e.g., requests per second).\n3. **Error Rate**: The percentage of requests that result in an error.\n\n**Secondary Metrics**\n\n1. **Average Response Time**: The average response time for all requests.\n2. **Median Response Time**: The middle value of the response times, used to detect outliers.\n3. **99th Percentile Response Time**: The response time at which 99% of requests are completed within this time.\n4. **Request Latency**: The difference between the request arrival time and the response time.\n\n**User Experience Metrics**\n\n1. **User Satisfaction (USAT)**: Measured through surveys or feedback forms to gauge user satisfaction with the system's performance.\n2. **First Response Time**: The time it takes for a user to receive their first response from the system.\n3. **Time Spent in System**: The total amount of time a user spends interacting with the system.\n\n**System Resource Metrics**\n\n1. **CPU Utilization**: The percentage of CPU resources being used by the system.\n2. **Memory Usage**: The amount of memory being used by the system.\n3. **Disk I/O Wait Time**: The average time spent waiting for disk I/O operations to complete.\n\n**Security Metrics**\n\n1. **Authentication Success Rate**: The percentage of successful authentication attempts.\n2. **Authorization Success Rate**: The percentage of successful authorization attempts.\n3. **Error Rate (Security)**: The percentage of security-related errors.\n\n**Other Metrics**\n\n1. **Page Load Time**: The time it takes for a page to load.\n2. **Click-Through Rate (CTR)**: The percentage of users who click on a link or button after seeing an ad or notification.\n3. **Conversion Rate**: The percentage of users who complete a desired action (e.g., fill out a form, make a purchase).\n\nThese metrics can be used to evaluate the performance and effectiveness of various aspects of your system, from user experience to security and resource utilization.",
"thinking": null,
"context": null
}
},
"is_streaming": false
}
}

View file

@@ -27,10 +27,10 @@ def data_url_from_image(file_path):
@pytest.fixture(scope="session")
def code_scanner_shield_id(available_shields):
if "CodeScanner" in available_shields:
return "CodeScanner"
if "code-scanner" in available_shields:
return "code-scanner"
pytest.skip("CodeScanner shield is not available. Skipping.")
pytest.skip("code-scanner shield is not available. Skipping.")
def test_unsafe_examples(client_with_models, shield_id):
@@ -163,6 +163,60 @@ def test_safety_with_code_scanner(client_with_models, code_scanner_shield_id, mo
assert response.violation.user_message == "Sorry, I found security concerns in the code."
def test_unsafe_input_with_code_scanner_run_moderation(client_with_models, code_scanner_shield_id, model_providers):
code_safety_enabled = len(CODE_SCANNER_ENABLED_PROVIDERS & model_providers) > 0
if not code_safety_enabled:
pytest.skip(f"Testing code scanner shields is not supported for model_providers {model_providers}")
insecure_code = [
"""
Here is an example of code that hashes a password using MD5:
```
import hashlib
password = 'securepassword123'.encode()
hashed_password = hashlib.md5(password).hexdigest()
print("Insecure hashed password:", hashed_password)
```
"""
]
moderation_object = client_with_models.moderations.create(
input=insecure_code,
model=code_scanner_shield_id,
)
assert moderation_object.results[0].flagged is True, f"Code scanner should have flagged {insecure_code} as insecure"
assert all(value is True for value in moderation_object.results[0].categories.values()), (
"Code scanner shield should have detected code insecure category"
)
def test_safe_input_with_code_scanner_run_moderation(client_with_models, code_scanner_shield_id, model_providers):
code_safety_enabled = len(CODE_SCANNER_ENABLED_PROVIDERS & model_providers) > 0
if not code_safety_enabled:
pytest.skip(f"Testing code scanner shields is not supported for model_providers {model_providers}")
secure_code = [
"""
Extract the first 5 characters from a string:
```
text = "Hello World"
first_five = text[:5]
print(first_five) # Output: "Hello"
# Safe handling for strings shorter than 5 characters
def get_first_five(text):
return text[:5] if text else ""
```
"""
]
moderation_object = client_with_models.moderations.create(
input=secure_code,
model=code_scanner_shield_id,
)
assert moderation_object.results[0].flagged is False, "Code scanner should not have flagged the code as insecure"
# We can use an instance of the LlamaGuard shield to detect attempts to misuse
# the interpreter as this is one of the existing categories it checks for
def test_safety_with_code_interpreter_abuse(client_with_models, shield_id):

View file

@@ -0,0 +1,209 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import time
from datetime import UTC, datetime, timedelta
import pytest
@pytest.fixture(scope="module", autouse=True)
def setup_telemetry_metrics_data(openai_client, client_with_models, text_model_id):
"""Setup fixture that creates telemetry metrics data before tests run."""
# Skip OpenAI tests if running in library mode
if not hasattr(client_with_models, "base_url"):
pytest.skip("OpenAI client tests not supported with library client")
prompt_tokens = []
completion_tokens = []
total_tokens = []
# Create OpenAI completions to generate metrics using the proper OpenAI client
for i in range(5):
response = openai_client.chat.completions.create(
model=text_model_id,
messages=[{"role": "user", "content": f"OpenAI test {i}"}],
stream=False,
)
prompt_tokens.append(response.usage.prompt_tokens)
completion_tokens.append(response.usage.completion_tokens)
total_tokens.append(response.usage.total_tokens)
# Wait for metrics to be logged
start_time = time.time()
while time.time() - start_time < 30:
try:
# Try to query metrics to see if they're available
metrics_response = client_with_models.telemetry.query_metrics(
metric_name="completion_tokens",
start_time=int((datetime.now(UTC) - timedelta(minutes=5)).timestamp()),
)
if len(metrics_response[0].values) > 0:
break
except Exception:
pass
time.sleep(1)
# Wait additional time to ensure all metrics are processed
time.sleep(5)
# Return the token lists for use in tests
return {"prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens, "total_tokens": total_tokens}
@pytest.mark.skip(reason="Skipping this test until client is regenerated")
def test_query_metrics_prompt_tokens(client_with_models, text_model_id, setup_telemetry_metrics_data):
"""Test that prompt_tokens metrics are queryable."""
start_time = int((datetime.now(UTC) - timedelta(minutes=10)).timestamp())
response = client_with_models.telemetry.query_metrics(
metric_name="prompt_tokens",
start_time=start_time,
)
assert isinstance(response, list)
assert isinstance(response[0].values, list), "Should return a list of metric series"
assert response[0].metric == "prompt_tokens"
# Use the actual values from setup instead of hardcoded values
expected_values = setup_telemetry_metrics_data["prompt_tokens"]
assert response[0].values[-1].value in expected_values, (
f"Expected one of {expected_values}, got {response[0].values[-1].value}"
)
@pytest.mark.skip(reason="Skipping this test until client is regenerated")
def test_query_metrics_completion_tokens(client_with_models, text_model_id, setup_telemetry_metrics_data):
"""Test that completion_tokens metrics are queryable."""
start_time = int((datetime.now(UTC) - timedelta(minutes=10)).timestamp())
response = client_with_models.telemetry.query_metrics(
metric_name="completion_tokens",
start_time=start_time,
)
assert isinstance(response, list)
assert isinstance(response[0].values, list), "Should return a list of metric series"
assert response[0].metric == "completion_tokens"
# Use the actual values from setup instead of hardcoded values
expected_values = setup_telemetry_metrics_data["completion_tokens"]
assert response[0].values[-1].value in expected_values, (
f"Expected one of {expected_values}, got {response[0].values[-1].value}"
)
@pytest.mark.skip(reason="Skipping this test until client is regenerated")
def test_query_metrics_total_tokens(client_with_models, text_model_id, setup_telemetry_metrics_data):
"""Test that total_tokens metrics are queryable."""
start_time = int((datetime.now(UTC) - timedelta(minutes=10)).timestamp())
response = client_with_models.telemetry.query_metrics(
metric_name="total_tokens",
start_time=start_time,
)
assert isinstance(response, list)
assert isinstance(response[0].values, list), "Should return a list of metric series"
assert response[0].metric == "total_tokens"
# Use the actual values from setup instead of hardcoded values
expected_values = setup_telemetry_metrics_data["total_tokens"]
assert response[0].values[-1].value in expected_values, (
f"Expected one of {expected_values}, got {response[0].values[-1].value}"
)
@pytest.mark.skip(reason="Skipping this test until client is regenerated")
def test_query_metrics_with_time_range(llama_stack_client, text_model_id):
"""Test that metrics are queryable with time range."""
end_time = int(datetime.now(UTC).timestamp())
start_time = end_time - 600 # 10 minutes ago
response = llama_stack_client.telemetry.query_metrics(
metric_name="prompt_tokens",
start_time=start_time,
end_time=end_time,
)
assert isinstance(response, list)
assert isinstance(response[0].values, list), "Should return a list of metric series"
assert response[0].metric == "prompt_tokens"
@pytest.mark.skip(reason="Skipping this test until client is regenerated")
def test_query_metrics_with_label_matchers(llama_stack_client, text_model_id):
"""Test that metrics are queryable with label matchers."""
start_time = int((datetime.now(UTC) - timedelta(minutes=10)).timestamp())
response = llama_stack_client.telemetry.query_metrics(
metric_name="prompt_tokens",
start_time=start_time,
label_matchers=[{"name": "model_id", "value": text_model_id, "operator": "="}],
)
assert isinstance(response[0].values, list), "Should return a list of metric series"
@pytest.mark.skip(reason="Skipping this test until client is regenerated")
def test_query_metrics_nonexistent_metric(llama_stack_client):
"""Test that querying a nonexistent metric returns empty data."""
start_time = int((datetime.now(UTC) - timedelta(minutes=10)).timestamp())
response = llama_stack_client.telemetry.query_metrics(
metric_name="nonexistent_metric",
start_time=start_time,
)
assert isinstance(response, list), "Should return an empty list for nonexistent metric"
assert len(response) == 0
@pytest.mark.skip(reason="Skipping this test until client is regenerated")
def test_query_metrics_with_granularity(llama_stack_client, text_model_id):
"""Test that metrics are queryable with different granularity levels."""
start_time = int((datetime.now(UTC) - timedelta(minutes=10)).timestamp())
# Test hourly granularity
hourly_response = llama_stack_client.telemetry.query_metrics(
metric_name="total_tokens",
start_time=start_time,
granularity="1h",
)
# Test daily granularity
daily_response = llama_stack_client.telemetry.query_metrics(
metric_name="total_tokens",
start_time=start_time,
granularity="1d",
)
# Test no granularity (raw data points)
raw_response = llama_stack_client.telemetry.query_metrics(
metric_name="total_tokens",
start_time=start_time,
granularity=None,
)
# All should return valid data
assert isinstance(hourly_response[0].values, list), "Hourly granularity should return data"
assert isinstance(daily_response[0].values, list), "Daily granularity should return data"
assert isinstance(raw_response[0].values, list), "No granularity should return data"
# Verify that different granularities produce different aggregation levels
# (The exact number depends on data distribution, but they should be queryable)
assert len(hourly_response[0].values) >= 0, "Hourly granularity should be queryable"
assert len(daily_response[0].values) >= 0, "Daily granularity should be queryable"
assert len(raw_response[0].values) >= 0, "No granularity should be queryable"

View file

@@ -4,7 +4,6 @@
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import logging
import time
import uuid
from io import BytesIO
@@ -15,8 +14,9 @@ from openai import BadRequestError as OpenAIBadRequestError
from llama_stack.apis.vector_io import Chunk
from llama_stack.core.library_client import LlamaStackAsLibraryClient
from llama_stack.log import get_logger
logger = logging.getLogger(__name__)
logger = get_logger(name=__name__, category="vector_io")
def skip_if_provider_doesnt_support_openai_vector_stores(client_with_models):
@@ -57,6 +57,7 @@ def skip_if_provider_doesnt_support_openai_vector_stores_search(client_with_mode
"keyword": [
"inline::sqlite-vec",
"remote::milvus",
"inline::milvus",
],
"hybrid": [
"inline::sqlite-vec",