Merge branch 'main' into milvus/search-modes

Francisco Arceo 2025-08-19 10:20:30 -06:00 committed by GitHub
commit a4b3c009cf
185 changed files with 15509 additions and 8439 deletions


@@ -1,6 +1,20 @@
# Llama Stack Integration Tests
# Integration Testing Guide
We use `pytest` for parameterizing and running tests. You can see all options with:
Integration tests verify complete workflows across different providers using Llama Stack's record-replay system.
## Quick Start
```bash
# Run all integration tests with existing recordings
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
uv run --group test \
pytest -sv tests/integration/ --stack-config=starter
```
## Configuration Options
You can see all options with:
```bash
cd tests/integration
@@ -10,11 +24,11 @@ pytest --help
Here are the most important options:
- `--stack-config`: specify the stack config to use. You have four ways to point to a stack:
- **`server:<config>`** - automatically start a server with the given config (e.g., `server:fireworks`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
- **`server:<config>:<port>`** - same as above but with a custom port (e.g., `server:together:8322`)
- **`server:<config>`** - automatically start a server with the given config (e.g., `server:starter`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
- **`server:<config>:<port>`** - same as above but with a custom port (e.g., `server:starter:8322`)
- a URL which points to a Llama Stack distribution server
- a template (e.g., `starter`) or a path to a `run.yaml` file
- a comma-separated list of api=provider pairs, e.g. `inference=fireworks,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.
- a distribution name (e.g., `starter`) or a path to a `run.yaml` file
- a comma-separated list of api=provider pairs, e.g. `inference=ollama,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.
- `--env`: set environment variables, e.g. `--env KEY=value`. This is a utility option to set environment variables required by various providers; see the example below.
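For example, a hypothetical invocation that combines `--stack-config` and `--env` (the variable name depends on the provider you are testing against):

```bash
# Illustrative only: auto-start a starter server and pass a provider setting via --env
pytest -sv tests/integration/inference/ \
  --stack-config=server:starter \
  --env OLLAMA_URL=http://localhost:11434
```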
Model parameters can be influenced by the following options:
@@ -32,85 +46,139 @@ if no model is specified.
### Testing against a Server
Run all text inference tests by auto-starting a server with the `fireworks` config:
Run all text inference tests by auto-starting a server with the `starter` config:
```bash
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=server:fireworks \
--text-model=meta-llama/Llama-3.1-8B-Instruct
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=server:starter \
--text-model=ollama/llama3.2:3b-instruct-fp16 \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
Run tests with auto-server startup on a custom port:
```bash
pytest -s -v tests/integration/inference/ \
--stack-config=server:together:8322 \
--text-model=meta-llama/Llama-3.1-8B-Instruct
```
Run multiple test suites with auto-server (eliminates manual server management):
```bash
# Auto-start server and run all integration tests
export FIREWORKS_API_KEY=<your_key>
pytest -s -v tests/integration/inference/ tests/integration/safety/ tests/integration/agents/ \
--stack-config=server:fireworks \
--text-model=meta-llama/Llama-3.1-8B-Instruct
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/inference/ \
--stack-config=server:starter:8322 \
--text-model=ollama/llama3.2:3b-instruct-fp16 \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
### Testing with Library Client
Run all text inference tests with the `starter` distribution using the `together` provider:
The library client constructs the Stack "in-process" instead of using a server. This is useful during iterative development since you don't need to constantly start and stop servers.
You can do this by simply using `--stack-config=starter` instead of `--stack-config=server:starter`.
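For example, the text inference run from the server section above can be executed in-process with the same model flags (a sketch assuming a local Ollama at the usual URL):

```bash
OLLAMA_URL=http://localhost:11434 \
pytest -s -v tests/integration/inference/test_text_inference.py \
  --stack-config=starter \
  --text-model=ollama/llama3.2:3b-instruct-fp16 \
  --embedding-model=sentence-transformers/all-MiniLM-L6-v2
```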
### Using ad-hoc distributions
Sometimes you may want to compose a distribution on the fly. This is useful for testing a single provider, a single API, or a small combination of providers. You can do so by passing a comma-separated list of api=provider pairs to the `--stack-config` option, e.g. `inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference`.
```bash
ENABLE_TOGETHER=together pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=starter \
--text-model=meta-llama/Llama-3.1-8B-Instruct
```
Run all text inference tests with the `starter` distribution using the `together` provider and `meta-llama/Llama-3.1-8B-Instruct`:
```bash
ENABLE_TOGETHER=together pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=starter \
--text-model=meta-llama/Llama-3.1-8B-Instruct
```
Running all inference tests for a number of models using the `together` provider:
```bash
TEXT_MODELS=meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct
VISION_MODELS=meta-llama/Llama-3.2-11B-Vision-Instruct
EMBEDDING_MODELS=all-MiniLM-L6-v2
ENABLE_TOGETHER=together
export TOGETHER_API_KEY=<together_api_key>
pytest -s -v tests/integration/inference/ \
--stack-config=together \
--stack-config=inference=remote::ollama,safety=inline::llama-guard,agents=inline::meta-reference \
--text-model=$TEXT_MODELS \
--vision-model=$VISION_MODELS \
--embedding-model=$EMBEDDING_MODELS
```
Same thing but instead of using the distribution, use an adhoc stack with just one provider (`fireworks` for inference):
Another example: Running Vector IO tests for embedding models:
```bash
export FIREWORKS_API_KEY=<fireworks_api_key>
pytest -s -v tests/integration/inference/ \
--stack-config=inference=fireworks \
--text-model=$TEXT_MODELS \
--vision-model=$VISION_MODELS \
--embedding-model=$EMBEDDING_MODELS
```
Running Vector IO tests for a number of embedding models:
```bash
EMBEDDING_MODELS=all-MiniLM-L6-v2
pytest -s -v tests/integration/vector_io/ \
--stack-config=inference=sentence-transformers,vector_io=sqlite-vec \
--embedding-model=$EMBEDDING_MODELS
--stack-config=inference=inline::sentence-transformers,vector_io=inline::sqlite-vec \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2
```
## Recording Modes
The testing system supports three modes controlled by environment variables:
### LIVE Mode (Default)
Tests make real API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=live pytest tests/integration/
```
### RECORD Mode
Captures API interactions for later replay:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
pytest tests/integration/inference/test_new_feature.py
```
### REPLAY Mode
Uses cached responses instead of making API calls:
```bash
LLAMA_STACK_TEST_INFERENCE_MODE=replay \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
pytest tests/integration/
```
Note that, for now, you must specify the recording directory. This is because different tests use different recording directories and we don't (yet) have a foolproof way to map a test to its recording directory. We are working on this.
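If you prefer not to prefix every command, you can export the two variables once per shell; for example, to run the replay suite:

```bash
export LLAMA_STACK_TEST_INFERENCE_MODE=replay
export LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings
pytest -sv tests/integration/ --stack-config=starter
```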
## Managing Recordings
### Viewing Recordings
```bash
# See what's recorded
sqlite3 recordings/index.sqlite "SELECT endpoint, model, timestamp FROM recordings;"
# Inspect specific response
cat recordings/responses/abc123.json | jq '.'
```
### Re-recording Tests
#### Remote Re-recording (Recommended)
Use the automated workflow script for easier re-recording:
```bash
./scripts/github/schedule-record-workflow.sh --test-subdirs "inference,agents"
```
See the [main testing guide](../README.md#remote-re-recording-recommended) for full details.
#### Local Re-recording
```bash
# Re-record specific tests
LLAMA_STACK_TEST_INFERENCE_MODE=record \
LLAMA_STACK_TEST_RECORDING_DIR=tests/integration/recordings \
pytest -s -v --stack-config=server:starter tests/integration/inference/test_modified.py
```
Note that when re-recording tests, you must use a Stack pointing to a server (i.e., `server:starter`). This subtlety exists because the set of tests run against a server is a superset of the set run with the library client.
## Writing Tests
### Basic Test Pattern
```python
def test_basic_completion(llama_stack_client, text_model_id):
response = llama_stack_client.inference.completion(
model_id=text_model_id,
content=CompletionMessage(role="user", content="Hello"),
)
# Test structure, not AI output quality
assert response.completion_message is not None
assert isinstance(response.completion_message.content, str)
assert len(response.completion_message.content) > 0
```
### Provider-Specific Tests
```python
def test_asymmetric_embeddings(llama_stack_client, embedding_model_id):
if embedding_model_id not in MODELS_SUPPORTING_TASK_TYPE:
pytest.skip(f"Model {embedding_model_id} doesn't support task types")
query_response = llama_stack_client.inference.embeddings(
model_id=embedding_model_id,
contents=["What is machine learning?"],
task_type="query",
)
assert query_response.embeddings is not None
```


@@ -133,24 +133,15 @@ def test_agent_simple(llama_stack_client, agent_config):
assert "I can't" in logs_str
@pytest.mark.skip(reason="this test was disabled for a long time, and now has turned flaky")
def test_agent_name(llama_stack_client, text_model_id):
agent_name = f"test-agent-{uuid4()}"
try:
agent = Agent(
llama_stack_client,
model=text_model_id,
instructions="You are a helpful assistant",
name=agent_name,
)
except TypeError:
agent = Agent(
llama_stack_client,
model=text_model_id,
instructions="You are a helpful assistant",
)
return
agent = Agent(
llama_stack_client,
model=text_model_id,
instructions="You are a helpful assistant",
name=agent_name,
)
session_id = agent.create_session(f"test-session-{uuid4()}")
agent.create_turn(


@@ -0,0 +1,5 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.


@@ -0,0 +1,122 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""Shared pytest fixtures for batch tests."""
import json
import time
import warnings
from contextlib import contextmanager
from io import BytesIO
import pytest
from llama_stack.apis.files import OpenAIFilePurpose
class BatchHelper:
"""Helper class for creating and managing batch input files."""
def __init__(self, client):
"""Initialize with either a batch_client or openai_client."""
self.client = client
@contextmanager
def create_file(self, content: str | list[dict], filename_prefix="batch_input"):
"""Context manager for creating and cleaning up batch input files.
Args:
content: Either a list of batch request dictionaries or raw string content
filename_prefix: Prefix for the generated filename (or full filename if content is string)
Yields:
The uploaded file object
"""
if isinstance(content, str):
# Handle raw string content (e.g., malformed JSONL, empty files)
file_content = content.encode("utf-8")
else:
# Handle list of batch request dictionaries
jsonl_content = "\n".join(json.dumps(req) for req in content)
file_content = jsonl_content.encode("utf-8")
filename = filename_prefix if filename_prefix.endswith(".jsonl") else f"{filename_prefix}.jsonl"
with BytesIO(file_content) as file_buffer:
file_buffer.name = filename
uploaded_file = self.client.files.create(file=file_buffer, purpose=OpenAIFilePurpose.BATCH)
try:
yield uploaded_file
finally:
try:
self.client.files.delete(uploaded_file.id)
except Exception:
warnings.warn(
f"Failed to cleanup file {uploaded_file.id}: {uploaded_file.filename}",
stacklevel=2,
)
def wait_for(
self,
batch_id: str,
max_wait_time: int = 60,
sleep_interval: int | None = None,
expected_statuses: set[str] | None = None,
timeout_action: str = "fail",
):
"""Wait for a batch to reach a terminal status.
Args:
batch_id: The batch ID to monitor
max_wait_time: Maximum time to wait in seconds (default: 60 seconds)
sleep_interval: Time to sleep between checks in seconds (default: 1/10th of max_wait_time, min 1s, max 15s)
expected_statuses: Set of expected terminal statuses (default: {"completed"})
timeout_action: Action on timeout - "fail" (pytest.fail) or "skip" (pytest.skip)
Returns:
The final batch object
Raises:
pytest.Failed: If batch reaches an unexpected status or timeout_action is "fail"
pytest.Skipped: If timeout_action is "skip" on timeout or unexpected status
"""
if sleep_interval is None:
# Default to 1/10th of max_wait_time, with min 1s and max 15s
sleep_interval = max(1, min(15, max_wait_time // 10))
if expected_statuses is None:
expected_statuses = {"completed"}
terminal_statuses = {"completed", "failed", "cancelled", "expired"}
unexpected_statuses = terminal_statuses - expected_statuses
start_time = time.time()
while time.time() - start_time < max_wait_time:
current_batch = self.client.batches.retrieve(batch_id)
if current_batch.status in expected_statuses:
return current_batch
elif current_batch.status in unexpected_statuses:
error_msg = f"Batch reached unexpected status: {current_batch.status}"
if timeout_action == "skip":
pytest.skip(error_msg)
else:
pytest.fail(error_msg)
time.sleep(sleep_interval)
timeout_msg = f"Batch did not reach expected status {expected_statuses} within {max_wait_time} seconds"
if timeout_action == "skip":
pytest.skip(timeout_msg)
else:
pytest.fail(timeout_msg)
@pytest.fixture
def batch_helper(openai_client):
"""Fixture that provides a BatchHelper instance for OpenAI client."""
return BatchHelper(openai_client)
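# --- Illustrative usage (editorial sketch, not part of this module or this change) ---
# A minimal example of how a test might use the batch_helper fixture, mirroring the
# patterns in the batch tests added later in this commit. `openai_client` and
# `text_model_id` are assumed to be fixtures provided elsewhere; the leading underscore
# keeps pytest from collecting this sketch as a real test.
def _example_batch_smoke_test(openai_client, batch_helper, text_model_id):
    requests = [
        {
            "custom_id": "smoke-1",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {"model": text_model_id, "messages": [{"role": "user", "content": "Hi"}], "max_tokens": 5},
        }
    ]
    # create_file uploads a JSONL input file and deletes it again when the block exits
    with batch_helper.create_file(requests, "smoke_batch_input") as uploaded_file:
        batch = openai_client.batches.create(
            input_file_id=uploaded_file.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )
        # Poll until the batch completes, skipping the test on timeout
        final_batch = batch_helper.wait_for(batch.id, max_wait_time=120, timeout_action="skip")
        assert final_batch.status == "completed"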


@@ -0,0 +1,270 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""
Integration tests for the Llama Stack batch processing functionality.
This module contains comprehensive integration tests for the batch processing API,
using the OpenAI-compatible client interface for consistency.
Test Categories:
1. Core Batch Operations:
- test_batch_creation_and_retrieval: Comprehensive batch creation, structure validation, and retrieval
- test_batch_listing: Basic batch listing functionality
- test_batch_immediate_cancellation: Batch cancellation workflow
# TODO: cancel during processing
2. End-to-End Processing:
- test_batch_e2e_chat_completions: Full chat completions workflow with output and error validation
Note: Error conditions and edge cases are primarily tested in test_batches_errors.py
for better organization and separation of concerns.
CLEANUP WARNING: These tests currently create batches that are not automatically
cleaned up after test completion. This may lead to resource accumulation over
multiple test runs. Only test_batch_immediate_cancellation properly cancels its batch.
The test_batch_e2e_chat_completions test does clean up its output and error files.
"""
import json
class TestBatchesIntegration:
"""Integration tests for the batches API."""
def test_batch_creation_and_retrieval(self, openai_client, batch_helper, text_model_id):
"""Test comprehensive batch creation and retrieval scenarios."""
test_metadata = {
"test_type": "comprehensive",
"purpose": "creation_and_retrieval_test",
"version": "1.0",
"tags": "test,batch",
}
batch_requests = [
{
"custom_id": "request-1",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests, "batch_creation_test") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
metadata=test_metadata,
)
assert batch.endpoint == "/v1/chat/completions"
assert batch.input_file_id == uploaded_file.id
assert batch.completion_window == "24h"
assert batch.metadata == test_metadata
retrieved_batch = openai_client.batches.retrieve(batch.id)
assert retrieved_batch.id == batch.id
assert retrieved_batch.object == batch.object
assert retrieved_batch.endpoint == batch.endpoint
assert retrieved_batch.input_file_id == batch.input_file_id
assert retrieved_batch.completion_window == batch.completion_window
assert retrieved_batch.metadata == batch.metadata
def test_batch_listing(self, openai_client, batch_helper, text_model_id):
"""
Test batch listing.
This test creates multiple batches and verifies that they can be listed.
It also deletes the input files before execution, which means the batches
will appear as failed due to missing input files. This is expected and
a good thing, because it means no inference is performed.
"""
batch_ids = []
for i in range(2):
batch_requests = [
{
"custom_id": f"request-{i}",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": f"Hello {i}"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests, f"batch_input_{i}") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
batch_ids.append(batch.id)
batch_list = openai_client.batches.list()
assert isinstance(batch_list.data, list)
listed_batch_ids = {b.id for b in batch_list.data}
for batch_id in batch_ids:
assert batch_id in listed_batch_ids
def test_batch_immediate_cancellation(self, openai_client, batch_helper, text_model_id):
"""Test immediate batch cancellation."""
batch_requests = [
{
"custom_id": "request-1",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
# hopefully cancel the batch before it completes
cancelling_batch = openai_client.batches.cancel(batch.id)
assert cancelling_batch.status in ["cancelling", "cancelled"]
assert isinstance(cancelling_batch.cancelling_at, int), (
f"cancelling_at should be int, got {type(cancelling_batch.cancelling_at)}"
)
final_batch = batch_helper.wait_for(
batch.id,
max_wait_time=3 * 60, # often takes 10-11 minutes, give it 3 min
expected_statuses={"cancelled"},
timeout_action="skip",
)
assert final_batch.status == "cancelled"
assert isinstance(final_batch.cancelled_at, int), (
f"cancelled_at should be int, got {type(final_batch.cancelled_at)}"
)
def test_batch_e2e_chat_completions(self, openai_client, batch_helper, text_model_id):
"""Test end-to-end batch processing for chat completions with both successful and failed operations."""
batch_requests = [
{
"custom_id": "success-1",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 20,
},
},
{
"custom_id": "error-1",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"rolez": "user", "contentz": "This should fail"}], # Invalid keys to trigger error
# note: ollama does not validate max_tokens values or the "role" key, so they won't trigger an error
},
},
]
with batch_helper.create_file(batch_requests) as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={"test": "e2e_success_and_errors_test"},
)
final_batch = batch_helper.wait_for(
batch.id,
max_wait_time=3 * 60, # often takes 2-3 minutes
expected_statuses={"completed"},
timeout_action="skip",
)
# Expecting a completed batch with both successful and failed requests
# Batch(id='batch_xxx',
# completion_window='24h',
# created_at=...,
# endpoint='/v1/chat/completions',
# input_file_id='file-xxx',
# object='batch',
# status='completed',
# output_file_id='file-xxx',
# error_file_id='file-xxx',
# request_counts=BatchRequestCounts(completed=1, failed=1, total=2))
assert final_batch.status == "completed"
assert final_batch.request_counts is not None
assert final_batch.request_counts.total == 2
assert final_batch.request_counts.completed == 1
assert final_batch.request_counts.failed == 1
assert final_batch.output_file_id is not None, "Output file should exist for successful requests"
output_content = openai_client.files.content(final_batch.output_file_id)
if isinstance(output_content, str):
output_text = output_content
else:
output_text = output_content.content.decode("utf-8")
output_lines = output_text.strip().split("\n")
for line in output_lines:
result = json.loads(line)
assert "id" in result
assert "custom_id" in result
assert result["custom_id"] == "success-1"
assert "response" in result
assert result["response"]["status_code"] == 200
assert "body" in result["response"]
assert "choices" in result["response"]["body"]
assert final_batch.error_file_id is not None, "Error file should exist for failed requests"
error_content = openai_client.files.content(final_batch.error_file_id)
if isinstance(error_content, str):
error_text = error_content
else:
error_text = error_content.content.decode("utf-8")
error_lines = error_text.strip().split("\n")
for line in error_lines:
result = json.loads(line)
assert "id" in result
assert "custom_id" in result
assert result["custom_id"] == "error-1"
assert "error" in result
error = result["error"]
assert error is not None
assert "code" in error or "message" in error, "Error should have code or message"
deleted_output_file = openai_client.files.delete(final_batch.output_file_id)
assert deleted_output_file.deleted, f"Output file {final_batch.output_file_id} was not deleted successfully"
deleted_error_file = openai_client.files.delete(final_batch.error_file_id)
assert deleted_error_file.deleted, f"Error file {final_batch.error_file_id} was not deleted successfully"


@@ -0,0 +1,693 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
"""
Error handling and edge case tests for the Llama Stack batch processing functionality.
This module focuses exclusively on testing error conditions, validation failures,
and edge cases for batch operations to ensure robust error handling and graceful
degradation.
Test Categories:
1. File and Input Validation:
- test_batch_nonexistent_file_id: Handling invalid file IDs
- test_batch_malformed_jsonl: Processing malformed JSONL input files
- test_file_malformed_batch_file: Handling malformed files at upload time
- test_batch_missing_required_fields: Validation of required request fields
2. API Endpoint and Model Validation:
- test_batch_invalid_endpoint: Invalid endpoint handling during creation
- test_batch_error_handling_invalid_model: Error handling with nonexistent models
- test_batch_endpoint_mismatch: Validation of endpoint/URL consistency
3. Batch Lifecycle Error Handling:
- test_batch_retrieve_nonexistent: Retrieving non-existent batches
- test_batch_cancel_nonexistent: Cancelling non-existent batches
- test_batch_cancel_completed: Attempting to cancel completed batches
4. Parameter and Configuration Validation:
- test_batch_invalid_completion_window: Invalid completion window values
- test_batch_invalid_metadata_types: Invalid metadata type validation
- test_batch_missing_required_body_fields: Validation of required fields in request body
5. Feature Restriction and Compatibility:
- test_batch_streaming_not_supported: Streaming request rejection
- test_batch_mixed_streaming_requests: Mixed streaming/non-streaming validation
Note: Core functionality and OpenAI compatibility tests are located in
test_batches_integration.py for better organization and separation of concerns.
CLEANUP WARNING: These tests create batches to test error conditions but do not
automatically clean them up after test completion. While most error tests create
batches that fail quickly, some may create valid batches that consume resources.
"""
import pytest
from openai import BadRequestError, ConflictError, NotFoundError
class TestBatchesErrorHandling:
"""Error handling and edge case tests for the batches API using OpenAI client."""
def test_batch_nonexistent_file_id(self, openai_client, batch_helper):
"""Test batch creation with nonexistent input file ID."""
batch = openai_client.batches.create(
input_file_id="file-nonexistent-xyz",
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(
# code='invalid_request',
# line=None,
# message='Cannot find file ..., or organization ... does not have access to it.',
# param='file_id')
# ], object='list'),
# failed_at=1754566971,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.code == "invalid_request"
assert "cannot find file" in error.message.lower()
def test_batch_invalid_endpoint(self, openai_client, batch_helper, text_model_id):
"""Test batch creation with invalid endpoint."""
batch_requests = [
{
"custom_id": "invalid-endpoint",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
with pytest.raises(BadRequestError) as exc_info:
openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/invalid/endpoint",
completion_window="24h",
)
# Expected -
# Error code: 400 - {
# 'error': {
# 'message': "Invalid value: '/v1/invalid/endpoint'. Supported values are: '/v1/chat/completions', '/v1/completions', '/v1/embeddings', and '/v1/responses'.",
# 'type': 'invalid_request_error',
# 'param': 'endpoint',
# 'code': 'invalid_value'
# }
# }
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 400
assert "invalid value" in error_msg
assert "/v1/invalid/endpoint" in error_msg
assert "supported values" in error_msg
assert "endpoint" in error_msg
assert "invalid_value" in error_msg
def test_batch_malformed_jsonl(self, openai_client, batch_helper):
"""
Test batch with malformed JSONL input.
The /v1/files endpoint requires valid JSONL format, so we provide a well formed line
before a malformed line to ensure we get to the /v1/batches validation stage.
"""
with batch_helper.create_file(
"""{"custom_id": "valid", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "test"}}
{invalid json here""",
"malformed_batch_input.jsonl",
) as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# ...,
# BatchError(code='invalid_json_line',
# line=2,
# message='This line is not parseable as valid JSON.',
# param=None)
# ], object='list'),
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) > 0
error = final_batch.errors.data[-1] # get last error because first may be about the "test" model
assert error.code == "invalid_json_line"
assert error.line == 2
assert "not" in error.message.lower()
assert "valid json" in error.message.lower()
@pytest.mark.xfail(reason="Not all file providers validate content")
@pytest.mark.parametrize("batch_requests", ["", "{malformed json"], ids=["empty", "malformed"])
def test_file_malformed_batch_file(self, openai_client, batch_helper, batch_requests):
"""Test file upload with malformed content."""
with pytest.raises(BadRequestError) as exc_info:
with batch_helper.create_file(batch_requests, "malformed_batch_input_file.jsonl"):
# /v1/files rejects the file, we don't get to batch creation
pass
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 400
assert "invalid file format" in error_msg
assert "jsonl" in error_msg
def test_batch_retrieve_nonexistent(self, openai_client):
"""Test retrieving nonexistent batch."""
with pytest.raises(NotFoundError) as exc_info:
openai_client.batches.retrieve("batch-nonexistent-xyz")
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 404
assert "no batch found" in error_msg or "not found" in error_msg
def test_batch_cancel_nonexistent(self, openai_client):
"""Test cancelling nonexistent batch."""
with pytest.raises(NotFoundError) as exc_info:
openai_client.batches.cancel("batch-nonexistent-xyz")
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 404
assert "no batch found" in error_msg or "not found" in error_msg
def test_batch_cancel_completed(self, openai_client, batch_helper, text_model_id):
"""Test cancelling already completed batch."""
batch_requests = [
{
"custom_id": "cancel-completed",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Quick test"}],
"max_tokens": 5,
},
}
]
with batch_helper.create_file(batch_requests, "cancel_test_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(
batch.id,
max_wait_time=3 * 60, # often take 10-11 min, give it 3 min
expected_statuses={"completed"},
timeout_action="skip",
)
deleted_file = openai_client.files.delete(final_batch.output_file_id)
assert deleted_file.deleted, f"File {final_batch.output_file_id} was not deleted successfully"
with pytest.raises(ConflictError) as exc_info:
openai_client.batches.cancel(batch.id)
# Expecting -
# Error code: 409 - {
# 'error': {
# 'message': "Cannot cancel a batch with status 'completed'.",
# 'type': 'invalid_request_error',
# 'param': None,
# 'code': None
# }
# }
#
# NOTE: Same for "failed", cancelling "cancelled" batches is allowed
error_msg = str(exc_info.value).lower()
assert exc_info.value.status_code == 409
assert "cannot cancel" in error_msg
def test_batch_missing_required_fields(self, openai_client, batch_helper, text_model_id):
"""Test batch with requests missing required fields."""
batch_requests = [
{
# Missing custom_id
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "No custom_id"}],
"max_tokens": 10,
},
},
{
"custom_id": "no-method",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "No method"}],
"max_tokens": 10,
},
},
{
"custom_id": "no-url",
"method": "POST",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "No URL"}],
"max_tokens": 10,
},
},
{
"custom_id": "no-body",
"method": "POST",
"url": "/v1/chat/completions",
},
]
with batch_helper.create_file(batch_requests, "missing_fields_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(
# data=[
# BatchError(
# code='missing_required_parameter',
# line=1,
# message="Missing required parameter: 'custom_id'.",
# param='custom_id'
# ),
# BatchError(
# code='missing_required_parameter',
# line=2,
# message="Missing required parameter: 'method'.",
# param='method'
# ),
# BatchError(
# code='missing_required_parameter',
# line=3,
# message="Missing required parameter: 'url'.",
# param='url'
# ),
# BatchError(
# code='missing_required_parameter',
# line=4,
# message="Missing required parameter: 'body'.",
# param='body'
# )
# ], object='list'),
# failed_at=1754566945,
# ...)
# )
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 4
no_custom_id_error = final_batch.errors.data[0]
assert no_custom_id_error.code == "missing_required_parameter"
assert no_custom_id_error.line == 1
assert "missing" in no_custom_id_error.message.lower()
assert "custom_id" in no_custom_id_error.message.lower()
no_method_error = final_batch.errors.data[1]
assert no_method_error.code == "missing_required_parameter"
assert no_method_error.line == 2
assert "missing" in no_method_error.message.lower()
assert "method" in no_method_error.message.lower()
no_url_error = final_batch.errors.data[2]
assert no_url_error.code == "missing_required_parameter"
assert no_url_error.line == 3
assert "missing" in no_url_error.message.lower()
assert "url" in no_url_error.message.lower()
no_body_error = final_batch.errors.data[3]
assert no_body_error.code == "missing_required_parameter"
assert no_body_error.line == 4
assert "missing" in no_body_error.message.lower()
assert "body" in no_body_error.message.lower()
def test_batch_invalid_completion_window(self, openai_client, batch_helper, text_model_id):
"""Test batch creation with invalid completion window."""
batch_requests = [
{
"custom_id": "invalid-completion-window",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
for window in ["1h", "48h", "invalid", ""]:
with pytest.raises(BadRequestError) as exc_info:
openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window=window,
)
assert exc_info.value.status_code == 400
error_msg = str(exc_info.value).lower()
assert "error" in error_msg
assert "completion_window" in error_msg
def test_batch_streaming_not_supported(self, openai_client, batch_helper, text_model_id):
"""Test that streaming responses are not supported in batches."""
batch_requests = [
{
"custom_id": "streaming-test",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
"stream": True, # Not supported
},
}
]
with batch_helper.create_file(batch_requests, "streaming_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(code='streaming_unsupported',
# line=1,
# message='Chat Completions: Streaming is not supported in the Batch API.',
# param='body.stream')
# ], object='list'),
# failed_at=1754566965,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.code == "streaming_unsupported"
assert error.line == 1
assert "streaming" in error.message.lower()
assert "not supported" in error.message.lower()
assert error.param == "body.stream"
assert final_batch.failed_at is not None
def test_batch_mixed_streaming_requests(self, openai_client, batch_helper, text_model_id):
"""
Test batch with mixed streaming and non-streaming requests.
This is distinct from test_batch_streaming_not_supported, which tests a single
streaming request, to ensure an otherwise valid batch fails when a single
streaming request is included.
"""
batch_requests = [
{
"custom_id": "valid-non-streaming-request",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello without streaming"}],
"max_tokens": 10,
},
},
{
"custom_id": "streaming-request",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello with streaming"}],
"max_tokens": 10,
"stream": True, # Not supported
},
},
]
with batch_helper.create_file(batch_requests, "mixed_streaming_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(
# code='streaming_unsupported',
# line=2,
# message='Chat Completions: Streaming is not supported in the Batch API.',
# param='body.stream')
# ], object='list'),
# failed_at=1754574442,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.code == "streaming_unsupported"
assert error.line == 2
assert "streaming" in error.message.lower()
assert "not supported" in error.message.lower()
assert error.param == "body.stream"
assert final_batch.failed_at is not None
def test_batch_endpoint_mismatch(self, openai_client, batch_helper, text_model_id):
"""Test batch creation with mismatched endpoint and request URL."""
batch_requests = [
{
"custom_id": "endpoint-mismatch",
"method": "POST",
"url": "/v1/embeddings", # Different from batch endpoint
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
},
}
]
with batch_helper.create_file(batch_requests, "endpoint_mismatch_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions", # Different from request URL
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(
# code='invalid_url',
# line=1,
# message='The URL provided for this request does not match the batch endpoint.',
# param='url')
# ], object='list'),
# failed_at=1754566972,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.line == 1
assert error.code == "invalid_url"
assert "does not match" in error.message.lower()
assert "endpoint" in error.message.lower()
assert final_batch.failed_at is not None
def test_batch_error_handling_invalid_model(self, openai_client, batch_helper):
"""Test batch error handling with invalid model."""
batch_requests = [
{
"custom_id": "invalid-model",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": "nonexistent-model-xyz",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(code='model_not_found',
# line=1,
# message="The provided model 'nonexistent-model-xyz' is not supported by the Batch API.",
# param='body.model')
# ], object='list'),
# failed_at=1754566978,
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 1
error = final_batch.errors.data[0]
assert error.line == 1
assert error.code == "model_not_found"
assert "not supported" in error.message.lower()
assert error.param == "body.model"
assert final_batch.failed_at is not None
def test_batch_missing_required_body_fields(self, openai_client, batch_helper, text_model_id):
"""Test batch with requests missing required fields in body (model and messages)."""
batch_requests = [
{
"custom_id": "missing-model",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
# Missing model field
"messages": [{"role": "user", "content": "Hello without model"}],
"max_tokens": 10,
},
},
{
"custom_id": "missing-messages",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
# Missing messages field
"max_tokens": 10,
},
},
]
with batch_helper.create_file(batch_requests, "missing_body_fields_batch_input") as uploaded_file:
batch = openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
)
final_batch = batch_helper.wait_for(batch.id, expected_statuses={"failed"})
# Expecting -
# Batch(...,
# status='failed',
# errors=Errors(data=[
# BatchError(
# code='invalid_request',
# line=1,
# message='Model parameter is required.',
# param='body.model'),
# BatchError(
# code='invalid_request',
# line=2,
# message='Messages parameter is required.',
# param='body.messages')
# ], object='list'),
# ...)
assert final_batch.status == "failed"
assert final_batch.errors is not None
assert len(final_batch.errors.data) == 2
model_error = final_batch.errors.data[0]
assert model_error.line == 1
assert "model" in model_error.message.lower()
assert model_error.param == "body.model"
messages_error = final_batch.errors.data[1]
assert messages_error.line == 2
assert "messages" in messages_error.message.lower()
assert messages_error.param == "body.messages"
assert final_batch.failed_at is not None
def test_batch_invalid_metadata_types(self, openai_client, batch_helper, text_model_id):
"""Test batch creation with invalid metadata types (like lists)."""
batch_requests = [
{
"custom_id": "invalid-metadata-type",
"method": "POST",
"url": "/v1/chat/completions",
"body": {
"model": text_model_id,
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 10,
},
}
]
with batch_helper.create_file(batch_requests) as uploaded_file:
with pytest.raises(Exception) as exc_info:
openai_client.batches.create(
input_file_id=uploaded_file.id,
endpoint="/v1/chat/completions",
completion_window="24h",
metadata={
"tags": ["tag1", "tag2"], # Invalid type, should be a string
},
)
# Expecting -
# Error code: 400 - {'error':
# {'message': "Invalid type for 'metadata.tags': expected a string,
# but got an array instead.",
# 'type': 'invalid_request_error', 'param': 'metadata.tags',
# 'code': 'invalid_type'}}
error_msg = str(exc_info.value).lower()
assert "400" in error_msg
assert "tags" in error_msg
assert "string" in error_msg


@@ -5,7 +5,6 @@
# the root directory of this source tree.
import os
import re
from pathlib import Path
import pytest
@@ -48,19 +47,6 @@ def _load_all_verification_configs():
return {"providers": all_provider_configs}
def case_id_generator(case):
"""Generate a test ID from the case's 'case_id' field, or use a default."""
case_id = case.get("case_id")
if isinstance(case_id, str | int):
return re.sub(r"\\W|^(?=\\d)", "_", str(case_id))
return None
# Helper to get the base test name from the request object
def get_base_test_name(request):
return request.node.originalname
# --- End Helper Functions ---


@@ -1,16 +0,0 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from pathlib import Path
import yaml
def load_test_cases(name: str):
fixture_dir = Path(__file__).parent / "test_cases"
yaml_path = fixture_dir / f"{name}.yaml"
with open(yaml_path) as f:
return yaml.safe_load(f)


@@ -0,0 +1,262 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from typing import Any
import pytest
from pydantic import BaseModel
class ResponsesTestCase(BaseModel):
# Input can be a simple string or complex message structure
input: str | list[dict[str, Any]]
expected: str
# Tools as flexible dict structure (gets validated at runtime by the API)
tools: list[dict[str, Any]] | None = None
# Multi-turn conversations with input/output pairs
turns: list[tuple[str | list[dict[str, Any]], str]] | None = None
# File search specific fields
file_content: str | None = None
file_path: str | None = None
# Streaming flag
stream: bool | None = None
# Basic response test cases
basic_test_cases = [
pytest.param(
ResponsesTestCase(
input="Which planet do humans live on?",
expected="earth",
),
id="earth",
),
pytest.param(
ResponsesTestCase(
input="Which planet has rings around it with a name starting with letter S?",
expected="saturn",
),
id="saturn",
),
pytest.param(
ResponsesTestCase(
input=[
{
"role": "user",
"content": [
{
"type": "input_text",
"text": "what teams are playing in this image?",
}
],
},
{
"role": "user",
"content": [
{
"type": "input_image",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/3/3b/LeBron_James_Layup_%28Cleveland_vs_Brooklyn_2018%29.jpg",
}
],
},
],
expected="brooklyn nets",
),
id="image_input",
),
]
# Multi-turn test cases
multi_turn_test_cases = [
pytest.param(
ResponsesTestCase(
input="", # Not used for multi-turn
expected="", # Not used for multi-turn
turns=[
("Which planet do humans live on?", "earth"),
("What is the name of the planet from your previous response?", "earth"),
],
),
id="earth",
),
]
# Web search test cases
web_search_test_cases = [
pytest.param(
ResponsesTestCase(
input="How many experts does the Llama 4 Maverick model have?",
tools=[{"type": "web_search", "search_context_size": "low"}],
expected="128",
),
id="llama_experts",
),
]
# File search test cases
file_search_test_cases = [
pytest.param(
ResponsesTestCase(
input="How many experts does the Llama 4 Maverick model have?",
tools=[{"type": "file_search"}],
expected="128",
file_content="Llama 4 Maverick has 128 experts",
),
id="llama_experts",
),
pytest.param(
ResponsesTestCase(
input="How many experts does the Llama 4 Maverick model have?",
tools=[{"type": "file_search"}],
expected="128",
file_path="pdfs/llama_stack_and_models.pdf",
),
id="llama_experts_pdf",
),
]
# MCP tool test cases
mcp_tool_test_cases = [
pytest.param(
ResponsesTestCase(
input="What is the boiling point of myawesomeliquid in Celsius?",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="Hello, world!",
),
id="boiling_point_tool",
),
]
# Custom tool test cases
custom_tool_test_cases = [
pytest.param(
ResponsesTestCase(
input="What's the weather like in San Francisco?",
tools=[
{
"type": "function",
"name": "get_weather",
"description": "Get current temperature for a given location.",
"parameters": {
"additionalProperties": False,
"properties": {
"location": {
"description": "City and country e.g. Bogotá, Colombia",
"type": "string",
}
},
"required": ["location"],
"type": "object",
},
}
],
expected="", # No specific expected output for custom tools
),
id="sf_weather",
),
]
# Image test cases
image_test_cases = [
pytest.param(
ResponsesTestCase(
input=[
{
"role": "user",
"content": [
{
"type": "input_text",
"text": "Identify the type of animal in this image.",
},
{
"type": "input_image",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg",
},
],
},
],
expected="llama",
),
id="llama_image",
),
]
# Multi-turn image test cases
multi_turn_image_test_cases = [
pytest.param(
ResponsesTestCase(
input="", # Not used for multi-turn
expected="", # Not used for multi-turn
turns=[
(
[
{
"role": "user",
"content": [
{
"type": "input_text",
"text": "What type of animal is in this image? Please respond with a single word that starts with the letter 'L'.",
},
{
"type": "input_image",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg",
},
],
},
],
"llama",
),
(
"What country do you find this animal primarily in? What continent?",
"peru",
),
],
),
id="llama_image_understanding",
),
]
# Multi-turn tool execution test cases
multi_turn_tool_execution_test_cases = [
pytest.param(
ResponsesTestCase(
input="I need to check if user 'alice' can access the file 'document.txt'. First, get alice's user ID, then check if that user ID can access the file 'document.txt'. Do this as a series of steps, where each step is a separate message. Return only one tool call per step. Summarize the final result with a single 'yes' or 'no' response.",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="yes",
),
id="user_file_access_check",
),
pytest.param(
ResponsesTestCase(
input="I need to get the results for the 'boiling_point' experiment. First, get the experiment ID for 'boiling_point', then use that ID to get the experiment results. Tell me the boiling point in Celsius.",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="100°C",
),
id="experiment_results_lookup",
),
]
# Multi-turn tool execution streaming test cases
multi_turn_tool_execution_streaming_test_cases = [
pytest.param(
ResponsesTestCase(
input="Help me with this security check: First, get the user ID for 'charlie', then get the permissions for that user ID, and finally check if that user can access 'secret_file.txt'. Stream your progress as you work through each step. Return only one tool call per step. Summarize the final result with a single 'yes' or 'no' response.",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="no",
stream=True,
),
id="user_permissions_workflow",
),
pytest.param(
ResponsesTestCase(
input="I need a complete analysis: First, get the experiment ID for 'chemical_reaction', then get the results for that experiment, and tell me if the yield was above 80%. Return only one tool call per step. Please stream your analysis process.",
tools=[{"type": "mcp", "server_label": "localmcp", "server_url": "<FILLED_BY_TEST_RUNNER>"}],
expected="85%",
stream=True,
),
id="experiment_analysis_streaming",
),
]
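# --- Illustrative usage (hypothetical, not part of this module) ---
# The pytest.param lists above are intended to be consumed via parametrize; a consuming
# test might look roughly like the sketch below. The `openai_client` and `text_model_id`
# fixtures and the OpenAI Responses API call are assumptions for illustration only.
#
# @pytest.mark.parametrize("case", basic_test_cases)
# def test_response_basic(openai_client, text_model_id, case):
#     response = openai_client.responses.create(model=text_model_id, input=case.input)
#     assert case.expected in response.output_text.lower()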


@@ -1,397 +0,0 @@
test_chat_basic:
test_name: test_chat_basic
test_params:
case:
- case_id: "earth"
input:
messages:
- content: Which planet do humans live on?
role: user
output: Earth
- case_id: "saturn"
input:
messages:
- content: Which planet has rings around it with a name starting with letter
S?
role: user
output: Saturn
test_chat_input_validation:
test_name: test_chat_input_validation
test_params:
case:
- case_id: "messages_missing"
input:
messages: []
output:
error:
status_code: 400
- case_id: "messages_role_invalid"
input:
messages:
- content: Which planet do humans live on?
role: fake_role
output:
error:
status_code: 400
- case_id: "tool_choice_invalid"
input:
messages:
- content: Which planet do humans live on?
role: user
tool_choice: invalid
output:
error:
status_code: 400
- case_id: "tool_choice_no_tools"
input:
messages:
- content: Which planet do humans live on?
role: user
tool_choice: required
output:
error:
status_code: 400
- case_id: "tools_type_invalid"
input:
messages:
- content: Which planet do humans live on?
role: user
tools:
- type: invalid
output:
error:
status_code: 400
test_chat_image:
test_name: test_chat_image
test_params:
case:
- input:
messages:
- content:
- text: What is in this image?
type: text
- image_url:
url: https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg
type: image_url
role: user
output: llama
test_chat_structured_output:
test_name: test_chat_structured_output
test_params:
case:
- case_id: "calendar"
input:
messages:
- content: Extract the event information.
role: system
- content: Alice and Bob are going to a science fair on Friday.
role: user
response_format:
json_schema:
name: calendar_event
schema:
properties:
date:
title: Date
type: string
name:
title: Name
type: string
participants:
items:
type: string
title: Participants
type: array
required:
- name
- date
- participants
title: CalendarEvent
type: object
type: json_schema
output: valid_calendar_event
- case_id: "math"
input:
messages:
- content: You are a helpful math tutor. Guide the user through the solution
step by step.
role: system
- content: how can I solve 8x + 7 = -23
role: user
response_format:
json_schema:
name: math_reasoning
schema:
$defs:
Step:
properties:
explanation:
title: Explanation
type: string
output:
title: Output
type: string
required:
- explanation
- output
title: Step
type: object
properties:
final_answer:
title: Final Answer
type: string
steps:
items:
$ref: '#/$defs/Step'
title: Steps
type: array
required:
- steps
- final_answer
title: MathReasoning
type: object
type: json_schema
output: valid_math_reasoning
test_tool_calling:
test_name: test_tool_calling
test_params:
case:
- input:
messages:
- content: You are a helpful assistant that can use tools to get information.
role: system
- content: What's the weather like in San Francisco?
role: user
tools:
- function:
description: Get current temperature for a given location.
name: get_weather
parameters:
additionalProperties: false
properties:
location:
description: "City and country e.g. Bogot\xE1, Colombia"
type: string
required:
- location
type: object
type: function
output: get_weather_tool_call
test_chat_multi_turn_tool_calling:
test_name: test_chat_multi_turn_tool_calling
test_params:
case:
- case_id: "text_then_weather_tool"
input:
messages:
- - role: user
content: "What's the name of the Sun in latin?"
- - role: user
content: "What's the weather like in San Francisco?"
tools:
- function:
description: Get the current weather
name: get_weather
parameters:
type: object
properties:
location:
description: "The city and state (both required), e.g. San Francisco, CA."
type: string
required: ["location"]
type: function
tool_responses:
- response: "{'response': '70 degrees and foggy'}"
expected:
- num_tool_calls: 0
answer: ["sol"]
- num_tool_calls: 1
tool_name: get_weather
tool_arguments:
location: "San Francisco, CA"
- num_tool_calls: 0
answer: ["foggy", "70 degrees"]
- case_id: "weather_tool_then_text"
input:
messages:
- - role: user
content: "What's the weather like in San Francisco?"
tools:
- function:
description: Get the current weather
name: get_weather
parameters:
type: object
properties:
location:
description: "The city and state (both required), e.g. San Francisco, CA."
type: string
required: ["location"]
type: function
tool_responses:
- response: "{'response': '70 degrees and foggy'}"
expected:
- num_tool_calls: 1
tool_name: get_weather
tool_arguments:
location: "San Francisco, CA"
- num_tool_calls: 0
answer: ["foggy", "70 degrees"]
- case_id: "add_product_tool"
input:
messages:
- - role: user
content: "Please add a new product with name 'Widget', price 19.99, in stock, and tags ['new', 'sale'] and give me the product id."
tools:
- function:
description: Add a new product
name: addProduct
parameters:
type: object
properties:
name:
description: "Name of the product"
type: string
price:
description: "Price of the product"
type: number
inStock:
description: "Availability status of the product."
type: boolean
tags:
description: "List of product tags"
type: array
items:
type: string
required: ["name", "price", "inStock"]
type: function
tool_responses:
- response: "{'response': 'Successfully added product with id: 123'}"
expected:
- num_tool_calls: 1
tool_name: addProduct
tool_arguments:
name: "Widget"
price: 19.99
inStock: true
tags:
- "new"
- "sale"
- num_tool_calls: 0
answer: ["123", "product id: 123"]
- case_id: "get_then_create_event_tool"
input:
messages:
- - role: system
content: "Todays date is 2025-03-01."
- role: user
content: "Do i have any meetings on March 3rd at 10 am? Yes or no?"
- - role: user
content: "Alright then, Create an event named 'Team Building', scheduled for that time same time, in the 'Main Conference Room' and add Alice, Bob, Charlie to it. Give me the created event id."
tools:
- function:
description: Create a new event
name: create_event
parameters:
type: object
properties:
name:
description: "Name of the event"
type: string
date:
description: "Date of the event in ISO format"
type: string
time:
description: "Event Time (HH:MM)"
type: string
location:
description: "Location of the event"
type: string
participants:
description: "List of participant names"
type: array
items:
type: string
required: ["name", "date", "time", "location", "participants"]
type: function
- function:
description: Get an event by date and time
name: get_event
parameters:
type: object
properties:
date:
description: "Date of the event in ISO format"
type: string
time:
description: "Event Time (HH:MM)"
type: string
required: ["date", "time"]
type: function
tool_responses:
- response: "{'response': 'No events found for 2025-03-03 at 10:00'}"
- response: "{'response': 'Successfully created new event with id: e_123'}"
expected:
- num_tool_calls: 1
tool_name: get_event
tool_arguments:
date: "2025-03-03"
time: "10:00"
- num_tool_calls: 0
answer: ["no", "no events found", "no meetings"]
- num_tool_calls: 1
tool_name: create_event
tool_arguments:
name: "Team Building"
date: "2025-03-03"
time: "10:00"
location: "Main Conference Room"
participants:
- "Alice"
- "Bob"
- "Charlie"
- num_tool_calls: 0
answer: ["e_123", "event id: e_123"]
- case_id: "compare_monthly_expense_tool"
input:
messages:
- - role: system
content: "Todays date is 2025-03-01."
- role: user
content: "what was my monthly expense in Jan of this year?"
- - role: user
content: "Was it less than Feb of last year? Only answer with yes or no."
tools:
- function:
description: Get monthly expense summary
name: getMonthlyExpenseSummary
parameters:
type: object
properties:
month:
description: "Month of the year (1-12)"
type: integer
year:
description: "Year"
type: integer
required: ["month", "year"]
type: function
tool_responses:
- response: "{'response': 'Total expenses for January 2025: $1000'}"
- response: "{'response': 'Total expenses for February 2024: $2000'}"
expected:
- num_tool_calls: 1
tool_name: getMonthlyExpenseSummary
tool_arguments:
month: 1
year: 2025
- num_tool_calls: 0
answer: ["1000", "$1,000", "1,000"]
- num_tool_calls: 1
tool_name: getMonthlyExpenseSummary
tool_arguments:
month: 2
year: 2024
- num_tool_calls: 0
answer: ["yes"]


@@ -1,166 +0,0 @@
test_response_basic:
test_name: test_response_basic
test_params:
case:
- case_id: "earth"
input: "Which planet do humans live on?"
output: "earth"
- case_id: "saturn"
input: "Which planet has rings around it with a name starting with letter S?"
output: "saturn"
- case_id: "image_input"
input:
- role: user
content:
- type: input_text
text: "what teams are playing in this image?"
- role: user
content:
- type: input_image
image_url: "https://upload.wikimedia.org/wikipedia/commons/3/3b/LeBron_James_Layup_%28Cleveland_vs_Brooklyn_2018%29.jpg"
output: "brooklyn nets"
test_response_multi_turn:
test_name: test_response_multi_turn
test_params:
case:
- case_id: "earth"
turns:
- input: "Which planet do humans live on?"
output: "earth"
- input: "What is the name of the planet from your previous response?"
output: "earth"
test_response_web_search:
test_name: test_response_web_search
test_params:
case:
- case_id: "llama_experts"
input: "How many experts does the Llama 4 Maverick model have?"
tools:
- type: web_search
search_context_size: "low"
output: "128"
test_response_file_search:
test_name: test_response_file_search
test_params:
case:
- case_id: "llama_experts"
input: "How many experts does the Llama 4 Maverick model have?"
tools:
- type: file_search
# vector_store_ids param for file_search tool gets added by the test runner
file_content: "Llama 4 Maverick has 128 experts"
output: "128"
- case_id: "llama_experts_pdf"
input: "How many experts does the Llama 4 Maverick model have?"
tools:
- type: file_search
            # vector_store_ids param for file_search tool gets added by the test runner
file_path: "pdfs/llama_stack_and_models.pdf"
output: "128"
test_response_mcp_tool:
test_name: test_response_mcp_tool
test_params:
case:
- case_id: "boiling_point_tool"
input: "What is the boiling point of myawesomeliquid in Celsius?"
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
output: "Hello, world!"
test_response_custom_tool:
test_name: test_response_custom_tool
test_params:
case:
- case_id: "sf_weather"
input: "What's the weather like in San Francisco?"
tools:
- type: function
name: get_weather
description: Get current temperature for a given location.
parameters:
additionalProperties: false
properties:
location:
description: "City and country e.g. Bogot\xE1, Colombia"
type: string
required:
- location
type: object
test_response_image:
test_name: test_response_image
test_params:
case:
- case_id: "llama_image"
input:
- role: user
content:
- type: input_text
text: "Identify the type of animal in this image."
- type: input_image
image_url: "https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg"
output: "llama"
# the models are really poor at tool calling after seeing images :/
test_response_multi_turn_image:
test_name: test_response_multi_turn_image
test_params:
case:
- case_id: "llama_image_understanding"
turns:
- input:
- role: user
content:
- type: input_text
text: "What type of animal is in this image? Please respond with a single word that starts with the letter 'L'."
- type: input_image
image_url: "https://upload.wikimedia.org/wikipedia/commons/f/f7/Llamas%2C_Vernagt-Stausee%2C_Italy.jpg"
output: "llama"
- input: "What country do you find this animal primarily in? What continent?"
output: "peru"
test_response_multi_turn_tool_execution:
test_name: test_response_multi_turn_tool_execution
test_params:
case:
- case_id: "user_file_access_check"
input: "I need to check if user 'alice' can access the file 'document.txt'. First, get alice's user ID, then check if that user ID can access the file 'document.txt'. Do this as a series of steps, where each step is a separate message. Return only one tool call per step. Summarize the final result with a single 'yes' or 'no' response."
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
output: "yes"
- case_id: "experiment_results_lookup"
input: "I need to get the results for the 'boiling_point' experiment. First, get the experiment ID for 'boiling_point', then use that ID to get the experiment results. Tell me the boiling point in Celsius."
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
output: "100°C"
test_response_multi_turn_tool_execution_streaming:
test_name: test_response_multi_turn_tool_execution_streaming
test_params:
case:
- case_id: "user_permissions_workflow"
input: "Help me with this security check: First, get the user ID for 'charlie', then get the permissions for that user ID, and finally check if that user can access 'secret_file.txt'. Stream your progress as you work through each step. Return only one tool call per step. Summarize the final result with a single 'yes' or 'no' response."
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
stream: true
output: "no"
- case_id: "experiment_analysis_streaming"
input: "I need a complete analysis: First, get the experiment ID for 'chemical_reaction', then get the results for that experiment, and tell me if the yield was above 80%. Return only one tool call per step. Please stream your analysis process."
tools:
- type: mcp
server_label: "localmcp"
server_url: "<FILLED_BY_TEST_RUNNER>"
stream: true
output: "85%"


@@ -0,0 +1,64 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import time
def new_vector_store(openai_client, name):
"""Create a new vector store, cleaning up any existing one with the same name."""
# Ensure we don't reuse an existing vector store
vector_stores = openai_client.vector_stores.list()
for vector_store in vector_stores:
if vector_store.name == name:
openai_client.vector_stores.delete(vector_store_id=vector_store.id)
# Create a new vector store
vector_store = openai_client.vector_stores.create(name=name)
return vector_store
def upload_file(openai_client, name, file_path):
"""Upload a file, cleaning up any existing file with the same name."""
# Ensure we don't reuse an existing file
files = openai_client.files.list()
for file in files:
if file.filename == name:
openai_client.files.delete(file_id=file.id)
# Upload a text file with our document content
return openai_client.files.create(file=open(file_path, "rb"), purpose="assistants")
def wait_for_file_attachment(compat_client, vector_store_id, file_id):
"""Wait for a file to be attached to a vector store."""
file_attach_response = compat_client.vector_stores.files.retrieve(
vector_store_id=vector_store_id,
file_id=file_id,
)
while file_attach_response.status == "in_progress":
time.sleep(0.1)
file_attach_response = compat_client.vector_stores.files.retrieve(
vector_store_id=vector_store_id,
file_id=file_id,
)
assert file_attach_response.status == "completed", f"Expected file to be attached, got {file_attach_response}"
assert not file_attach_response.last_error
return file_attach_response
def setup_mcp_tools(tools, mcp_server_info):
"""Replace placeholder MCP server URLs with actual server info."""
# Create a deep copy to avoid modifying the original test case
import copy
tools_copy = copy.deepcopy(tools)
for tool in tools_copy:
if tool["type"] == "mcp" and tool["server_url"] == "<FILLED_BY_TEST_RUNNER>":
tool["server_url"] = mcp_server_info["server_url"]
return tools_copy
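The vector-store helpers above are meant to be chained: recreate a store, upload a document, attach it, and block until indexing reports `completed` (`setup_mcp_tools` is used separately to point MCP tool definitions at a live test server). A rough usage sketch, assuming an OpenAI-compatible client and placeholder endpoint and file path:

```python
# Rough usage sketch for the helpers above; the endpoint, API key, and file path
# are placeholders, not values used by the test suite.
from openai import OpenAI

from helpers import new_vector_store, upload_file, wait_for_file_attachment

client = OpenAI(base_url="http://localhost:8321/v1", api_key="unused")  # placeholder endpoint

vector_store = new_vector_store(client, "scratch_store")
uploaded = upload_file(client, "notes.txt", "/tmp/notes.txt")  # hypothetical file

# Attach the uploaded file to the store, then poll until indexing is finished.
client.vector_stores.files.create(vector_store_id=vector_store.id, file_id=uploaded.id)
wait_for_file_attachment(client, vector_store.id, uploaded.id)
```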


@@ -0,0 +1,145 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
from typing import Any
class StreamingValidator:
"""Helper class for validating streaming response events."""
def __init__(self, chunks: list[Any]):
self.chunks = chunks
self.event_types = [chunk.type for chunk in chunks]
def assert_basic_event_sequence(self):
"""Verify basic created -> completed event sequence."""
assert len(self.chunks) >= 2, f"Expected at least 2 chunks (created + completed), got {len(self.chunks)}"
assert self.chunks[0].type == "response.created", (
f"First chunk should be response.created, got {self.chunks[0].type}"
)
assert self.chunks[-1].type == "response.completed", (
f"Last chunk should be response.completed, got {self.chunks[-1].type}"
)
# Verify event order
created_index = self.event_types.index("response.created")
completed_index = self.event_types.index("response.completed")
assert created_index < completed_index, "response.created should come before response.completed"
def assert_response_consistency(self):
"""Verify response ID consistency across events."""
response_ids = set()
for chunk in self.chunks:
if hasattr(chunk, "response_id"):
response_ids.add(chunk.response_id)
elif hasattr(chunk, "response") and hasattr(chunk.response, "id"):
response_ids.add(chunk.response.id)
assert len(response_ids) == 1, f"All events should reference the same response_id, found: {response_ids}"
def assert_has_incremental_content(self):
"""Verify that content is delivered incrementally via delta events."""
delta_events = [
i for i, event_type in enumerate(self.event_types) if event_type == "response.output_text.delta"
]
assert len(delta_events) > 0, "Expected delta events for true incremental streaming, but found none"
# Verify delta events have content
non_empty_deltas = 0
delta_content_total = ""
for delta_idx in delta_events:
chunk = self.chunks[delta_idx]
if hasattr(chunk, "delta") and chunk.delta:
delta_content_total += chunk.delta
non_empty_deltas += 1
assert non_empty_deltas > 0, "Delta events found but none contain content"
assert len(delta_content_total) > 0, "Delta events found but total delta content is empty"
return delta_content_total
def assert_content_quality(self, expected_content: str):
"""Verify the final response contains expected content."""
final_chunk = self.chunks[-1]
if hasattr(final_chunk, "response"):
output_text = final_chunk.response.output_text.lower().strip()
assert len(output_text) > 0, "Response should have content"
assert expected_content.lower() in output_text, f"Expected '{expected_content}' in response"
def assert_has_tool_calls(self):
"""Verify tool call streaming events are present."""
# Check for tool call events
delta_events = [
chunk
for chunk in self.chunks
if chunk.type in ["response.function_call_arguments.delta", "response.mcp_call.arguments.delta"]
]
done_events = [
chunk
for chunk in self.chunks
if chunk.type in ["response.function_call_arguments.done", "response.mcp_call.arguments.done"]
]
assert len(delta_events) > 0, f"Expected tool call delta events, got chunk types: {self.event_types}"
assert len(done_events) > 0, f"Expected tool call done events, got chunk types: {self.event_types}"
# Verify output item events
item_added_events = [chunk for chunk in self.chunks if chunk.type == "response.output_item.added"]
item_done_events = [chunk for chunk in self.chunks if chunk.type == "response.output_item.done"]
assert len(item_added_events) > 0, (
f"Expected response.output_item.added events, got chunk types: {self.event_types}"
)
assert len(item_done_events) > 0, (
f"Expected response.output_item.done events, got chunk types: {self.event_types}"
)
def assert_has_mcp_events(self):
"""Verify MCP-specific streaming events are present."""
# Tool execution progress events
mcp_in_progress_events = [chunk for chunk in self.chunks if chunk.type == "response.mcp_call.in_progress"]
mcp_completed_events = [chunk for chunk in self.chunks if chunk.type == "response.mcp_call.completed"]
assert len(mcp_in_progress_events) > 0, (
f"Expected response.mcp_call.in_progress events, got chunk types: {self.event_types}"
)
assert len(mcp_completed_events) > 0, (
f"Expected response.mcp_call.completed events, got chunk types: {self.event_types}"
)
# MCP list tools events
mcp_list_tools_in_progress_events = [
chunk for chunk in self.chunks if chunk.type == "response.mcp_list_tools.in_progress"
]
mcp_list_tools_completed_events = [
chunk for chunk in self.chunks if chunk.type == "response.mcp_list_tools.completed"
]
assert len(mcp_list_tools_in_progress_events) > 0, (
f"Expected response.mcp_list_tools.in_progress events, got chunk types: {self.event_types}"
)
assert len(mcp_list_tools_completed_events) > 0, (
f"Expected response.mcp_list_tools.completed events, got chunk types: {self.event_types}"
)
def assert_rich_streaming(self, min_chunks: int = 10):
"""Verify we have substantial streaming activity."""
assert len(self.chunks) > min_chunks, (
f"Expected rich streaming with many events, got only {len(self.chunks)} chunks"
)
def validate_event_structure(self):
"""Validate the structure of various event types."""
for chunk in self.chunks:
if chunk.type == "response.created":
assert chunk.response.status == "in_progress"
elif chunk.type == "response.completed":
assert chunk.response.status == "completed"
elif hasattr(chunk, "item_id"):
assert chunk.item_id, "Events with item_id should have non-empty item_id"
elif hasattr(chunk, "sequence_number"):
assert isinstance(chunk.sequence_number, int), "sequence_number should be an integer"
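The tests that use this class exhaust the streamed response into a list of chunks and then run the assertions over the whole list. Condensed, the pattern looks like the sketch below (`compat_client` and `text_model_id` are the usual fixtures; the test name and prompt are placeholders):

```python
# Condensed usage of StreamingValidator, mirroring the streaming tests below.
from streaming_assertions import StreamingValidator


def test_streaming_smoke(compat_client, text_model_id):
    stream = compat_client.responses.create(
        model=text_model_id,
        input="Say hello",  # placeholder prompt
        stream=True,
    )
    chunks = list(stream)

    validator = StreamingValidator(chunks)
    validator.assert_basic_event_sequence()  # response.created first, response.completed last
    validator.assert_response_consistency()  # every event references the same response id
    validator.assert_has_incremental_content()  # content arrives via output_text deltas
```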


@@ -0,0 +1,188 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import time
import pytest
from fixtures.test_cases import basic_test_cases, image_test_cases, multi_turn_image_test_cases, multi_turn_test_cases
from streaming_assertions import StreamingValidator
@pytest.mark.parametrize("case", basic_test_cases)
def test_response_non_streaming_basic(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
stream=False,
)
output_text = response.output_text.lower().strip()
assert len(output_text) > 0
assert case.expected.lower() in output_text
retrieved_response = compat_client.responses.retrieve(response_id=response.id)
assert retrieved_response.output_text == response.output_text
next_response = compat_client.responses.create(
model=text_model_id,
input="Repeat your previous response in all caps.",
previous_response_id=response.id,
)
next_output_text = next_response.output_text.strip()
assert case.expected.upper() in next_output_text
@pytest.mark.parametrize("case", basic_test_cases)
def test_response_streaming_basic(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
stream=True,
)
# Track events and timing to verify proper streaming
events = []
event_times = []
response_id = ""
start_time = time.time()
for chunk in response:
current_time = time.time()
event_times.append(current_time - start_time)
events.append(chunk)
if chunk.type == "response.created":
# Verify response.created is emitted first and immediately
assert len(events) == 1, "response.created should be the first event"
assert event_times[0] < 0.1, "response.created should be emitted immediately"
assert chunk.response.status == "in_progress"
response_id = chunk.response.id
elif chunk.type == "response.completed":
# Verify response.completed comes after response.created
assert len(events) >= 2, "response.completed should come after response.created"
assert chunk.response.status == "completed"
assert chunk.response.id == response_id, "Response ID should be consistent"
# Verify content quality
output_text = chunk.response.output_text.lower().strip()
assert len(output_text) > 0, "Response should have content"
assert case.expected.lower() in output_text, f"Expected '{case.expected}' in response"
# Use validator for common checks
validator = StreamingValidator(events)
validator.assert_basic_event_sequence()
validator.assert_response_consistency()
# Verify stored response matches streamed response
retrieved_response = compat_client.responses.retrieve(response_id=response_id)
final_event = events[-1]
assert retrieved_response.output_text == final_event.response.output_text
@pytest.mark.parametrize("case", basic_test_cases)
def test_response_streaming_incremental_content(compat_client, text_model_id, case):
"""Test that streaming actually delivers content incrementally, not just at the end."""
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
stream=True,
)
# Track all events and their content to verify incremental streaming
events = []
content_snapshots = []
event_times = []
start_time = time.time()
for chunk in response:
current_time = time.time()
event_times.append(current_time - start_time)
events.append(chunk)
# Track content at each event based on event type
if chunk.type == "response.output_text.delta":
# For delta events, track the delta content
content_snapshots.append(chunk.delta)
elif hasattr(chunk, "response") and hasattr(chunk.response, "output_text"):
# For response.created/completed events, track the full output_text
content_snapshots.append(chunk.response.output_text)
else:
content_snapshots.append("")
validator = StreamingValidator(events)
validator.assert_basic_event_sequence()
# Check if we have incremental content updates
event_types = [event.type for event in events]
created_index = event_types.index("response.created")
completed_index = event_types.index("response.completed")
# The key test: verify content progression
created_content = content_snapshots[created_index]
completed_content = content_snapshots[completed_index]
# Verify that response.created has empty or minimal content
assert len(created_content) == 0, f"response.created should have empty content, got: {repr(created_content[:100])}"
# Verify that response.completed has the full content
assert len(completed_content) > 0, "response.completed should have content"
assert case.expected.lower() in completed_content.lower(), f"Expected '{case.expected}' in final content"
# Use validator for incremental content checks
delta_content_total = validator.assert_has_incremental_content()
# Verify that the accumulated delta content matches the final content
assert delta_content_total.strip() == completed_content.strip(), (
f"Delta content '{delta_content_total}' should match final content '{completed_content}'"
)
# Verify timing: delta events should come between created and completed
delta_events = [i for i, event_type in enumerate(event_types) if event_type == "response.output_text.delta"]
for delta_idx in delta_events:
assert created_index < delta_idx < completed_index, (
f"Delta event at index {delta_idx} should be between created ({created_index}) and completed ({completed_index})"
)
@pytest.mark.parametrize("case", multi_turn_test_cases)
def test_response_non_streaming_multi_turn(compat_client, text_model_id, case):
previous_response_id = None
for turn_input, turn_expected in case.turns:
response = compat_client.responses.create(
model=text_model_id,
input=turn_input,
previous_response_id=previous_response_id,
)
previous_response_id = response.id
output_text = response.output_text.lower()
assert turn_expected.lower() in output_text
@pytest.mark.parametrize("case", image_test_cases)
def test_response_non_streaming_image(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
stream=False,
)
output_text = response.output_text.lower()
assert case.expected.lower() in output_text
@pytest.mark.parametrize("case", multi_turn_image_test_cases)
def test_response_non_streaming_multi_turn_image(compat_client, text_model_id, case):
previous_response_id = None
for turn_input, turn_expected in case.turns:
response = compat_client.responses.create(
model=text_model_id,
input=turn_input,
previous_response_id=previous_response_id,
)
previous_response_id = response.id
output_text = response.output_text.lower()
assert turn_expected.lower() in output_text


@@ -0,0 +1,318 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import json
import time
import pytest
from llama_stack import LlamaStackAsLibraryClient
from .helpers import new_vector_store, upload_file
@pytest.mark.parametrize(
"text_format",
# Not testing json_object because most providers don't actually support it.
[
{"type": "text"},
{
"type": "json_schema",
"name": "capitals",
"description": "A schema for the capital of each country",
"schema": {"type": "object", "properties": {"capital": {"type": "string"}}},
"strict": True,
},
],
)
def test_response_text_format(compat_client, text_model_id, text_format):
if isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("Responses API text format is not yet supported in library client.")
stream = False
response = compat_client.responses.create(
model=text_model_id,
input="What is the capital of France?",
stream=stream,
text={"format": text_format},
)
# by_alias=True is needed because otherwise Pydantic renames our "schema" field
assert response.text.format.model_dump(exclude_none=True, by_alias=True) == text_format
assert "paris" in response.output_text.lower()
if text_format["type"] == "json_schema":
assert "paris" in json.loads(response.output_text)["capital"].lower()
@pytest.fixture
def vector_store_with_filtered_files(compat_client, text_model_id, tmp_path_factory):
"""Create a vector store with multiple files that have different attributes for filtering tests."""
if isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("Responses API file search is not yet supported in library client.")
vector_store = new_vector_store(compat_client, "test_vector_store_with_filters")
tmp_path = tmp_path_factory.mktemp("filter_test_files")
# Create multiple files with different attributes
files_data = [
{
"name": "us_marketing_q1.txt",
"content": "US promotional campaigns for Q1 2023. Revenue increased by 15% in the US region.",
"attributes": {
"region": "us",
"category": "marketing",
"date": 1672531200, # Jan 1, 2023
},
},
{
"name": "us_engineering_q2.txt",
"content": "US technical updates for Q2 2023. New features deployed in the US region.",
"attributes": {
"region": "us",
"category": "engineering",
"date": 1680307200, # Apr 1, 2023
},
},
{
"name": "eu_marketing_q1.txt",
"content": "European advertising campaign results for Q1 2023. Strong growth in EU markets.",
"attributes": {
"region": "eu",
"category": "marketing",
"date": 1672531200, # Jan 1, 2023
},
},
{
"name": "asia_sales_q3.txt",
"content": "Asia Pacific revenue figures for Q3 2023. Record breaking quarter in Asia.",
"attributes": {
"region": "asia",
"category": "sales",
"date": 1688169600, # Jul 1, 2023
},
},
]
file_ids = []
for file_data in files_data:
# Create file
file_path = tmp_path / file_data["name"]
file_path.write_text(file_data["content"])
# Upload file
file_response = upload_file(compat_client, file_data["name"], str(file_path))
file_ids.append(file_response.id)
# Attach file to vector store with attributes
file_attach_response = compat_client.vector_stores.files.create(
vector_store_id=vector_store.id,
file_id=file_response.id,
attributes=file_data["attributes"],
)
# Wait for attachment
while file_attach_response.status == "in_progress":
time.sleep(0.1)
file_attach_response = compat_client.vector_stores.files.retrieve(
vector_store_id=vector_store.id,
file_id=file_response.id,
)
assert file_attach_response.status == "completed"
yield vector_store
# Cleanup: delete vector store and files
try:
compat_client.vector_stores.delete(vector_store_id=vector_store.id)
for file_id in file_ids:
try:
compat_client.files.delete(file_id=file_id)
except Exception:
pass # File might already be deleted
except Exception:
pass # Best effort cleanup
def test_response_file_search_filter_by_region(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with region equality filter."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {"type": "eq", "key": "region", "value": "us"},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="What are the updates from the US region?",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
# Verify file search was called with US filter
assert len(response.output) > 1
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should only return US files (not EU or Asia files)
for result in response.output[0].results:
assert "us" in result.text.lower() or "US" in result.text
# Ensure non-US regions are NOT returned
assert "european" not in result.text.lower()
assert "asia" not in result.text.lower()
def test_response_file_search_filter_by_category(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with category equality filter."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {"type": "eq", "key": "category", "value": "marketing"},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="Show me all marketing reports",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should only return marketing files (not engineering or sales)
for result in response.output[0].results:
# Marketing files should have promotional/advertising content
assert "promotional" in result.text.lower() or "advertising" in result.text.lower()
# Ensure non-marketing categories are NOT returned
assert "technical" not in result.text.lower()
assert "revenue figures" not in result.text.lower()
def test_response_file_search_filter_by_date_range(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with date range filter using compound AND."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {
"type": "and",
"filters": [
{
"type": "gte",
"key": "date",
"value": 1672531200, # Jan 1, 2023
},
{
"type": "lt",
"key": "date",
"value": 1680307200, # Apr 1, 2023
},
],
},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="What happened in Q1 2023?",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should only return Q1 files (not Q2 or Q3)
for result in response.output[0].results:
assert "q1" in result.text.lower()
# Ensure non-Q1 quarters are NOT returned
assert "q2" not in result.text.lower()
assert "q3" not in result.text.lower()
def test_response_file_search_filter_compound_and(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with compound AND filter (region AND category)."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {
"type": "and",
"filters": [
{"type": "eq", "key": "region", "value": "us"},
{"type": "eq", "key": "category", "value": "engineering"},
],
},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="What are the engineering updates from the US?",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should only return US engineering files
assert len(response.output[0].results) >= 1
for result in response.output[0].results:
assert "us" in result.text.lower() and "technical" in result.text.lower()
# Ensure it's not from other regions or categories
assert "european" not in result.text.lower() and "asia" not in result.text.lower()
assert "promotional" not in result.text.lower() and "revenue" not in result.text.lower()
def test_response_file_search_filter_compound_or(compat_client, text_model_id, vector_store_with_filtered_files):
"""Test file search with compound OR filter (marketing OR sales)."""
tools = [
{
"type": "file_search",
"vector_store_ids": [vector_store_with_filtered_files.id],
"filters": {
"type": "or",
"filters": [
{"type": "eq", "key": "category", "value": "marketing"},
{"type": "eq", "key": "category", "value": "sales"},
],
},
}
]
response = compat_client.responses.create(
model=text_model_id,
input="Show me marketing and sales documents",
tools=tools,
stream=False,
include=["file_search_call.results"],
)
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].results
# Should return marketing and sales files, but NOT engineering
categories_found = set()
for result in response.output[0].results:
text_lower = result.text.lower()
if "promotional" in text_lower or "advertising" in text_lower:
categories_found.add("marketing")
if "revenue figures" in text_lower:
categories_found.add("sales")
# Ensure engineering files are NOT returned
assert "technical" not in text_lower, f"Engineering file should not be returned, but got: {result.text}"
# Verify we got at least one of the expected categories
assert len(categories_found) > 0, "Should have found at least one marketing or sales file"
assert categories_found.issubset({"marketing", "sales"}), f"Found unexpected categories: {categories_found}"

File diff suppressed because it is too large


@@ -0,0 +1,335 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
import json
import os
import httpx
import openai
import pytest
from fixtures.test_cases import (
custom_tool_test_cases,
file_search_test_cases,
mcp_tool_test_cases,
multi_turn_tool_execution_streaming_test_cases,
multi_turn_tool_execution_test_cases,
web_search_test_cases,
)
from helpers import new_vector_store, setup_mcp_tools, upload_file, wait_for_file_attachment
from streaming_assertions import StreamingValidator
from llama_stack import LlamaStackAsLibraryClient
from llama_stack.core.datatypes import AuthenticationRequiredError
from tests.common.mcp import dependency_tools, make_mcp_server
@pytest.mark.parametrize("case", web_search_test_cases)
def test_response_non_streaming_web_search(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=case.tools,
stream=False,
)
assert len(response.output) > 1
assert response.output[0].type == "web_search_call"
assert response.output[0].status == "completed"
assert response.output[1].type == "message"
assert response.output[1].status == "completed"
assert response.output[1].role == "assistant"
assert len(response.output[1].content) > 0
assert case.expected.lower() in response.output_text.lower().strip()
@pytest.mark.parametrize("case", file_search_test_cases)
def test_response_non_streaming_file_search(compat_client, text_model_id, tmp_path, case):
if isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("Responses API file search is not yet supported in library client.")
vector_store = new_vector_store(compat_client, "test_vector_store")
if case.file_content:
file_name = "test_response_non_streaming_file_search.txt"
file_path = tmp_path / file_name
file_path.write_text(case.file_content)
elif case.file_path:
file_path = os.path.join(os.path.dirname(__file__), "fixtures", case.file_path)
file_name = os.path.basename(file_path)
else:
raise ValueError("No file content or path provided for case")
file_response = upload_file(compat_client, file_name, file_path)
# Attach our file to the vector store
compat_client.vector_stores.files.create(
vector_store_id=vector_store.id,
file_id=file_response.id,
)
# Wait for the file to be attached
wait_for_file_attachment(compat_client, vector_store.id, file_response.id)
# Update our tools with the right vector store id
tools = case.tools
for tool in tools:
if tool["type"] == "file_search":
tool["vector_store_ids"] = [vector_store.id]
# Create the response request, which should query our vector store
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
include=["file_search_call.results"],
)
# Verify the file_search_tool was called
assert len(response.output) > 1
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].queries # ensure it's some non-empty list
assert response.output[0].results
assert case.expected.lower() in response.output[0].results[0].text.lower()
assert response.output[0].results[0].score > 0
# Verify the output_text generated by the response
assert case.expected.lower() in response.output_text.lower().strip()
def test_response_non_streaming_file_search_empty_vector_store(compat_client, text_model_id):
if isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("Responses API file search is not yet supported in library client.")
vector_store = new_vector_store(compat_client, "test_vector_store")
# Create the response request, which should query our vector store
response = compat_client.responses.create(
model=text_model_id,
input="How many experts does the Llama 4 Maverick model have?",
tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
stream=False,
include=["file_search_call.results"],
)
# Verify the file_search_tool was called
assert len(response.output) > 1
assert response.output[0].type == "file_search_call"
assert response.output[0].status == "completed"
assert response.output[0].queries # ensure it's some non-empty list
assert not response.output[0].results # ensure we don't get any results
# Verify some output_text was generated by the response
assert response.output_text
@pytest.mark.parametrize("case", mcp_tool_test_cases)
def test_response_non_streaming_mcp_tool(compat_client, text_model_id, case):
if not isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("in-process MCP server is only supported in library client")
with make_mcp_server() as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
)
assert len(response.output) >= 3
list_tools = response.output[0]
assert list_tools.type == "mcp_list_tools"
assert list_tools.server_label == "localmcp"
assert len(list_tools.tools) == 2
assert {t.name for t in list_tools.tools} == {
"get_boiling_point",
"greet_everyone",
}
call = response.output[1]
assert call.type == "mcp_call"
assert call.name == "get_boiling_point"
assert json.loads(call.arguments) == {
"liquid_name": "myawesomeliquid",
"celsius": True,
}
assert call.error is None
assert "-100" in call.output
# sometimes the model will call the tool again, so we need to get the last message
message = response.output[-1]
text_content = message.content[0].text
assert "boiling point" in text_content.lower()
with make_mcp_server(required_auth_token="test-token") as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
exc_type = (
AuthenticationRequiredError
if isinstance(compat_client, LlamaStackAsLibraryClient)
else (httpx.HTTPStatusError, openai.AuthenticationError)
)
with pytest.raises(exc_type):
compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
)
for tool in tools:
if tool["type"] == "mcp":
tool["headers"] = {"Authorization": "Bearer test-token"}
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=tools,
stream=False,
)
assert len(response.output) >= 3
@pytest.mark.parametrize("case", custom_tool_test_cases)
def test_response_non_streaming_custom_tool(compat_client, text_model_id, case):
response = compat_client.responses.create(
model=text_model_id,
input=case.input,
tools=case.tools,
stream=False,
)
assert len(response.output) == 1
assert response.output[0].type == "function_call"
assert response.output[0].status == "completed"
assert response.output[0].name == "get_weather"
@pytest.mark.parametrize("case", multi_turn_tool_execution_test_cases)
def test_response_non_streaming_multi_turn_tool_execution(compat_client, text_model_id, case):
"""Test multi-turn tool execution where multiple MCP tool calls are performed in sequence."""
if not isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("in-process MCP server is only supported in library client")
with make_mcp_server(tools=dependency_tools()) as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
response = compat_client.responses.create(
input=case.input,
model=text_model_id,
tools=tools,
)
# Verify we have MCP tool calls in the output
mcp_list_tools = [output for output in response.output if output.type == "mcp_list_tools"]
mcp_calls = [output for output in response.output if output.type == "mcp_call"]
message_outputs = [output for output in response.output if output.type == "message"]
# Should have exactly 1 MCP list tools message (at the beginning)
assert len(mcp_list_tools) == 1, f"Expected exactly 1 mcp_list_tools, got {len(mcp_list_tools)}"
assert mcp_list_tools[0].server_label == "localmcp"
assert len(mcp_list_tools[0].tools) == 5 # Updated for dependency tools
expected_tool_names = {
"get_user_id",
"get_user_permissions",
"check_file_access",
"get_experiment_id",
"get_experiment_results",
}
assert {t.name for t in mcp_list_tools[0].tools} == expected_tool_names
assert len(mcp_calls) >= 1, f"Expected at least 1 mcp_call, got {len(mcp_calls)}"
for mcp_call in mcp_calls:
assert mcp_call.error is None, f"MCP call should not have errors, got: {mcp_call.error}"
assert len(message_outputs) >= 1, f"Expected at least 1 message output, got {len(message_outputs)}"
final_message = message_outputs[-1]
assert final_message.role == "assistant", f"Final message should be from assistant, got {final_message.role}"
assert final_message.status == "completed", f"Final message should be completed, got {final_message.status}"
assert len(final_message.content) > 0, "Final message should have content"
expected_output = case.expected
assert expected_output.lower() in response.output_text.lower(), (
f"Expected '{expected_output}' to appear in response: {response.output_text}"
)
@pytest.mark.parametrize("case", multi_turn_tool_execution_streaming_test_cases)
def test_response_streaming_multi_turn_tool_execution(compat_client, text_model_id, case):
"""Test streaming multi-turn tool execution where multiple MCP tool calls are performed in sequence."""
if not isinstance(compat_client, LlamaStackAsLibraryClient):
pytest.skip("in-process MCP server is only supported in library client")
with make_mcp_server(tools=dependency_tools()) as mcp_server_info:
tools = setup_mcp_tools(case.tools, mcp_server_info)
stream = compat_client.responses.create(
input=case.input,
model=text_model_id,
tools=tools,
stream=True,
)
chunks = []
for chunk in stream:
chunks.append(chunk)
# Use validator for common streaming checks
validator = StreamingValidator(chunks)
validator.assert_basic_event_sequence()
validator.assert_response_consistency()
validator.assert_has_tool_calls()
validator.assert_has_mcp_events()
validator.assert_rich_streaming()
# Get the final response from the last chunk
final_chunk = chunks[-1]
if hasattr(final_chunk, "response"):
final_response = final_chunk.response
# Verify multi-turn MCP tool execution results
mcp_list_tools = [output for output in final_response.output if output.type == "mcp_list_tools"]
mcp_calls = [output for output in final_response.output if output.type == "mcp_call"]
message_outputs = [output for output in final_response.output if output.type == "message"]
# Should have exactly 1 MCP list tools message (at the beginning)
assert len(mcp_list_tools) == 1, f"Expected exactly 1 mcp_list_tools, got {len(mcp_list_tools)}"
assert mcp_list_tools[0].server_label == "localmcp"
assert len(mcp_list_tools[0].tools) == 5 # Updated for dependency tools
expected_tool_names = {
"get_user_id",
"get_user_permissions",
"check_file_access",
"get_experiment_id",
"get_experiment_results",
}
assert {t.name for t in mcp_list_tools[0].tools} == expected_tool_names
# Should have at least 1 MCP call (the model should call at least one tool)
assert len(mcp_calls) >= 1, f"Expected at least 1 mcp_call, got {len(mcp_calls)}"
# All MCP calls should be completed (verifies our tool execution works)
for mcp_call in mcp_calls:
assert mcp_call.error is None, f"MCP call should not have errors, got: {mcp_call.error}"
# Should have at least one final message response
assert len(message_outputs) >= 1, f"Expected at least 1 message output, got {len(message_outputs)}"
# Final message should be from assistant and completed
final_message = message_outputs[-1]
assert final_message.role == "assistant", (
f"Final message should be from assistant, got {final_message.role}"
)
assert final_message.status == "completed", f"Final message should be completed, got {final_message.status}"
assert len(final_message.content) > 0, "Final message should have content"
# Check that the expected output appears in the response
expected_output = case.expected
assert expected_output.lower() in final_response.output_text.lower(), (
f"Expected '{expected_output}' to appear in response: {final_response.output_text}"
)


@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://0.0.0.0:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"messages": [
{
"role": "user",
"content": "Quick test"
}
],
"max_tokens": 5
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-651",
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"message": {
"content": "I'm ready to help",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755294941,
"model": "llama3.2:3b-instruct-fp16",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 5,
"prompt_tokens": 27,
"total_tokens": 32,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}
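Recordings like this pair the exact request body with the serialized response object, which makes replay failures easy to inspect without a running Ollama. A quick way to peek inside one (the file path below is illustrative):

```python
# Inspect a recorded request/response pair. The path is illustrative; streaming
# recordings store a list of chunks under "body" instead of a single object.
import json
from pathlib import Path

recording = json.loads(Path("tests/integration/recordings/example.json").read_text())

req, resp = recording["request"], recording["response"]
print(req["method"], req["endpoint"], req["model"])
if not resp["is_streaming"]:
    print(resp["body"]["__type__"])  # e.g. openai.types.chat.chat_completion.ChatCompletion
```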


@@ -0,0 +1,56 @@
{
"request": {
"method": "POST",
"url": "http://0.0.0.0:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"messages": [
{
"role": "user",
"content": "Say hello"
}
],
"max_tokens": 20
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": {
"__type__": "openai.types.chat.chat_completion.ChatCompletion",
"__data__": {
"id": "chatcmpl-987",
"choices": [
{
"finish_reason": "length",
"index": 0,
"logprobs": null,
"message": {
"content": "Hello! It's nice to meet you. Is there something I can help you with or would you",
"refusal": null,
"role": "assistant",
"annotations": null,
"audio": null,
"function_call": null,
"tool_calls": null
}
}
],
"created": 1755294921,
"model": "llama3.2:3b-instruct-fp16",
"object": "chat.completion",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": {
"completion_tokens": 20,
"prompt_tokens": 27,
"total_tokens": 47,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
},
"is_streaming": false
}
}


@@ -14,7 +14,7 @@
"models": [
{
"model": "nomic-embed-text:latest",
"modified_at": "2025-08-05T14:04:07.946926-07:00",
"modified_at": "2025-08-18T12:47:56.732989-07:00",
"digest": "0a109f422b47e3a30ba2b10eca18548e944e8a23073ee3f3e947efcf3c45e59f",
"size": 274302450,
"details": {


@@ -21,7 +21,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.141947Z",
"created_at": "2025-08-15T20:24:49.18651486Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -39,7 +39,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.194979Z",
"created_at": "2025-08-15T20:24:49.370611348Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -57,7 +57,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.248312Z",
"created_at": "2025-08-15T20:24:49.557000029Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -75,7 +75,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.301911Z",
"created_at": "2025-08-15T20:24:49.746777116Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -93,7 +93,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.354437Z",
"created_at": "2025-08-15T20:24:49.942233333Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -111,7 +111,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.406821Z",
"created_at": "2025-08-15T20:24:50.126788846Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -129,7 +129,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.457633Z",
"created_at": "2025-08-15T20:24:50.311346131Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -147,7 +147,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.507857Z",
"created_at": "2025-08-15T20:24:50.501507173Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -165,7 +165,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.558847Z",
"created_at": "2025-08-15T20:24:50.692296777Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -183,7 +183,7 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.609969Z",
"created_at": "2025-08-15T20:24:50.878846539Z",
"done": false,
"done_reason": null,
"total_duration": null,
@@ -201,15 +201,15 @@
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-04T22:55:14.660997Z",
"created_at": "2025-08-15T20:24:51.063200561Z",
"done": true,
"done_reason": "stop",
"total_duration": 715356542,
"load_duration": 59747500,
"total_duration": 33982453650,
"load_duration": 2909001805,
"prompt_eval_count": 341,
"prompt_eval_duration": 128000000,
"prompt_eval_duration": 29194357307,
"eval_count": 11,
"eval_duration": 526000000,
"eval_duration": 1878247732,
"response": "",
"thinking": null,
"context": null

File diff suppressed because it is too large


@@ -0,0 +1,203 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/api/generate",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"raw": true,
"prompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nGive me a sentence that contains the word: hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
"options": {
"temperature": 0.0
},
"stream": true
},
"endpoint": "/api/generate",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": [
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.267146Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": "Hello",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.309006Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": ",",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.351179Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " how",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.393262Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " can",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.436079Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " I",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.478393Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " assist",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.520608Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " you",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.562885Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": " today",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.604683Z",
"done": false,
"done_reason": null,
"total_duration": null,
"load_duration": null,
"prompt_eval_count": null,
"prompt_eval_duration": null,
"eval_count": null,
"eval_duration": null,
"response": "?",
"thinking": null,
"context": null
}
},
{
"__type__": "ollama._types.GenerateResponse",
"__data__": {
"model": "llama3.2:3b-instruct-fp16",
"created_at": "2025-08-18T19:47:58.646586Z",
"done": true,
"done_reason": "stop",
"total_duration": 1011323917,
"load_duration": 76575458,
"prompt_eval_count": 31,
"prompt_eval_duration": 553259250,
"eval_count": 10,
"eval_duration": 380302792,
"response": "",
"thinking": null,
"context": null
}
}
],
"is_streaming": true
}
}


@@ -13,12 +13,12 @@
"__data__": {
"models": [
{
"model": "llama3.2:3b",
"name": "llama3.2:3b",
"digest": "a80c4f17acd55265feec403c7aef86be0c25983ab279d83f3bcd3abbcb5b8b72",
"expires_at": "2025-08-06T15:57:21.573326-04:00",
"size": 4030033920,
"size_vram": 4030033920,
"model": "llama3.2:3b-instruct-fp16",
"name": "llama3.2:3b-instruct-fp16",
"digest": "195a8c01d91ec3cb1e0aad4624a51f2602c51fa7d96110f8ab5a20c84081804d",
"expires_at": "2025-08-18T13:47:44.262256-07:00",
"size": 7919570944,
"size_vram": 7919570944,
"details": {
"parent_model": "",
"format": "gguf",
@@ -27,7 +27,7 @@
"llama"
],
"parameter_size": "3.2B",
"quantization_level": "Q4_K_M"
"quantization_level": "F16"
}
}
]


@@ -0,0 +1,109 @@
{
"request": {
"method": "POST",
"url": "http://localhost:11434/v1/v1/chat/completions",
"headers": {},
"body": {
"model": "llama3.2:3b-instruct-fp16",
"messages": [
{
"role": "user",
"content": "What's the weather in Tokyo? YOU MUST USE THE get_weather function to get the weather."
}
],
"response_format": {
"type": "text"
},
"stream": true,
"tools": [
{
"type": "function",
"function": {
"type": "function",
"name": "get_weather",
"description": "Get the weather in a given city",
"parameters": {
"type": "object",
"properties": {
"city": {
"type": "string",
"description": "The city to get the weather for"
}
}
},
"strict": null
}
}
]
},
"endpoint": "/v1/chat/completions",
"model": "llama3.2:3b-instruct-fp16"
},
"response": {
"body": [
{
"__type__": "openai.types.chat.chat_completion_chunk.ChatCompletionChunk",
"__data__": {
"id": "chatcmpl-620",
"choices": [
{
"delta": {
"content": "",
"function_call": null,
"refusal": null,
"role": "assistant",
"tool_calls": [
{
"index": 0,
"id": "call_490d5ur7",
"function": {
"arguments": "{\"city\":\"Tokyo\"}",
"name": "get_weather"
},
"type": "function"
}
]
},
"finish_reason": null,
"index": 0,
"logprobs": null
}
],
"created": 1755228972,
"model": "llama3.2:3b-instruct-fp16",
"object": "chat.completion.chunk",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": null
}
},
{
"__type__": "openai.types.chat.chat_completion_chunk.ChatCompletionChunk",
"__data__": {
"id": "chatcmpl-620",
"choices": [
{
"delta": {
"content": "",
"function_call": null,
"refusal": null,
"role": "assistant",
"tool_calls": null
},
"finish_reason": "tool_calls",
"index": 0,
"logprobs": null
}
],
"created": 1755228972,
"model": "llama3.2:3b-instruct-fp16",
"object": "chat.completion.chunk",
"service_tier": null,
"system_fingerprint": "fp_ollama",
"usage": null
}
}
],
"is_streaming": true
}
}


@@ -27,10 +27,10 @@ def data_url_from_image(file_path):
@pytest.fixture(scope="session")
def code_scanner_shield_id(available_shields):
if "CodeScanner" in available_shields:
return "CodeScanner"
if "code-scanner" in available_shields:
return "code-scanner"
pytest.skip("CodeScanner shield is not available. Skipping.")
pytest.skip("code-scanner shield is not available. Skipping.")
def test_unsafe_examples(client_with_models, shield_id):
@@ -163,6 +163,60 @@ def test_safety_with_code_scanner(client_with_models, code_scanner_shield_id, mo
assert response.violation.user_message == "Sorry, I found security concerns in the code."
def test_unsafe_input_with_code_scanner_run_moderation(client_with_models, code_scanner_shield_id, model_providers):
code_safety_enabled = len(CODE_SCANNER_ENABLED_PROVIDERS & model_providers) > 0
if not code_safety_enabled:
pytest.skip(f"Testing code scanner shields is not supported for model_providers {model_providers}")
insecure_code = [
"""
Here is an example of code that hashes a password using MD5:
```
import hashlib
password = 'securepassword123'.encode()
hashed_password = hashlib.md5(password).hexdigest()
print("Insecure hashed password:", hashed_password)
```
"""
]
moderation_object = client_with_models.moderations.create(
input=insecure_code,
model=code_scanner_shield_id,
)
assert moderation_object.results[0].flagged is True, f"Code scanner should have flagged {insecure_code} as insecure"
assert all(value is True for value in moderation_object.results[0].categories.values()), (
"Code scanner shield should have detected code insecure category"
)
def test_safe_input_with_code_scanner_run_moderation(client_with_models, code_scanner_shield_id, model_providers):
code_safety_enabled = len(CODE_SCANNER_ENABLED_PROVIDERS & model_providers) > 0
if not code_safety_enabled:
pytest.skip(f"Testing code scanner shields is not supported for model_providers {model_providers}")
secure_code = [
"""
Extract the first 5 characters from a string:
```
text = "Hello World"
first_five = text[:5]
print(first_five) # Output: "Hello"
# Safe handling for strings shorter than 5 characters
def get_first_five(text):
return text[:5] if text else ""
```
"""
]
moderation_object = client_with_models.moderations.create(
input=secure_code,
model=code_scanner_shield_id,
)
assert moderation_object.results[0].flagged is False, "Code scanner should not have flagged the code as insecure"
# We can use an instance of the LlamaGuard shield to detect attempts to misuse
# the interpreter as this is one of the existing categories it checks for
def test_safety_with_code_interpreter_abuse(client_with_models, shield_id):