llama-stack-mirror/tests/verifications/REPORT.md
ehhuang 32e3da7392
test(verification): more tests, multiturn tool use tests (#1954)
## Test Plan
(myenv) ➜ llama-stack python tests/verifications/generate_report.py --providers fireworks,together,openai --run-tests

2025-04-14 18:45:22 -07:00


Test Results Report

Generated on: 2025-04-14 18:11:37

This report was generated by running `python tests/verifications/generate_report.py`.

Legend

  • - Test passed
  • - Test failed
  • - Test not applicable or not run for this model

Summary

| Provider | Pass Rate | Tests Passed | Total Tests |
| --- | --- | --- | --- |
| Together | 48.7% | 37 | 76 |
| Fireworks | 47.4% | 36 | 76 |
| Openai | 100.0% | 52 | 52 |
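
The Pass Rate column can be cross-checked against the other two columns. The sketch below assumes the rate is simply tests passed divided by total tests, rounded to one decimal place. The report does not state the formula, so `pass_rate` is a hypothetical helper and not part of `generate_report.py`.

```python
# Figures copied from the Summary table: (tests passed, total tests).
summary = {
    "Together": (37, 76),
    "Fireworks": (36, 76),
    "Openai": (52, 52),
}


def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place.

    Assumption: rate = passed / total. A provider with no tests is
    reported as 0.0 to avoid dividing by zero.
    """
    if total == 0:
        return 0.0
    return round(100.0 * passed / total, 1)


for provider, (passed, total) in summary.items():
    print(f"{provider}: {pass_rate(passed, total):.1f}% ({passed}/{total})")
```

Under that assumption the computed values agree with the table: 48.7%, 47.4%, and 100.0%.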

Together

Tests run on: 2025-04-14 18:08:14

# Run all tests for this provider:
pytest tests/verifications/openai_api/test_chat_completion.py --provider=together -v

# Example: Run only the 'earth' case of test_chat_non_streaming_basic:
pytest tests/verifications/openai_api/test_chat_completion.py --provider=together -k "test_chat_non_streaming_basic and earth"
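
To illustrate what the `-k` expression in the example above selects, here is a deliberately simplified stand-in for pytest's keyword matching. It handles only terms joined by `and` and treats each term as a case-insensitive substring of the test ID. The function `matches_and_expression` and the test IDs are hypothetical, since the exact parametrized IDs are not shown in this report.

```python
def matches_and_expression(test_id: str, expression: str) -> bool:
    """Return True if every 'and'-joined term is a substring of test_id.

    Simplified sketch only: real pytest -k expressions also support
    'or', 'not', and parentheses, which are not handled here.
    """
    terms = [term.strip().lower() for term in expression.split(" and ")]
    return all(term in test_id.lower() for term in terms if term)


# Hypothetical test IDs; the real IDs depend on how the suite is parametrized.
test_ids = [
    "test_chat_non_streaming_basic[some-model-earth]",
    "test_chat_non_streaming_basic[some-model-saturn]",
    "test_chat_streaming_basic[some-model-earth]",
]

expression = "test_chat_non_streaming_basic and earth"
selected = [tid for tid in test_ids if matches_and_expression(tid, expression)]
print(selected)
```

With these assumed IDs, only the non-streaming `earth` case is selected, which is the narrowing the example command aims for.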

Model Key (Together)

| Display Name | Full Model ID |
| --- | --- |
| Llama-3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct-Turbo |
| Llama-4-Maverick-Instruct | meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 |
| Llama-4-Scout-Instruct | meta-llama/Llama-4-Scout-17B-16E-Instruct |

| Test | Llama-3.3-70B-Instruct | Llama-4-Maverick-Instruct | Llama-4-Scout-Instruct |
| --- | --- | --- | --- |
| test_chat_non_streaming_basic (earth) | | | |
| test_chat_non_streaming_basic (saturn) | | | |
| test_chat_non_streaming_image | | | |
| test_chat_non_streaming_multi_turn_tool_calling (add_product_tool) | | | |
| test_chat_non_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) | | | |
| test_chat_non_streaming_multi_turn_tool_calling (get_then_create_event_tool) | | | |
| test_chat_non_streaming_multi_turn_tool_calling (text_then_weather_tool) | | | |
| test_chat_non_streaming_multi_turn_tool_calling (weather_tool_then_text) | | | |
| test_chat_non_streaming_structured_output (calendar) | | | |
| test_chat_non_streaming_structured_output (math) | | | |
| test_chat_non_streaming_tool_calling | | | |
| test_chat_non_streaming_tool_choice_none | | | |
| test_chat_non_streaming_tool_choice_required | | | |
| test_chat_streaming_basic (earth) | | | |
| test_chat_streaming_basic (saturn) | | | |
| test_chat_streaming_image | | | |
| test_chat_streaming_multi_turn_tool_calling (add_product_tool) | | | |
| test_chat_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) | | | |
| test_chat_streaming_multi_turn_tool_calling (get_then_create_event_tool) | | | |
| test_chat_streaming_multi_turn_tool_calling (text_then_weather_tool) | | | |
| test_chat_streaming_multi_turn_tool_calling (weather_tool_then_text) | | | |
| test_chat_streaming_structured_output (calendar) | | | |
| test_chat_streaming_structured_output (math) | | | |
| test_chat_streaming_tool_calling | | | |
| test_chat_streaming_tool_choice_none | | | |
| test_chat_streaming_tool_choice_required | | | |

Fireworks

Tests run on: 2025-04-14 18:04:06

# Run all tests for this provider:
pytest tests/verifications/openai_api/test_chat_completion.py --provider=fireworks -v

# Example: Run only the 'earth' case of test_chat_non_streaming_basic:
pytest tests/verifications/openai_api/test_chat_completion.py --provider=fireworks -k "test_chat_non_streaming_basic and earth"

Model Key (Fireworks)

| Display Name | Full Model ID |
| --- | --- |
| Llama-3.3-70B-Instruct | accounts/fireworks/models/llama-v3p3-70b-instruct |
| Llama-4-Maverick-Instruct | accounts/fireworks/models/llama4-maverick-instruct-basic |
| Llama-4-Scout-Instruct | accounts/fireworks/models/llama4-scout-instruct-basic |

| Test | Llama-3.3-70B-Instruct | Llama-4-Maverick-Instruct | Llama-4-Scout-Instruct |
| --- | --- | --- | --- |
| test_chat_non_streaming_basic (earth) | | | |
| test_chat_non_streaming_basic (saturn) | | | |
| test_chat_non_streaming_image | | | |
| test_chat_non_streaming_multi_turn_tool_calling (add_product_tool) | | | |
| test_chat_non_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) | | | |
| test_chat_non_streaming_multi_turn_tool_calling (get_then_create_event_tool) | | | |
| test_chat_non_streaming_multi_turn_tool_calling (text_then_weather_tool) | | | |
| test_chat_non_streaming_multi_turn_tool_calling (weather_tool_then_text) | | | |
| test_chat_non_streaming_structured_output (calendar) | | | |
| test_chat_non_streaming_structured_output (math) | | | |
| test_chat_non_streaming_tool_calling | | | |
| test_chat_non_streaming_tool_choice_none | | | |
| test_chat_non_streaming_tool_choice_required | | | |
| test_chat_streaming_basic (earth) | | | |
| test_chat_streaming_basic (saturn) | | | |
| test_chat_streaming_image | | | |
| test_chat_streaming_multi_turn_tool_calling (add_product_tool) | | | |
| test_chat_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) | | | |
| test_chat_streaming_multi_turn_tool_calling (get_then_create_event_tool) | | | |
| test_chat_streaming_multi_turn_tool_calling (text_then_weather_tool) | | | |
| test_chat_streaming_multi_turn_tool_calling (weather_tool_then_text) | | | |
| test_chat_streaming_structured_output (calendar) | | | |
| test_chat_streaming_structured_output (math) | | | |
| test_chat_streaming_tool_calling | | | |
| test_chat_streaming_tool_choice_none | | | |
| test_chat_streaming_tool_choice_required | | | |

Openai

Tests run on: 2025-04-14 18:09:51

# Run all tests for this provider:
pytest tests/verifications/openai_api/test_chat_completion.py --provider=openai -v

# Example: Run only the 'earth' case of test_chat_non_streaming_basic:
pytest tests/verifications/openai_api/test_chat_completion.py --provider=openai -k "test_chat_non_streaming_basic and earth"

Model Key (Openai)

| Display Name | Full Model ID |
| --- | --- |
| gpt-4o | gpt-4o |
| gpt-4o-mini | gpt-4o-mini |

| Test | gpt-4o | gpt-4o-mini |
| --- | --- | --- |
| test_chat_non_streaming_basic (earth) | | |
| test_chat_non_streaming_basic (saturn) | | |
| test_chat_non_streaming_image | | |
| test_chat_non_streaming_multi_turn_tool_calling (add_product_tool) | | |
| test_chat_non_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) | | |
| test_chat_non_streaming_multi_turn_tool_calling (get_then_create_event_tool) | | |
| test_chat_non_streaming_multi_turn_tool_calling (text_then_weather_tool) | | |
| test_chat_non_streaming_multi_turn_tool_calling (weather_tool_then_text) | | |
| test_chat_non_streaming_structured_output (calendar) | | |
| test_chat_non_streaming_structured_output (math) | | |
| test_chat_non_streaming_tool_calling | | |
| test_chat_non_streaming_tool_choice_none | | |
| test_chat_non_streaming_tool_choice_required | | |
| test_chat_streaming_basic (earth) | | |
| test_chat_streaming_basic (saturn) | | |
| test_chat_streaming_image | | |
| test_chat_streaming_multi_turn_tool_calling (add_product_tool) | | |
| test_chat_streaming_multi_turn_tool_calling (compare_monthly_expense_tool) | | |
| test_chat_streaming_multi_turn_tool_calling (get_then_create_event_tool) | | |
| test_chat_streaming_multi_turn_tool_calling (text_then_weather_tool) | | |
| test_chat_streaming_multi_turn_tool_calling (weather_tool_then_text) | | |
| test_chat_streaming_structured_output (calendar) | | |
| test_chat_streaming_structured_output (math) | | |
| test_chat_streaming_tool_calling | | |
| test_chat_streaming_tool_choice_none | | |
| test_chat_streaming_tool_choice_required | | |