mirror of https://github.com/meta-llama/llama-stack.git, synced 2025-12-03 18:00:36 +00:00
reduce the accuracy requirements to pass the chat completion structured output test (#522)
i find `test_structured_output` to be flaky. it's both a functionality test
and an accuracy test -
```
answer = AnswerFormat.model_validate_json(response.completion_message.content)
assert answer.first_name == "Michael"
assert answer.last_name == "Jordan"
assert answer.year_of_birth == 1963
assert answer.num_seasons_in_nba == 15
```
it's an accuracy test because it checks the values of the first/last name,
birth year, and number of seasons.
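for reference, a minimal sketch of the schema the test validates against,
assuming `AnswerFormat` is a plain Pydantic model (field names taken from the
assertions above) -
```
from pydantic import BaseModel


class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int
```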
i find that -
- llama-3.1-8b-instruct and llama-3.2-3b-instruct pass the functionality
portion
- llama-3.2-3b-instruct consistently fails the accuracy portion
(thinking MJ was in the NBA for 14 seasons)
- llama-3.1-8b-instruct occasionally fails the accuracy portion
suggestions (not mutually exclusive) -
1. turn the test into a functionality-only test, skipping the value checks
2. split the test into a functionality version and an xfail accuracy
version (a sketch follows this list)
3. add context to the prompt so the llm can answer without relying on
knowledge embedded in its weights
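for illustration, a rough sketch of what option (2) could look like with
pytest's `xfail` marker, assuming the `AnswerFormat` model sketched above;
`get_mj_response()` is a hypothetical helper standing in for the
`chat_completion` call in the real test -
```
import pytest


def get_mj_response():
    # hypothetical helper: would wrap inference_impl.chat_completion(...)
    # and return the response object used by the real test
    raise NotImplementedError


def test_structured_output_functionality():
    response = get_mj_response()
    # functionality only: pass as long as the output parses against the schema
    AnswerFormat.model_validate_json(response.completion_message.content)


@pytest.mark.xfail(reason="small instruct models recall MJ's career details inconsistently")
def test_structured_output_accuracy():
    response = get_mj_response()
    answer = AnswerFormat.model_validate_json(response.completion_message.content)
    assert answer.first_name == "Michael"
    assert answer.last_name == "Jordan"
    assert answer.year_of_birth == 1963
    assert answer.num_seasons_in_nba == 15
```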
# What does this PR do?
implements option (3) by adding context to the system prompt.
## Test Plan
`pytest -s -v ... llama_stack/providers/tests/inference/ ... -k structured_output`
## Before submitting
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
This commit is contained in:
parent
4c7b1a8fb3
commit
435f34b05e
1 changed file with 9 additions and 1 deletion
```
@@ -211,7 +211,15 @@ class TestInference:
         response = await inference_impl.chat_completion(
             model_id=inference_model,
             messages=[
-                SystemMessage(content="You are a helpful assistant."),
+                # we include context about Michael Jordan in the prompt so that the test is
+                # focused on the functionality of the model and not on the information embedded
+                # in the model. Llama 3.2 3B Instruct tends to think MJ played for 14 seasons.
+                SystemMessage(
+                    content=(
+                        "You are a helpful assistant.\n\n"
+                        "Michael Jordan was born in 1963. He played basketball for the Chicago Bulls for 15 seasons."
+                    )
+                ),
                 UserMessage(content="Please give me information about Michael Jordan."),
             ],
             stream=False,
```