feat: update the default system prompt for 3.2/3.3 models (#1310)

# Summary:
The current prompt doesn't work well and tend to overindex on tool
calling. This PR is not perfect, but should be an improvement over the
current prompt. We can keep iterating.

# Test Plan:

Ran on a (small) eval with 20 HotpotQA examples.

With current prompt:
https://gist.github.com/ehhuang/9f967e62751907165eb13781ea968f5c
{
│ 'basic::equality': {'accuracy': {'accuracy': 0.2, 'num_correct': 4.0,
'num_total': 20}},
│   'F1ScoringFn': {
│   │   'f1_average': 0.25333333333333335,
│   │   'precision_average': 0.23301767676767676,
│   │   'recall_average': 0.375
│   }
}


num_tool_calls=[5, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 2, 2, 1, 1, 2, 1, 2,
2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0


#########################################################
With new prompt:
https://gist.github.com/ehhuang/6e4a8ecf54db68922c2be8700056f962

{
│ 'basic::equality': {'accuracy': {'accuracy': 0.25, 'num_correct': 5.0,
'num_total': 20}},
│   'F1ScoringFn': {
│   │   'f1_average': 0.35579260478321006,
│   │   'precision_average': 0.32030238933180105,
│   │   'recall_average': 0.6091666666666666
│   }
}


num_tool_calls=[2, 1, 1, 5, 5, 5, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3,
2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0


The answers have higher recall, and make fewer tool calls. Note that
these were run with max_infer_iter=5, so the current prompt hits this
limit more often, and without the limit, someitmes goes into infinite
tool calling loop.

The data here is with 3.3-70B. Results are equally poor with either
prompt with 3.2-3B ~30 recall.
This commit is contained in:
ehhuang 2025-02-27 23:05:42 -08:00 committed by GitHub
parent ece354eedd
commit caffafd101
No known key found for this signature in database
GPG key ID: B5690EEEBB952194

View file

@ -226,10 +226,9 @@ class FunctionTagCustomToolGenerator(PromptTemplateGeneratorBase):
class PythonListCustomToolGenerator(PromptTemplateGeneratorBase): # noqa: N801
DEFAULT_PROMPT = textwrap.dedent(
"""
You are a helpful assistant. You have access to functions, but you should only use them if they are required.
You are an expert in composing functions. You are given a question and a set of possible functions.
Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.
Based on the question, you may or may not need to make one function/tool call to achieve the purpose.
{{ function_description }}
""".strip("\n")