llama-stack-mirror/llama_stack
ehhuang caffafd101
feat: update the default system prompt for 3.2/3.3 models (#1310)
# Summary:
The current prompt doesn't work well and tends to over-index on tool
calling. This PR is not perfect, but it should be an improvement over the
current prompt. We can keep iterating.

# Test Plan:

Ran on a (small) eval with 20 HotpotQA examples.

With current prompt:
https://gist.github.com/ehhuang/9f967e62751907165eb13781ea968f5c
{
    'basic::equality': {'accuracy': {'accuracy': 0.2, 'num_correct': 4.0, 'num_total': 20}},
    'F1ScoringFn': {
        'f1_average': 0.25333333333333335,
        'precision_average': 0.23301767676767676,
        'recall_average': 0.375
    }
}


num_tool_calls=[5, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 2, 2, 1, 1, 2, 1, 2, 2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0


#########################################################
With new prompt:
https://gist.github.com/ehhuang/6e4a8ecf54db68922c2be8700056f962

{
    'basic::equality': {'accuracy': {'accuracy': 0.25, 'num_correct': 5.0, 'num_total': 20}},
    'F1ScoringFn': {
        'f1_average': 0.35579260478321006,
        'precision_average': 0.32030238933180105,
        'recall_average': 0.6091666666666666
    }
}


num_tool_calls=[2, 1, 1, 5, 5, 5, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0


The answers have higher recall and use fewer tool calls. Note that
these were run with max_infer_iter=5, so the current prompt hits this
limit more often; without the limit, it sometimes goes into an infinite
tool-calling loop.
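
To make the tool-call comparison concrete, here is a minimal sketch that summarizes the two num_tool_calls lists reported above. The helper name `summarize` is hypothetical, and treating a count equal to max_infer_iter as "hit the limit" is an assumption about how the runs were capped, not something the eval output states.

```python
from statistics import mean

MAX_INFER_ITER = 5  # limit used for the runs reported above

# Per-example tool-call counts copied from the test plan.
current_prompt = [5, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 2, 2, 1, 1, 2, 1, 2, 2]
new_prompt = [2, 1, 1, 5, 5, 5, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 2]


def summarize(counts, limit=MAX_INFER_ITER):
    """Return (mean tool calls per example, number of examples at the limit).

    Assumption: a count equal to `limit` means the run was cut off there.
    """
    return mean(counts), sum(1 for c in counts if c >= limit)


for name, counts in [("current", current_prompt), ("new", new_prompt)]:
    avg, at_limit = summarize(counts)
    print(f"{name}: mean={avg:.2f}, at_limit={at_limit}/{len(counts)}")
```

On these 20 examples, the mean drops from 3.5 to 1.95 tool calls per example, and the number of examples reaching the limit drops from 11 to 3.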

The data here is with 3.3-70B. With 3.2-3B, results are equally poor
with either prompt (~30 recall).
2025-02-27 23:05:42 -08:00
apis ci: add mypy for static type checking (#1101) 2025-02-21 13:15:40 -08:00
cli fix: Incorrect import path for print_subcommand_description() (#1315) 2025-02-27 18:50:41 -08:00
distribution fix: ensure ollama embedding model is registered properly in the template 2025-02-27 22:49:06 -08:00
models/llama feat: update the default system prompt for 3.2/3.3 models (#1310) 2025-02-27 23:05:42 -08:00
providers fix: [Litellm]Do not swallow first token (#1316) 2025-02-27 20:53:47 -08:00
scripts ci: add mypy for static type checking (#1101) 2025-02-21 13:15:40 -08:00
strong_typing Ensure that deprecations for fields follow through to OpenAPI 2025-02-19 13:54:04 -08:00
templates fix: ensure ollama embedding model is registered properly in the template 2025-02-27 22:49:06 -08:00
__init__.py export LibraryClient 2024-12-13 12:08:00 -08:00
schema_utils.py ci: add mypy for static type checking (#1101) 2025-02-21 13:15:40 -08:00