# Summary:
The current prompt doesn't work well and tends to over-index on tool
calling. This PR is not perfect, but it should be an improvement over the
current prompt. We can keep iterating.
# Test Plan:
Ran on a (small) eval with 20 HotpotQA examples.
With current prompt:
https://gist.github.com/ehhuang/9f967e62751907165eb13781ea968f5c
{
│ 'basic::equality': {'accuracy': {'accuracy': 0.2, 'num_correct': 4.0, 'num_total': 20}},
│ 'F1ScoringFn': {
│ │ 'f1_average': 0.25333333333333335,
│ │ 'precision_average': 0.23301767676767676,
│ │ 'recall_average': 0.375
│ }
}
num_tool_calls=[5, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 2, 2, 1, 1, 2, 1, 2, 2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0
#########################################################
With new prompt:
https://gist.github.com/ehhuang/6e4a8ecf54db68922c2be8700056f962
{
│ 'basic::equality': {'accuracy': {'accuracy': 0.25, 'num_correct': 5.0, 'num_total': 20}},
│ 'F1ScoringFn': {
│ │ 'f1_average': 0.35579260478321006,
│ │ 'precision_average': 0.32030238933180105,
│ │ 'recall_average': 0.6091666666666666
│ }
}
num_tool_calls=[2, 1, 1, 5, 5, 5, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0
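For context on the F1ScoringFn numbers above: HotpotQA-style answer scoring is typically a token-level F1 between the predicted and gold answers. A minimal sketch of that metric (not necessarily the exact F1ScoringFn implementation) is:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> tuple[float, float, float]:
    """Token-overlap precision/recall/F1, as commonly used for HotpotQA answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Count tokens shared between prediction and gold answer (min of per-token counts).
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```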
With the new prompt, the answers have higher recall and fewer tool calls are made. Note that
these runs used max_infer_iter=5, so the current prompt hits this
limit more often and, without the limit, sometimes goes into an infinite
tool-calling loop.
The data above is with 3.3-70B. Results are equally poor with either
prompt on 3.2-3B (~0.3 recall).
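For illustration, here is a minimal sketch of the iteration cap described above; `run_inference` and `execute_tool` are hypothetical stand-ins, not llama-stack APIs:

```python
from typing import Callable

def agent_turn(
    messages: list[dict],
    run_inference: Callable[[list[dict]], dict],  # stand-in: returns {"content": ..., "tool_call": ... or None}
    execute_tool: Callable[[dict], dict],          # stand-in: runs the requested tool, returns a tool message
    max_infer_iter: int = 5,
) -> dict:
    """Cap tool-calling rounds at max_infer_iter so a tool-happy prompt can't loop forever."""
    num_tool_calls = 0
    for _ in range(max_infer_iter):
        response = run_inference(messages)
        if not response.get("tool_call"):
            # Model produced a final answer instead of another tool call.
            return {"answer": response["content"], "num_tool_calls": num_tool_calls}
        num_tool_calls += 1
        messages = messages + [response, execute_tool(response["tool_call"])]
    # Iteration cap reached: stop rather than keep calling tools indefinitely.
    return {"answer": None, "num_tool_calls": num_tool_calls}
```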
llama-models should carry extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the Llama
models and document the prompt formats, etc.
This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279
## Test Plan
Ensured all `llama` CLI `model` sub-commands work:
```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```
Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```
Created a fresh venv with `uv venv && source .venv/bin/activate`, then ran
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv` (the server runs).
Also checked that the OpenAPI generator runs and that there is no change
in the generated files as a result:
```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```