# Summary:
The current prompt doesn't work well and tends to over-index on tool
calling. This PR is not perfect, but it should be an improvement over the
current prompt. We can keep iterating.
# Test Plan:
Ran on a (small) eval with 20 HotpotQA examples.
With current prompt:
https://gist.github.com/ehhuang/9f967e62751907165eb13781ea968f5c
{
│ 'basic::equality': {'accuracy': {'accuracy': 0.2, 'num_correct': 4.0, 'num_total': 20}},
│ 'F1ScoringFn': {
│ │ 'f1_average': 0.25333333333333335,
│ │ 'precision_average': 0.23301767676767676,
│ │ 'recall_average': 0.375
│ }
}
num_tool_calls=[5, 5, 5, 5, 5, 5, 2, 5, 5, 5, 5, 5, 2, 2, 1, 1, 2, 1, 2, 2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0
#########################################################
With new prompt:
https://gist.github.com/ehhuang/6e4a8ecf54db68922c2be8700056f962
{
│ 'basic::equality': {'accuracy': {'accuracy': 0.25, 'num_correct': 5.0, 'num_total': 20}},
│ 'F1ScoringFn': {
│ │ 'f1_average': 0.35579260478321006,
│ │ 'precision_average': 0.32030238933180105,
│ │ 'recall_average': 0.6091666666666666
│ }
}
num_tool_calls=[2, 1, 1, 5, 5, 5, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 3, 2]
num_examples_with_tool_call=20
num_examples_with_pythontag=0
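For context on the F1ScoringFn numbers above: HotpotQA-style answer scoring is typically a token-level F1 between the predicted and gold answers. A minimal sketch of that metric (not necessarily the exact F1ScoringFn implementation) is:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> tuple[float, float, float]:
    """Token-overlap precision/recall/F1, as commonly used for HotpotQA answers."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Count tokens shared between prediction and gold answer (min of per-token counts).
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0, 0.0, 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```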
With the new prompt, the answers have higher recall and fewer tool calls are made. Note that
these runs used max_infer_iter=5, so the current prompt hits this
limit more often and, without the limit, sometimes goes into an infinite
tool-calling loop.
The data above is with 3.3-70B. Results are equally poor with either
prompt on 3.2-3B (~0.3 recall).
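For illustration, here is a minimal sketch of the iteration cap described above; `run_inference` and `execute_tool` are hypothetical stand-ins, not llama-stack APIs:

```python
from typing import Callable

def agent_turn(
    messages: list[dict],
    run_inference: Callable[[list[dict]], dict],  # stand-in: returns {"content": ..., "tool_call": ... or None}
    execute_tool: Callable[[dict], dict],          # stand-in: runs the requested tool, returns a tool message
    max_infer_iter: int = 5,
) -> dict:
    """Cap tool-calling rounds at max_infer_iter so a tool-happy prompt can't loop forever."""
    num_tool_calls = 0
    for _ in range(max_infer_iter):
        response = run_inference(messages)
        if not response.get("tool_call"):
            # Model produced a final answer instead of another tool call.
            return {"answer": response["content"], "num_tool_calls": num_tool_calls}
        num_tool_calls += 1
        messages = messages + [response, execute_tool(response["tool_call"])]
    # Iteration cap reached: stop rather than keep calling tools indefinitely.
    return {"answer": None, "num_tool_calls": num_tool_calls}
```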
llama-models should carry extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the Llama
models and document the prompt formats, etc.
This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279
## Test Plan
Ensured all `llama` CLI `model` sub-commands work:
```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```
Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```
Created a fresh venv with `uv venv && source .venv/bin/activate`, then ran
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv` (the server runs).
Also checked that the OpenAPI generator runs and that there is no change
in the generated files as a result:
```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```