# What does this PR do?
Rename environment var for consistency
## Test Plan
No regressions
## Sources
## Before submitting
- [X] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [X] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
We are setting a default value of json for tool prompt format, which
conflicts with llama 3.2/3.3 models since they use python list. This PR
changes the defaults to None and in the code, we infer default based on
the model.
Addresses: #695
Tests:
❯ LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v
tests/client-sdk/inference/test_inference.py -k
"test_text_chat_completion"
pytest llama_stack/providers/tests/inference/test_prompt_adapter.py
When we bump up `major.minor` we want to make sure clients can
immediately detect a version change and appropriately error out. It is
not reasonable to keep checking for API-level backwards compatibility
across such version bumps. Over time, we will make the check based only
on the major version perhaps.
### Test Plan
Manually updated `__version__` in the client SDK to be "0.1.0" which is
incompatible with server's current version "0.0.63", got the following
error:
<img width="1077" alt="image"
src="https://github.com/user-attachments/assets/06ae4659-0a25-4c4c-a999-ce44678d4e6f"
/>
Without this update, the CLI worked correctly.
Add another header so client SDKs can identify their versions which can
be used for immediate detection of possible compatibility issues. A
semver mismatch against the wrong server should be immediately flagged
and requests should be denied.
Also change `X-LlamaStack-ProviderData` to `X-LlamaStack-Provider-Data`
since that hyphenation is better.
# What does this PR do?
PR #639 introduced the notion of Tools API and ability to invoke tools
through API just as any resource. This PR changes the Agents to start
using the Tools API to invoke tools. Major changes include:
1) Ability to specify tool groups with AgentConfig
2) Agent gets the corresponding tool definitions for the specified tools
and pass along to the model
3) Attachements are now named as Documents and their behavior is mostly
unchanged from user perspective
4) You can specify args that can be injected to a tool call through
Agent config. This is especially useful in case of memory tool, where
you want the tool to operate on a specific memory bank.
5) You can also register tool groups with args, which lets the agent
inject these as well into the tool call.
6) All tests have been migrated to use new tools API and fixtures
including client SDK tests
7) Telemetry just works with tools API because of our trace protocol
decorator
## Test Plan
```
pytest -s -v -k fireworks llama_stack/providers/tests/agents/test_agents.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
pytest -s -v -k together llama_stack/providers/tests/tools/test_tools.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml" pytest -v tests/client-sdk/agents/test_agents.py
```
run.yaml:
https://gist.github.com/dineshyv/0365845ad325e1c2cab755788ccc5994
Notebook:
https://colab.research.google.com/drive/1ck7hXQxRl6UvT-ijNRZ-gMZxH1G3cN2d?usp=sharing
This fixes the pre-commit check when running locally (not sure why this
was not caught on CI check):
```
> pre-commit run --show-diff-on-failure --color=always --all-files
trim trailing whitespace.................................................Passed
check python ast.........................................................Passed
check for merge conflicts................................................Passed
check for added large files..............................................Passed
fix end of files.........................................................Passed
Insert license in comments...............................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1
llama_stack/distribution/ui/page/evaluations/app_eval.py:132:65: E226 missing whitespace around arithmetic operator
llama_stack/distribution/ui/page/evaluations/native_eval.py:235:61: E226 missing whitespace around arithmetic operator
llama_stack/providers/utils/telemetry/trace_protocol.py:56:78: E226 missing whitespace around arithmetic operator
```
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
## What does this PR do?
This is a long-pending change and particularly important to get done
now.
Specifically:
- we cannot "localize" (aka download) any URLs from media attachments
anywhere near our modeling code. it must be done within llama-stack.
- `PIL.Image` is infesting all our APIs via `ImageMedia ->
InterleavedTextMedia` and that cannot be right at all. Anything in the
API surface must be "naturally serializable". We need a standard `{
type: "image", image_url: "<...>" }` which is more extensible
- `UserMessage`, `SystemMessage`, etc. are moved completely to
llama-stack from the llama-models repository.
See https://github.com/meta-llama/llama-models/pull/244 for the
corresponding PR in llama-models.
## Test Plan
```bash
cd llama_stack/providers/tests
pytest -s -v -k "fireworks or ollama or together" inference/test_vision_inference.py
pytest -s -v -k "(fireworks or ollama or together) and llama_3b" inference/test_text_inference.py
pytest -s -v -k chroma memory/test_memory.py \
--env EMBEDDING_DIMENSION=384 --env CHROMA_DB_PATH=/tmp/foobar
pytest -s -v -k fireworks agents/test_agents.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
```
Updated the client sdk (see PR ...), installed the SDK in the same
environment and then ran the SDK tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=together pytest -s -v agents/test_agents.py
LLAMA_STACK_CONFIG=ollama pytest -s -v memory/test_memory.py
# this one needed a bit of hacking in the run.yaml to ensure I could register the vision model correctly
INFERENCE_MODEL=llama3.2-vision:latest LLAMA_STACK_CONFIG=ollama pytest -s -v inference/test_inference.py
```
# What does this PR do?
Adds the sentence transformer provider and the `all-MiniLM-L6-v2`
embedding model to the default models to register in the run.yaml for
all providers.
## Test Plan
llama stack build --template together --image-type conda
llama stack run
~/.llama/distributions/llamastack-together/together-run.yaml
### Context
This is the 1st of series PRs that integrate torchtune with llama-stack
as meta reference post-training implementation. For MVP, we will focus
on single device LoRA SFT.
Though this PR is still WIP, we want to get early feedback on the high
level design of this skeleton while still working on several details
### Scope
To limit the scope of this PR, we focus on the skeleton of the
implementation.
**What are included?**
- refine the post-training SFT apis
- skeleton of supervised_fine_tune implementation. We verified that we
can call the supervised_fine_tune API successfully from llama stack
client SDK (client side PR:
https://github.com/meta-llama/llama-stack-client-python/pull/51)
- a very basic single device LoRA training recipe based on torchtune
core components
- parity check with torchtune library and post training api unit test
**What are not includes?**
- implementation of other job management, get training artifacts apis
(separate PR)
- refactor the meta reference inference logic to support eval on
finetuned model (separate PR)
- several necessary functionality in the training recipe such as
logging, validation etc (separate PR)
- interop with telemetry for tracing and metrics logging, currently
temporarily log to local disk (separate PR)
### Testing
**e2e test**
Although we haven't added detailed testing and numerical parity check
with torchtune yet, we did a simple E2E test from client to server
1. setup server with` llama stack build --template
experimental-post-training --image-type conda` and `llama stack run
experimental-post-training `
2. On client, run `llama-stack-client --endpoint
http://devgpu018.nha2.facebook.com:5000 post_training
supervised_fine_tune`
3. Training finishes successfully. On server side, get the finetune
checkpoints under output dir. On client side, get the job uuid
server
<img width="1110" alt="Screenshot 2024-12-02 at 5 52 32 PM"
src="https://github.com/user-attachments/assets/b548eb90-7a9b-4edc-a858-ee237cc4361d">
client
<img width="807" alt="Screenshot 2024-12-02 at 5 52 37 PM"
src="https://github.com/user-attachments/assets/1138ffa8-4698-40fa-b190-3d7b99646838">
**parity check**
torchtune dataloader output and llama-stack post training dataloader
output are same
<img width="1116" alt="Screenshot 2024-12-04 at 8 18 46 PM"
src="https://github.com/user-attachments/assets/5e295cdc-4c24-4ea6-82c0-ca96ef1bd6ee">
torchtune LoRA SFT and llama-stack post training LoRA SFT on alpaca
dataset with llama3.2 3B instruct model are numerical match
<img width="860" alt="Screenshot 2024-12-04 at 8 17 01 PM"
src="https://github.com/user-attachments/assets/c05cf0a8-c674-4d2e-9f0a-c5d01b2dca99">
<img width="1049" alt="Screenshot 2024-12-04 at 8 17 06 PM"
src="https://github.com/user-attachments/assets/b911d4e2-e7b1-41a9-b62c-d75529b6d443">
**unit test **
![Uploading Screenshot 2024-12-09 at 1.35.10 PM.png…]()
This PR does the following:
1) adds the ability to generate embeddings in all supported inference
providers.
2) Moves all the memory providers to use the inference API and improved
the memory tests to setup the inference stack correctly and use the
embedding models
This is a merge from #589 and #598
# What does this PR do?
This PR adds a new model type field to support embedding models to be
registered. Summary of changes:
1) Each registered model by default is an llm model.
2) User can specify an embedding model type, while registering.If
specified, the model bypass the llama model checks since embedding
models can by of any type and based on llama.
3) User needs to include the required embedding dimension in metadata.
This will be used by embedding generation to generate the requried size
of embeddings.
## Test Plan
This PR will go together will need to be merged with two follow up PRs
that will include test plans.
Library client used _server_ side types which was no bueno. The fix here
is not the completely correct fix but it is good for enough and for the
demo notebook.
This brings an interesting aspect -- we need to maintain session-level
tempdir state (!) since the model was told there was some resource at a
given location that it needs to maintain
This PR does a few things:
- it moves "direct client" to llama-stack repo instead of being in the
llama-stack-client-python repo
- renames it to `LlamaStackLibraryClient`
- actually makes synchronous generators work
- makes streaming and non-streaming work properly
In many ways, this PR makes things finally "work"
## Test Plan
See a `library_client_test.py` I added. This isn't really quite a test
yet but it demonstrates that this mode now works. Here's the invocation
and the response:
```
INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct python llama_stack/distribution/tests/library_client_test.py ollama
```

# What does this PR do?
Change the Telemetry API to be able to support different use cases like
returning traces for the UI and ability to export for Evals.
Other changes:
* Add a new trace_protocol decorator to decorate all our API methods so
that any call to them will automatically get traced across all impls.
* There is some issue with the decorator pattern of span creation when
using async generators, where there are multiple yields with in the same
context. I think its much more explicit by using the explicit context
manager pattern using with. I moved the span creations in agent instance
to be using with
* Inject session id at the turn level, which should quickly give us all
traces across turns for a given session
Addresses #509
## Test Plan
```
llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
PYTHONPATH=. python -m examples.agents.rag_with_memory_bank localhost 5000
curl -X POST 'http://localhost:5000/alpha/telemetry/query-traces' \
-H 'Content-Type: application/json' \
-d '{
"attribute_filters": [
{
"key": "session_id",
"op": "eq",
"value": "dd667b87-ca4b-4d30-9265-5a0de318fc65" }],
"limit": 100,
"offset": 0,
"order_by": ["start_time"]
}' | jq .
[
{
"trace_id": "6902f54b83b4b48be18a6f422b13e16f",
"root_span_id": "5f37b85543afc15a",
"start_time": "2024-12-04T08:08:30.501587",
"end_time": "2024-12-04T08:08:36.026463"
},
{
"trace_id": "92227dac84c0615ed741be393813fb5f",
"root_span_id": "af7c5bb46665c2c8",
"start_time": "2024-12-04T08:08:36.031170",
"end_time": "2024-12-04T08:08:41.693301"
},
{
"trace_id": "7d578a6edac62f204ab479fba82f77b6",
"root_span_id": "1d935e3362676896",
"start_time": "2024-12-04T08:08:41.695204",
"end_time": "2024-12-04T08:08:47.228016"
},
{
"trace_id": "dbd767d76991bc816f9f078907dc9ff2",
"root_span_id": "f5a7ee76683b9602",
"start_time": "2024-12-04T08:08:47.234578",
"end_time": "2024-12-04T08:08:53.189412"
}
]
curl -X POST 'http://localhost:5000/alpha/telemetry/get-span-tree' \
-H 'Content-Type: application/json' \
-d '{ "span_id" : "6cceb4b48a156913", "max_depth": 2, "attributes_to_return": ["input"] }' | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 875 100 790 100 85 18462 1986 --:--:-- --:--:-- --:--:-- 20833
{
"span_id": "6cceb4b48a156913",
"trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
"parent_span_id": "892a66d726c7f990",
"name": "retrieve_rag_context",
"start_time": "2024-12-04T09:28:21.781995",
"end_time": "2024-12-04T09:28:21.913352",
"attributes": {
"input": [
"{\"role\":\"system\",\"content\":\"You are a helpful assistant\"}",
"{\"role\":\"user\",\"content\":\"What are the top 5 topics that were explained in the documentation? Only list succinct bullet points.\",\"context\":null}"
]
},
"children": [
{
"span_id": "1a2df181854064a8",
"trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
"parent_span_id": "6cceb4b48a156913",
"name": "MemoryRouter.query_documents",
"start_time": "2024-12-04T09:28:21.787620",
"end_time": "2024-12-04T09:28:21.906512",
"attributes": {
"input": null
},
"children": [],
"status": "ok"
}
],
"status": "ok"
}
```
<img width="1677" alt="Screenshot 2024-12-04 at 9 42 56 AM"
src="https://github.com/user-attachments/assets/4d3cea93-05ce-415a-93d9-4b1628631bf8">
# What does this PR do?
1) Implement `unregister_dataset(dataset_id)` API in both llama stack
routing table and providers: It removes {dataset_id -> Dataset} mapping
from routing table and removes the dataset_id references in provider as
well (ex. for huggingface, we use a KV store to store the dataset id =>
dataset. we delete it during unregistering as well)
2) expose the datasets/unregister_dataset api endpoint
## Test Plan
**Unit test:**
`
pytest llama_stack/providers/tests/datasetio/test_datasetio.py -m
"huggingface" -v -s --tb=short --disable-warnings
`
**Test on endpoint:**
tested llama stack using an ollama distribution template:
1) start an ollama server
2) Start a llama stack server with the default ollama distribution
config + dataset/datasetsio APIs + datasetio provider
```
---- .../ollama-run.yaml
...
apis:
- agents
- inference
- memory
- safety
- telemetry
- datasetio
- datasets
providers:
datasetio:
- provider_id: localfs
provider_type: inline::localfs
config: {}
...
```
saw that the new API showed up in startup script
```
Serving API datasets
GET /alpha/datasets/get
GET /alpha/datasets/list
POST /alpha/datasets/register
POST /alpha/datasets/unregister
```
3) query `/alpha/datasets/unregister` through curl (since we have not implemented unregister api in llama stack client)
```
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets register
--dataset-id sixian --url
https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/chat.rst
--schema {}
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian │ localfs │ {} │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets register
--dataset-id sixian2 --url
https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/chat.rst
--schema {}
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian │ localfs │ {} │ dataset │
│ sixian2 │ localfs │ {} │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % curl
http://localhost:5001/alpha/datasets/unregister \
-H "Content-Type: application/json" \
-d '{"dataset_id": "sixian"}'
null%
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian2 │ localfs │ {} │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % curl
http://localhost:5001/alpha/datasets/unregister \
-H "Content-Type: application/json" \
-d '{"dataset_id": "sixian2"}'
null%
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
```
## Sources
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- Move Llama Stack Playground UI to llama-stack repo under
llama_stack/distribution
- Original PR in llama-stack-apps:
https://github.com/meta-llama/llama-stack-apps/pull/127
## Test Plan
```
cd llama-stack/llama_stack/distribution/ui
streamlit run app.py
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- braintrust scoring provider requires OPENAI_API_KEY env variable to be
set
- move this to be able to be set as request headers (e.g. like together
/ fireworks api keys)
- fixes pytest with agents dependency
## Test Plan
**E2E**
```
llama stack run
```
```yaml
scoring:
- provider_id: braintrust-0
provider_type: inline::braintrust
config: {}
```
**Client**
```python
self.client = LlamaStackClient(
base_url=os.environ.get("LLAMA_STACK_ENDPOINT", "http://localhost:5000"),
provider_data={
"openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
},
)
```
- run `llama-stack-client eval run_scoring`
**Unit Test**
```
pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py
```
```
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py --env OPENAI_API_KEY=$OPENAI_API_KEY
```
<img width="745" alt="image"
src="https://github.com/user-attachments/assets/68f5cdda-f6c8-496d-8b4f-1b3dabeca9c2">
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.