Library client used _server_ side types which was no bueno. The fix here
is not the completely correct fix but it is good for enough and for the
demo notebook.
This brings an interesting aspect -- we need to maintain session-level
tempdir state (!) since the model was told there was some resource at a
given location that it needs to maintain
This PR does a few things:
- it moves "direct client" to llama-stack repo instead of being in the
llama-stack-client-python repo
- renames it to `LlamaStackLibraryClient`
- actually makes synchronous generators work
- makes streaming and non-streaming work properly
In many ways, this PR makes things finally "work"
## Test Plan
See a `library_client_test.py` I added. This isn't really quite a test
yet but it demonstrates that this mode now works. Here's the invocation
and the response:
```
INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct python llama_stack/distribution/tests/library_client_test.py ollama
```

# What does this PR do?
Change the Telemetry API to be able to support different use cases like
returning traces for the UI and ability to export for Evals.
Other changes:
* Add a new trace_protocol decorator to decorate all our API methods so
that any call to them will automatically get traced across all impls.
* There is some issue with the decorator pattern of span creation when
using async generators, where there are multiple yields with in the same
context. I think its much more explicit by using the explicit context
manager pattern using with. I moved the span creations in agent instance
to be using with
* Inject session id at the turn level, which should quickly give us all
traces across turns for a given session
Addresses #509
## Test Plan
```
llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
PYTHONPATH=. python -m examples.agents.rag_with_memory_bank localhost 5000
curl -X POST 'http://localhost:5000/alpha/telemetry/query-traces' \
-H 'Content-Type: application/json' \
-d '{
"attribute_filters": [
{
"key": "session_id",
"op": "eq",
"value": "dd667b87-ca4b-4d30-9265-5a0de318fc65" }],
"limit": 100,
"offset": 0,
"order_by": ["start_time"]
}' | jq .
[
{
"trace_id": "6902f54b83b4b48be18a6f422b13e16f",
"root_span_id": "5f37b85543afc15a",
"start_time": "2024-12-04T08:08:30.501587",
"end_time": "2024-12-04T08:08:36.026463"
},
{
"trace_id": "92227dac84c0615ed741be393813fb5f",
"root_span_id": "af7c5bb46665c2c8",
"start_time": "2024-12-04T08:08:36.031170",
"end_time": "2024-12-04T08:08:41.693301"
},
{
"trace_id": "7d578a6edac62f204ab479fba82f77b6",
"root_span_id": "1d935e3362676896",
"start_time": "2024-12-04T08:08:41.695204",
"end_time": "2024-12-04T08:08:47.228016"
},
{
"trace_id": "dbd767d76991bc816f9f078907dc9ff2",
"root_span_id": "f5a7ee76683b9602",
"start_time": "2024-12-04T08:08:47.234578",
"end_time": "2024-12-04T08:08:53.189412"
}
]
curl -X POST 'http://localhost:5000/alpha/telemetry/get-span-tree' \
-H 'Content-Type: application/json' \
-d '{ "span_id" : "6cceb4b48a156913", "max_depth": 2, "attributes_to_return": ["input"] }' | jq .
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 875 100 790 100 85 18462 1986 --:--:-- --:--:-- --:--:-- 20833
{
"span_id": "6cceb4b48a156913",
"trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
"parent_span_id": "892a66d726c7f990",
"name": "retrieve_rag_context",
"start_time": "2024-12-04T09:28:21.781995",
"end_time": "2024-12-04T09:28:21.913352",
"attributes": {
"input": [
"{\"role\":\"system\",\"content\":\"You are a helpful assistant\"}",
"{\"role\":\"user\",\"content\":\"What are the top 5 topics that were explained in the documentation? Only list succinct bullet points.\",\"context\":null}"
]
},
"children": [
{
"span_id": "1a2df181854064a8",
"trace_id": "dafa796f6aaf925f511c04cd7c67fdda",
"parent_span_id": "6cceb4b48a156913",
"name": "MemoryRouter.query_documents",
"start_time": "2024-12-04T09:28:21.787620",
"end_time": "2024-12-04T09:28:21.906512",
"attributes": {
"input": null
},
"children": [],
"status": "ok"
}
],
"status": "ok"
}
```
<img width="1677" alt="Screenshot 2024-12-04 at 9 42 56 AM"
src="https://github.com/user-attachments/assets/4d3cea93-05ce-415a-93d9-4b1628631bf8">
# What does this PR do?
1) Implement `unregister_dataset(dataset_id)` API in both llama stack
routing table and providers: It removes {dataset_id -> Dataset} mapping
from routing table and removes the dataset_id references in provider as
well (ex. for huggingface, we use a KV store to store the dataset id =>
dataset. we delete it during unregistering as well)
2) expose the datasets/unregister_dataset api endpoint
## Test Plan
**Unit test:**
`
pytest llama_stack/providers/tests/datasetio/test_datasetio.py -m
"huggingface" -v -s --tb=short --disable-warnings
`
**Test on endpoint:**
tested llama stack using an ollama distribution template:
1) start an ollama server
2) Start a llama stack server with the default ollama distribution
config + dataset/datasetsio APIs + datasetio provider
```
---- .../ollama-run.yaml
...
apis:
- agents
- inference
- memory
- safety
- telemetry
- datasetio
- datasets
providers:
datasetio:
- provider_id: localfs
provider_type: inline::localfs
config: {}
...
```
saw that the new API showed up in startup script
```
Serving API datasets
GET /alpha/datasets/get
GET /alpha/datasets/list
POST /alpha/datasets/register
POST /alpha/datasets/unregister
```
3) query `/alpha/datasets/unregister` through curl (since we have not implemented unregister api in llama stack client)
```
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets register
--dataset-id sixian --url
https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/chat.rst
--schema {}
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian │ localfs │ {} │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets register
--dataset-id sixian2 --url
https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/chat.rst
--schema {}
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian │ localfs │ {} │ dataset │
│ sixian2 │ localfs │ {} │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % curl
http://localhost:5001/alpha/datasets/unregister \
-H "Content-Type: application/json" \
-d '{"dataset_id": "sixian"}'
null%
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ metadata ┃ type ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━┩
│ sixian2 │ localfs │ {} │ dataset │
└────────────┴─────────────┴──────────┴─────────┘
(base) sxyi@sxyi-mbp llama-stack % curl
http://localhost:5001/alpha/datasets/unregister \
-H "Content-Type: application/json" \
-d '{"dataset_id": "sixian2"}'
null%
(base) sxyi@sxyi-mbp llama-stack % llama-stack-client datasets list
```
## Sources
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- Move Llama Stack Playground UI to llama-stack repo under
llama_stack/distribution
- Original PR in llama-stack-apps:
https://github.com/meta-llama/llama-stack-apps/pull/127
## Test Plan
```
cd llama-stack/llama_stack/distribution/ui
streamlit run app.py
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- braintrust scoring provider requires OPENAI_API_KEY env variable to be
set
- move this to be able to be set as request headers (e.g. like together
/ fireworks api keys)
- fixes pytest with agents dependency
## Test Plan
**E2E**
```
llama stack run
```
```yaml
scoring:
- provider_id: braintrust-0
provider_type: inline::braintrust
config: {}
```
**Client**
```python
self.client = LlamaStackClient(
base_url=os.environ.get("LLAMA_STACK_ENDPOINT", "http://localhost:5000"),
provider_data={
"openai_api_key": os.environ.get("OPENAI_API_KEY", ""),
},
)
```
- run `llama-stack-client eval run_scoring`
**Unit Test**
```
pytest -v -s -m meta_reference_eval_together_inference eval/test_eval.py
```
```
pytest -v -s -m braintrust_scoring_together_inference scoring/test_scoring.py --env OPENAI_API_KEY=$OPENAI_API_KEY
```
<img width="745" alt="image"
src="https://github.com/user-attachments/assets/68f5cdda-f6c8-496d-8b4f-1b3dabeca9c2">
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
This PR fixes some of the issues with our telemetry setup to enable logs
to be delivered to opentelemetry and jaeger. Main fixes
1) Updates the open telemetry provider to use the latest oltp exports
instead of deprected ones.
2) Adds a tracing middleware, which injects traces into each HTTP
request that the server recieves and this is going to be the root trace.
Previously, we did this in the create_dynamic_route method, which is
actually not the actual exectuion flow, but more of a config and this
causes the traces to end prematurely. Through middleware, we plugin the
trace start and end at the right location.
3) We manage our own methods to create traces and spans and this does
not fit well with Opentelemetry SDK since it does not support provide a
way to take in traces and spans that are already created. it expects us
to use the SDK to create them. For now, I have a hacky approach of just
maintaining a map from our internal telemetry objects to the open
telemetry specfic ones. This is not the ideal solution. I will explore
other ways to get around this issue. for now, to have something that
works, i am going to keep this as is.
Addresses: #509
# What does this PR do?
This PR moves all print statements to use logging. Things changed:
- Had to add `await start_trace("sse_generator")` to server.py to
actually get tracing working. else was not seeing any logs
- If no telemetry provider is provided in the run.yaml, we will write to
stdout
- by default, the logs are going to be in JSON, but we expose an option
to configure to output in a human readable way.
When running with dockers, the idea is that users be able to work purely
with the `llama stack` CLI. They should not need to know about the
existence of any YAMLs unless they need to. This PR enables it.
The docker command now doesn't need to volume mount a yaml and can
simply be:
```bash
docker run -v ~/.llama/:/root/.llama \
--env A=a --env B=b
```
## Test Plan
Check with conda first (no regressions):
```bash
LLAMA_STACK_DIR=. llama stack build --template ollama
llama stack run ollama --port 5001
# server starts up correctly
```
Check with docker
```bash
# build the docker
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type docker
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
docker run -it -p 5001:5001 \
-v ~/.llama:/root/.llama \
-v $PWD:/app/llama-stack-source \
localhost/distribution-ollama:dev \
--port 5001 \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
```
Note that volume mounting to `/app/llama-stack-source` is only needed
because we built the docker with uncommitted source code.
# What does this PR do?
Remove a check which skips provider registration if a resource is
already in stack registry. Since we do not reconcile state with
provider, register should always call into provider's register endpoint.
## Test Plan
```
# stack run
╰─❯ llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml
#register memory bank
❯ llama-stack-client memory_banks register your_memory_bank_name --type vector --provider-id inline::faiss-0
Memory Bank Configuration:
{
│ 'memory_bank_type': 'vector',
│ 'chunk_size_in_tokens': 512,
│ 'embedding_model': 'all-MiniLM-L6-v2',
│ 'overlap_size_in_tokens': 64
}
#register again
❯ llama-stack-client memory_banks register your_memory_bank_name --type vector --provider-id inline::faiss-0
Memory Bank Configuration:
{
│ 'memory_bank_type': 'vector',
│ 'chunk_size_in_tokens': 512,
│ 'embedding_model': 'all-MiniLM-L6-v2',
│ 'overlap_size_in_tokens': 64
}
```
# What does this PR do?
Adds a `/alpha/` prefix to all the REST API urls.
Also makes them all use hyphens instead of underscores as is more
standard practice.
(This is based on feedback from our partners.)
## Test Plan
The Stack itself does not need updating. However, client SDKs and
documentation will need to be updated.
This PR adds a method in stack to return the stackrunconfig object based
on the template name. This will be used to instantiate a direct client
without the need for an explicit run.yaml
---------
Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
# What does this PR do?
Automatically generates
- build.yaml
- run.yaml
- run-with-safety.yaml
- parts of markdown docs
for the distributions.
## Test Plan
At this point, this only updates the YAMLs and the docs. Some testing
(especially with ollama and vllm) has been performed but needs to be
much more tested.
The semantics of an Update on resources is very tricky to reason about
especially for memory banks and models. The best way to go forward here
is for the user to unregister and register a new resource. We don't have
a compelling reason to support update APIs.
Tests:
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m
"chroma" --env CHROMA_HOST=localhost --env CHROMA_PORT=8000
pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m
"pgvector" --env PGVECTOR_DB=postgres --env PGVECTOR_USER=postgres --env
PGVECTOR_PASSWORD=mysecretpassword --env PGVECTOR_HOST=0.0.0.0
$CONDA_PREFIX/bin/pytest -v -s -m "ollama"
llama_stack/providers/tests/inference/test_model_registration.py
---------
Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
We are calling the initialize function on the registery in the common
routing table impl, which is incorrect as the common routing table is
the base class inherited by each resource's routing table. this change
moves remove that and add the initialize to the creation, where it inits
once server run.
Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
This PR makes the following changes:
1) Fixes the get_all and initialize impl to actually read the values
returned from the range call to kvstore and not keys.
2) The start_key and end_key are fixed to correct perform the range
query after the key format changes
3) Made the cache registry thread safe since there are multiple
initializes called for each routing table.
Tests:
* Start stack
* Register dataset
* Kill stack
* Bring stack up
* dataset list
```
llama-stack-client datasets list
+--------------+---------------+---------------------------------------------------------------------------------+---------+
| identifier | provider_id | metadata | type |
+==============+===============+=================================================================================+=========+
| alpaca | huggingface-0 | {} | dataset |
+--------------+---------------+---------------------------------------------------------------------------------+---------+
| mmlu | huggingface-0 | {'path': 'llama-stack/evals', 'name': 'evals__mmlu__details', 'split': 'train'} | dataset |
+--------------+---------------+---------------------------------------------------------------------------------+---------+
```
Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>
# What does this PR do?
We'd like our docker steps to require _ZERO EDITS_ to a YAML file in
order to get going. This is often not possible because depending on the
provider, we do need some configuration input from the user. Environment
variables are the best way to obtain this information.
This PR allows our run.yaml to contain `${env.FOO_BAR}` placeholders
which can be replaced using `docker run -e FOO_BAR=baz` (and similar
`docker compose` equivalent).
## Test Plan
For remote-vllm, example `run.yaml` snippet looks like this:
```yaml
providers:
inference:
# serves main inference model
- provider_id: vllm-0
provider_type: remote::vllm
config:
# NOTE: replace with "localhost" if you are running in "host" network mode
url: ${env.LLAMA_INFERENCE_VLLM_URL:http://host.docker.internal:5100/v1}
max_tokens: ${env.MAX_TOKENS:4096}
api_token: fake
# serves safety llama_guard model
- provider_id: vllm-1
provider_type: remote::vllm
config:
# NOTE: replace with "localhost" if you are running in "host" network mode
url: ${env.LLAMA_SAFETY_VLLM_URL:http://host.docker.internal:5101/v1}
max_tokens: ${env.MAX_TOKENS:4096}
api_token: fake
```
`compose.yaml` snippet looks like this:
```yaml
llamastack:
depends_on:
- vllm-0
- vllm-1
# image: llamastack/distribution-remote-vllm
image: llamastack/distribution-remote-vllm:test-0.0.52rc3
volumes:
- ~/.llama:/root/.llama
- ~/local/llama-stack/distributions/remote-vllm/run.yaml:/root/llamastack-run-remote-vllm.yaml
# network_mode: "host"
environment:
- LLAMA_INFERENCE_VLLM_URL=${LLAMA_INFERENCE_VLLM_URL:-http://host.docker.internal:5100/v1}
- LLAMA_INFERENCE_MODEL=${LLAMA_INFERENCE_MODEL:-Llama3.1-8B-Instruct}
- MAX_TOKENS=${MAX_TOKENS:-4096}
- SQLITE_STORE_DIR=${SQLITE_STORE_DIR:-$HOME/.llama/distributions/remote-vllm}
- LLAMA_SAFETY_VLLM_URL=${LLAMA_SAFETY_VLLM_URL:-http://host.docker.internal:5101/v1}
- LLAMA_SAFETY_MODEL=${LLAMA_SAFETY_MODEL:-Llama-Guard-3-1B}
```
# What does this PR do?
- API updates: change schema to dataset_schema for register_dataset for
resolving pydantic naming conflict
- Note: this OpenAPI update will be synced with
llama-stack-client-python SDK.
cc @dineshyv
## Test Plan
```
pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio eval/test_eval.py
```
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.