# What does this PR do?
The goal is to promote the minimal set of dependencies the project needs
to run, this includes:
* dependencies needed to work with the CLI
* dependencies needed for the server to run with no providers
This also:
* Relocate redundant dependencies out of the core project and into the
individual providers that actually require them.
* Include all necessary server dependencies so the project can run
standalone, even without any providers.
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Build and run distro a server.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
CI tests have been failing with
.venv/lib/python3.12/site-packages/peft/auto.py:21: in <module>
from transformers import (
.venv/lib/python3.12/site-packages/transformers/__init__.py:27: in
<module>
from . import dependency_versions_check
.venv/lib/python3.12/site-packages/transformers/dependency_versions_check.py:57:
in <module>
require_version_core(deps[pkg])
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:117:
in require_version_core
return require_version(requirement, hint)
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:111:
in require_version
_compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
.venv/lib/python3.12/site-packages/transformers/utils/versions.py:44: in
_compare_versions
raise ImportError(
E ImportError: huggingface-hub>=0.30.0,<1.0 is required for a normal
functioning of this module, but found huggingface-hub==0.29.0.
E Try: `pip install transformers -U` or `pip install -e '.[dev]'` if
you're working with git main
------------------------------ Captured log setup
------------------------------
INFO llama_stack.providers.remote.inference.ollama.ollama:ollama.py:106
checking connectivity to Ollama at `http://0.0.0.0:11434`.../
=========================== short test summary info
============================
ERROR
tests/integration/providers/test_providers.py::TestProviders::test_providers
- ImportError: huggingface-hub>=0.30.0,<1.0 is required for a normal
functioning of this module, but found huggingface-hub==0.29.0.
Try: `pip install transformers -U` or `pip install -e '.[dev]'` if
you're working with git main
=================== 1 skipped, 4 warnings, 1 error in 9.52s
====================
## Test Plan
CI
# What does this PR do?
dropped python3.10, updated pyproject and dependencies, and also removed
some blocks of code with special handling for enum.StrEnum
Closes#2458
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Adds try-catch to faiss `query_vector` function for when the distance
between the query embedding and an embedding within the vector db is 0
(identical vectors). Catches `ZeroDivisionError` and then appends `(1.0
/ sys.float_info.min)` to `scores` to represent maximum similarity.
<!-- If resolving an issue, uncomment and update the line below -->
Closes [#2381]
## Test Plan
Checkout this PR
Execute this code and there will no longer be a `ZeroDivisionError`
exception
```
from llama_stack_client import LlamaStackClient
base_url = "http://localhost:8321"
client = LlamaStackClient(base_url=base_url)
models = client.models.list()
embedding_model = (
em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = 384
_ = client.vector_dbs.register(
vector_db_id="foo_db",
embedding_model=embedding_model,
embedding_dimension=embedding_dimension,
provider_id="faiss",
)
chunk = {
"content": "foo",
"mime_type": "text/plain",
"metadata": {
"document_id": "foo-id"
}
}
client.vector_io.insert(vector_db_id="foo_db", chunks=[chunk])
client.vector_io.query(vector_db_id="foo_db", query="foo")
```
### Running unit tests
`uv run pytest tests/unit/rag/test_rag_query.py -v`
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
trying to run `llama` cli after installing wheel fails with this error
```
Traceback (most recent call last):
File "/tmp/tmp.wdZath9U6j/.venv/bin/llama", line 4, in <module>
from llama_stack.cli.llama import main
File "/tmp/tmp.wdZath9U6j/.venv/lib/python3.10/site-packages/llama_stack/__init__.py", line 7, in <module>
from llama_stack.distribution.library_client import ( # noqa: F401
ModuleNotFoundError: No module named 'llama_stack.distribution.library_client'
```
This PR fixes it by ensurring that all sub-directories of `llama_stack`
are also included.
Also, fixes the missing `fastapi` dependency issue.
# What does this PR do?
TSIA
Added Files provider to the fireworks template. Might want to add to all
templates as a follow-up.
## Test Plan
llama-stack pytest tests/unit/files/test_files.py
llama-stack llama stack build --template fireworks --image-type conda
--run
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s -v
tests/integration/files/
# What does this PR do?
Use a more common pattern and known terminology from the ecosystem,
where Route is more approved than Endpoint.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The previous `[project.optional-dependencies]` was misrepresenting what
the packages were. They were NOT optional dependencies to the project
but development dependencies. Unlike optional dependencies, development
dependencies are local-only and will not be included in the project
requirements when published to PyPI or other indexes. As such,
development dependencies are not included in the [project] table.
Additionally, the dev group is synced by default.
Source:
https://docs.astral.sh/uv/concepts/projects/dependencies/#development-dependencies
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This is not a core dependency of the distro server. It's only necessary
when using `inline::rag-runtime` or `inline::meta-reference` providers.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This fixes a high vulnerable CVE in `setuptools`:
https://github.com/advisories/GHSA-5rjg-fvgr-3xxf
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.
## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>
Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
# What does this PR do?
* remove requirements.txt to use pyproject.toml as the source of truth
* update relevant docs
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Kubernetes since 1.20 exposes a JWKS endpoint that we can use with our
recent oauth2 recent implementation.
The CI test has been kept intact for validation.
Signed-off-by: Sébastien Han <seb@redhat.com>
This PR adds a notion of `principal` (aka some kind of persistent
identity) to the authentication infrastructure of the Stack. Until now
we only used access attributes ("claims" in the more standard OAuth /
OIDC setup) but we need the notion of a User fundamentally as well.
(Thanks @rhuss for bringing this up.)
This value is not yet _used_ anywhere downstream but will be used to
segregate access to resources.
In addition, the PR introduces a built-in JWT token validator so the
Stack does not need to contact an authentication provider to validating
the authorization and merely check the signed token for the represented
claims. Public keys are refreshed via the configured JWKS server. This
Auth Provider should overwhelmingly be considered the default given the
seamless integration it offers with OAuth setups.
# What does this PR do?
adds an inline HF SFTTrainer provider. Alongside touchtune -- this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc
the provider comes with one recipe `finetune_single_device` which works
both with and without LoRA.
any model that is a valid HF identifier can be given and the model will
be pulled.
this has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box
The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training
if checkpoint_dir is none, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.
## Test Plan
re-enabled post_training integration test suite with a singular test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API
runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.
[//]: # (## Documentation)
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
**Fixes** #1959
HuggingFace provides several loading paths that the datasets library can
use. My theory on why the test would previously fail intermittently is
because when calling `load_dataset(...)`, it may be trying several
options such as local cache, Hugging Face Hub, or a dataset script, or
other. There's one of these options that seem to work inconsistently in
the CI.
The HuggingFace datasets library relies on the `transformers` package to
load certain datasets such as `llamastack/simpleqa`, and by adding the
package, we can see the dataset is loaded consistently via the Hugging
Face Hub.
Please see PR in my fork demonstrating over 7 consecutive passes:
https://github.com/ChristianZaccaria/llama-stack/pull/1
**Some References:**
- https://github.com/huggingface/transformers/issues/8690
- https://huggingface.co/docs/datasets/en/loading
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
# What does this PR do?
This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:
- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage
The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints
## Test Plan
Setup a Kube cluster using Kind or Minikube.
Run a server with:
```
server:
port: 8321
auth:
provider_type: kubernetes
config:
api_server_url: http://url
ca_cert_path: path/to/cert (optional)
```
Run:
```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```
Or replace "my-user" with your service account.
Signed-off-by: Sébastien Han <seb@redhat.com>
This resolves a new critical severity on h11. See
https://access.redhat.com/security/cve/cve-2025-43859. We should
consider releasing a new patch with this fix.
This was updated via:
```
uv add "h11>=0.16.0"
uv export --frozen --no-hashes --no-emit-project --output-file=requirements.txt
```
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
Previously, when a streaming client would disconnect before we were
finished streaming the entire response, an error like the below would
get raised from the `sse_generator` function in
`llama_stack/distribution/server/server.py`:
```
AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'?
```
This was because we were calling `aclose` on a coroutine instead of the
awaited value from that coroutine. This change fixes that, so that we
save off the awaited value and then can call `aclose` on it if we
encounter an `asyncio.CancelledError`, like we see when a client
disconnects before we're finished streaming.
The other changes in here are to add a simple set of tests for the happy
path of our SSE streaming and this client disconnect path.
That unfortunately requires adding one more dependency into our unit
test section of pyproject.toml since `server.py` requires loading some
of the telemetry code for me to test this functionality.
## Test Plan
I wrote the tests in `tests/unit/server/test_sse.py` first, verified the
client disconnected test failed before my change, and that it passed
afterwards.
```
python -m pytest -s -v tests/unit/server/test_sse.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This stubs in some OpenAI server-side compatibility with three new
endpoints:
/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions
This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .
The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.
The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.
The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)
The goal to support this for every inference provider - proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses to
OpenAI responses.
This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.
## Test Plan
### vLLM
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
## Documentation
Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
* Manage UI deps in pyproject
* Use a new "ui" dep group to pull the deps with "uv"
* Simplify the run command
* Bump versions in requirements.txt
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Move pagination logic from LocalFS and HuggingFace implementations into
a common helper function to ensure consistent pagination behavior across
providers. This reduces code duplication and centralizes pagination
logic in one place.
## Test Plan
Run this script:
```
from llama_stack_client import LlamaStackClient
# Initialize the client
client = LlamaStackClient(base_url="http://localhost:8321")
# Register a dataset
response = client.datasets.register(
purpose="eval/messages-answer", # or "eval/question-answer" or "post-training/messages"
source={"type": "uri", "uri": "huggingface://datasets/llamastack/simpleqa?split=train"},
dataset_id="my_dataset", # optional, will be auto-generated if not provided
metadata={"description": "My evaluation dataset"}, # optional
)
# Verify the dataset was registered by listing all datasets
datasets = client.datasets.list()
print(f"Registered datasets: {[d.identifier for d in datasets]}")
# You can then access the data using the datasetio API
# rows = client.datasets.iterrows(dataset_id="my_dataset", start_index=1, limit=2)
rows = client.datasets.iterrows(dataset_id="my_dataset")
print(f"Data: {rows.data}")
```
And play with `start_index` and `limit`.
[//]: # (## Documentation)
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR adds support for NVIDIA's NeMo Customizer API to the Llama Stack
post-training module. The integration enables users to fine-tune models
using NVIDIA's cloud-based customization service through a consistent
Llama Stack interface.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
Yet to be done
Things pending under this PR:
- [x] Integration of fine-tuned model(new checkpoint) for inference with
nvidia llm distribution
- [x] distribution integration of API
- [x] Add test cases for customizer(In Progress)
- [x] Documentation
```
LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/post_training/test_supervised_fine_tuning.py
============================================================================================================================================================================ test session starts =============================================================================================================================================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0 -- /home/ubuntu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.0', 'Platform': 'Linux-6.8.0-1021-gcp-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'nbval': '0.11.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3'}}
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: nbval-0.11.0, metadata-3.1.1, anyio-4.8.0, html-4.1.1, asyncio-0.25.3
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_post_training_provider_registration[txt=8B] PASSED [ 50%]
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_list_training_jobs[txt=8B] PASSED [100%]
======================================================================================================================================================================== 2 passed, 1 warning in 0.10s ========================================================================================================================================================================
```
cc: @mattf @dglogo @sumitb
---------
Co-authored-by: Ubuntu <ubuntu@llama-stack-customizer-dev-inst-2tx95fyisatvlic4we8hidx5tfj.us-central1-a.c.brevdevprod.internal>
# What does this PR do?
Removed local execution option from the remote Qdrant provider and
introduced an explicit inline provider for the embedded execution.
Updated the ollama template to include this option: this part can be
reverted in case we don't want to have two default `vector_io`
providers.
(Closes#1082)
## Test Plan
Build and run an ollama distro:
```bash
llama stack build --template ollama --image-type conda
llama stack run --image-type conda ollama
```
Run one of the sample ingestionapplicatinos like
[rag_with_vector_db.py](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/rag_with_vector_db.py),
but replace this line:
```py
selected_vector_provider = vector_providers[0]
```
with the following, to use the `qdrant` provider:
```py
selected_vector_provider = vector_providers[1]
```
After running the test code, verify the timestamp of the Qdrant store:
```bash
% ls -ltr ~/.llama/distributions/ollama/qdrant.db/collection/test_vector_db_*
total 784
-rw-r--r--@ 1 dmartino staff 401408 Feb 26 10:07 storage.sqlite
```
[//]: # (## Documentation)
---------
Signed-off-by: Daniele Martinoli <dmartino@redhat.com>
Co-authored-by: Francisco Arceo <farceo@redhat.com>
# What does this PR do?
create a new dataset BFCL_v3 from
https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html
overall each question asks the model to perform a task described in
natural language, and additionally a set of available functions and
their schema are given for the model to choose from. the model is
required to write the function call form including function name and
parameters , to achieve the stated purpose. the results are validated
against provided ground truth, to make sure that the generated function
call and the ground truth function call are syntactically and
semantically equivalent, by checking their AST .
## Test Plan
start server by
```
llama stack run ./llama_stack/templates/ollama/run.yaml
```
then send traffic
```
llama-stack-client eval run-benchmark "bfcl" --model-id meta-llama/Llama-3.2-3B-Instruct --output-dir /tmp/gpqa --num-examples 2
```
[//]: # (## Documentation)
# What does this PR do?
The `test` section has been updated to include only the essential
dependencies needed for running integration tests, which are shared
across all providers. If a provider requires additional dependencies,
please add them to your environment separately. When using uv to
run your tests, you can specify extra dependencies with the
`--with` flag.
Signed-off-by: Sébastien Han <seb@redhat.com>