The test depends on llama's tool calling ability. In the CI, we run with
a small ollama model.
The fix might be to check for either message or function_call because
the model is flaky and we aren't really testing that behavior?
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in in place of the kv store.
## Test Plan
Added integration/unit tests.
Having to run (and re-run) a server while running verifications can be
annoying while you are iterating on code. This makes it so you can use
the library client -- and because it is OpenAI client compatible, it all
works.
## Test Plan
```
pytest -s -v tests/verifications/openai_api/test_responses.py \
--provider=stack:together \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
The most interesting MCP servers are those with an authorization wall in
front of them. This PR uses the existing `provider_data` mechanism of
passing provider API keys for passing MCP access tokens (in fact,
arbitrary headers in the style of the OpenAI Responses API) from the
client through to the MCP server.
```
class MCPProviderDataValidator(BaseModel):
# mcp_endpoint => list of headers to send
mcp_headers: dict[str, list[str]] | None = None
```
Note how we must stuff the headers for all MCP endpoints into a single
"MCPProviderDataValidator". Unlike existing providers (e.g., Together
and Fireworks for inference) where we could name the provider api keys
clearly (`together_api_key`, `fireworks_api_key`), we cannot name these
keys for MCP. We have a single generic MCP provider which can serve
multiple "toolgroups". So we use a dict to combine all the headers for
all MCP endpoints you may want to use in an agentic call.
## Test Plan
See the added integration test for usage.
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.
## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>
Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
# What does this PR do?
This PR introduces support for keyword based FTS5 search with BM25
relevance scoring. It makes changes to the existing EmbeddingIndex base
class in order to support a search_mode and query_str parameter, that
can be used for keyword based search implementations.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
run
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
Output:
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
====================================================== test session starts =======================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.4-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0
asyncio: mode=auto, asyncio_default_fixture_loop_scope=None
collected 7 items
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_fts PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_register_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_unregister_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
```
For reference, with the implementation, the fts table looks like below:
```
Chunk ID: 9fbc39ce-c729-64a2-260f-c5ec9bb2a33e, Content: Sentence 0 from document 0
Chunk ID: 94062914-3e23-44cf-1e50-9e25821ba882, Content: Sentence 1 from document 0
Chunk ID: e6cfd559-4641-33ba-6ce1-7038226495eb, Content: Sentence 2 from document 0
Chunk ID: 1383af9b-f1f0-f417-4de5-65fe9456cc20, Content: Sentence 3 from document 0
Chunk ID: 2db19b1a-de14-353b-f4e1-085e8463361c, Content: Sentence 4 from document 0
Chunk ID: 9faf986a-f028-7714-068a-1c795e8f2598, Content: Sentence 5 from document 0
Chunk ID: ef593ead-5a4a-392f-7ad8-471a50f033e8, Content: Sentence 6 from document 0
Chunk ID: e161950f-021f-7300-4d05-3166738b94cf, Content: Sentence 7 from document 0
Chunk ID: 90610fc4-67c1-e740-f043-709c5978867a, Content: Sentence 8 from document 0
Chunk ID: 97712879-6fff-98ad-0558-e9f42e6b81d3, Content: Sentence 9 from document 0
Chunk ID: aea70411-51df-61ba-d2f0-cb2b5972c210, Content: Sentence 0 from document 1
Chunk ID: b678a463-7b84-92b8-abb2-27e9a1977e3c, Content: Sentence 1 from document 1
Chunk ID: 27bd63da-909c-1606-a109-75bdb9479882, Content: Sentence 2 from document 1
Chunk ID: a2ad49ad-f9be-5372-e0c7-7b0221d0b53e, Content: Sentence 3 from document 1
Chunk ID: cac53bcd-1965-082a-c0f4-ceee7323fc70, Content: Sentence 4 from document 1
```
Query results:
Result 1: Sentence 5 from document 0
Result 2: Sentence 5 from document 1
Result 3: Sentence 5 from document 2
[//]: # (## Documentation)
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
The cache_ttl config value is not in fact tied to the lifetime of any of
the keys, it represents the time interval between for our key cache
refresher.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Kubernetes since 1.20 exposes a JWKS endpoint that we can use with our
recent oauth2 recent implementation.
The CI test has been kept intact for validation.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
feat(quota): add server‑side per‑client request quotas (requires auth)
Unrestricted usage can lead to runaway costs and fragmented client-side
workarounds. This commit introduces a native quota mechanism to the
server, giving operators a unified, centrally managed throttle for
per-client requests—without needing extra proxies or custom client
logic. This helps contain cloud-compute expenses, enables fine-grained
usage control, and simplifies deployment and monitoring of Llama Stack
services. Quotas are fully opt-in and have no effect unless explicitly
configured.
Notice that Quotas are fully opt-in and require authentication to be
enabled. The 'sqlite' is the only supported quota `type` at this time,
any other `type` will be rejected. And the only supported `period` is
'day'.
Highlights:
- Adds `QuotaMiddleware` to enforce per-client request quotas:
- Uses `Authorization: Bearer <client_id>` (from
AuthenticationMiddleware)
- Tracks usage via a SQLite-based KV store
- Returns 429 when the quota is exceeded
- Extends `ServerConfig` with a `quota` section (type + config)
- Enforces strict coupling: quotas require authentication or the server
will fail to start
Behavior changes:
- Quotas are disabled by default unless explicitly configured
- SQLite defaults to `./quotas.db` if no DB path is set
- The server requires authentication when quotas are enabled
To enable per-client request quotas in `run.yaml`, add:
```
server:
port: 8321
auth:
provider_type: "custom"
config:
endpoint: "https://auth.example.com/validate"
quota:
type: sqlite
config:
db_path: ./quotas.db
limit:
max_requests: 1000
period: day
[//]: # (If resolving an issue, uncomment and update the line below)
Closes#2093
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Wen Liang <wenliang@redhat.com>
Co-authored-by: Wen Liang <wenliang@redhat.com>
# What does this PR do?
This adds an alternative option to the oauth_token auth provider that
can be used with existing authorization services which support token
introspection as defined in RFC 7662. This could be useful where token
revocation needs to be handled or where opaque tokens (or other non jwt
formatted tokens) are used
## Test Plan
Tested against keycloak
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
Add support for "instructions" to the responses API. Instructions
provide a way to swap out system (or developer) messages in new
responses.
## Test Plan
unit tests added
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
When launching a fine-tuning job, an upcoming version of NeMo Customizer
will expect the `config` name to be formatted as
`namespace/name@version`. Here, `config` is a reference to a model +
additional metadata. There could be multiple `config`s that reference
the same base model.
This PR updates NVIDIA's `supervised_fine_tune` to simply pass the
`model` param as-is to NeMo Customizer. Currently, it expects a
specific, allowlisted llama model (i.e. `meta/Llama3.1-8B-Instruct`) and
converts it to the provider format (`meta/llama-3.1-8b-instruct`).
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
From a notebook, I built an image with my changes:
```
!llama stack build --template nvidia --image-type venv
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
```
And could successfully launch a job:
```
response = client.post_training.supervised_fine_tune(
job_uuid="",
model="meta/llama-3.2-1b-instruct@v1.0.0+A100", # Model passed as-is to Customimzer
...
)
job_id = response.job_uuid
print(f"Created job with ID: {job_id}")
Output:
Created job with ID: cust-Jm4oGmbwcvoufaLU4XkrRU
```
[//]: # (## Documentation)
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
This PR adds a notion of `principal` (aka some kind of persistent
identity) to the authentication infrastructure of the Stack. Until now
we only used access attributes ("claims" in the more standard OAuth /
OIDC setup) but we need the notion of a User fundamentally as well.
(Thanks @rhuss for bringing this up.)
This value is not yet _used_ anywhere downstream but will be used to
segregate access to resources.
In addition, the PR introduces a built-in JWT token validator so the
Stack does not need to contact an authentication provider to validating
the authorization and merely check the signed token for the represented
claims. Public keys are refreshed via the configured JWKS server. This
Auth Provider should overwhelmingly be considered the default given the
seamless integration it offers with OAuth setups.
# What does this PR do?
adds an inline HF SFTTrainer provider. Alongside touchtune -- this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc
the provider comes with one recipe `finetune_single_device` which works
both with and without LoRA.
any model that is a valid HF identifier can be given and the model will
be pulled.
this has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box
The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training
if checkpoint_dir is none, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.
## Test Plan
re-enabled post_training integration test suite with a singular test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API
runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.
[//]: # (## Documentation)
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
## Test Plan
LLAMA_STACK_CONFIG=http://localhost:8321 pytest
tests/integration/tool_runtime/test_rag_tool.py --embedding-model
text-embedding-3-small
# What does this PR do?
This fixes an issue in how we used the tool_call_buf from streaming tool
calls in the remote-vllm provider where it would end up concatenating
parameters from multiple different tool call results instead of
aggregating the results from each tool call separately.
It also fixes an issue found while digging into that where we were
accidentally mixing the json string form of tool call parameters with
the string representation of the python form, which mean we'd end up
with single quotes in what should be double-quoted json strings.
Closes#1120
## Test Plan
The following tests are now passing 100% for the remote-vllm provider,
where some of the test_text_inference were failing before this change:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_text_inference.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/inference/test_vision_inference.py --vision-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
```
All but one of the agent tests are passing (including the multi-tool
one). See the PR at https://github.com/vllm-project/vllm/pull/17917 and
a gist at
https://gist.github.com/bbrowning/4734240ce96b4264340caa9584e47c9e for
changes needed there, which will have to get made upstream in vLLM.
Agent tests:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/integration/agents/test_agents.py --text-model "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
````
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
currently the "default" dir for external providers is
`/etc/llama-stack/providers.d`
This dir is not used anywhere nor created.
Switch to a more friendly `~/.llama/providers.d/`
This allows external providers to actually create this dir and/or
populate it upon installation, `pip` cannot create directories in `etc`.
If a user does not specify a dir, default to this one
see https://github.com/containers/ramalama-stack/issues/36
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
his PR allows users to customize the template used for chunks when
inserted into the context. Additionally, this enables metadata injection
into the context of an LLM for RAG. This makes a naive and crude
assumption that each chunk should include the metadata, this is
obviously redundant when multiple chunks are returned from the same
document. In order to remove any sort of duplication of chunks, we'd
have to make much more significant changes so this is a reasonable first
step that unblocks users requesting this enhancement in
https://github.com/meta-llama/llama-stack/issues/1767.
In the future, this can be extended to support citations.
List of Changes:
- `llama_stack/apis/tools/rag_tool.py`
- Added `chunk_template` field in `RAGQueryConfig`.
- Added `field_validator` to validate the `chunk_template` field in
`RAGQueryConfig`.
- Ensured the `chunk_template` field includes placeholders `{index}` and
`{chunk.content}`.
- Updated the `query` method to use the `chunk_template` for formatting
chunk text content.
- `llama_stack/providers/inline/tool_runtime/rag/memory.py`
- Modified the `insert` method to pass `doc.metadata` for chunk
creation.
- Enhanced the `query` method to format results using `chunk_template`
and exclude unnecessary metadata fields like `token_count`.
- `llama_stack/providers/utils/memory/vector_store.py`
- Updated `make_overlapped_chunks` to include metadata serialization and
token count for both content and metadata.
- Added error handling for metadata serialization issues.
- `pyproject.toml`
- Added `pydantic.field_validator` as a recognized `classmethod`
decorator in the linting configuration.
- `tests/integration/tool_runtime/test_rag_tool.py`
- Refactored test assertions to separate `assert_valid_chunk_response`
and `assert_valid_text_response`.
- Added integration tests to validate `chunk_template` functionality
with and without metadata inclusion.
- Included a test case to ensure `chunk_template` validation errors are
raised appropriately.
- `tests/unit/rag/test_vector_store.py`
- Added unit tests for `make_overlapped_chunks`, verifying chunk
creation with overlapping tokens and metadata integrity.
- Added tests to handle metadata serialization errors, ensuring proper
exception handling.
- `docs/_static/llama-stack-spec.html`
- Added a new `chunk_template` field of type `string` with a default
template for formatting retrieved chunks in RAGQueryConfig.
- Updated the `required` fields to include `chunk_template`.
- `docs/_static/llama-stack-spec.yaml`
- Introduced `chunk_template` field with a default value for
RAGQueryConfig.
- Updated the required configuration list to include `chunk_template`.
- `docs/source/building_applications/rag.md`
- Documented the `chunk_template` configuration, explaining how to
customize metadata formatting in RAG queries.
- Added examples demonstrating the usage of the `chunk_template` field
in RAG tool queries.
- Highlighted default values for `RAG` agent configurations.
# Resolves https://github.com/meta-llama/llama-stack/issues/1767
## Test Plan
Updated both `test_vector_store.py` and `test_rag_tool.py` and tested
end-to-end with a script.
I also tested the quickstart to enable this and specified this metadata:
```python
document = RAGDocument(
document_id="document_1",
content=source,
mime_type="text/html",
metadata={"author": "Paul Graham", "title": "How to do great work"},
)
```
Which produced the output below:

This highlights the usefulness of the additional metadata. Notice how
the metadata is redundant for different chunks of the same document. I
think we can update that in a subsequent PR.
# Documentation
I've added a brief comment about this in the documentation to outline
this to users and updated the API documentation.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
In the Responses API, we convert incoming response requests to chat
completion requests. When streaming the resulting chunks of those chat
completion requests, inference providers that use OpenAI clients will
often return a `type=None` value in the tool call parts of the response.
This causes issues when we try to dump and load that response into our
pydantic model, because type cannot be None in the Responses API model
we're loading these into.
So, strip the "type" field, if present, off those chat completion tool
call results before dumping and loading them as our typed pydantic
models, which will apply our default value for that type field.
## Test Plan
This was found via manual testing of the Responses API with codex, where
I was getting errors in some tool call situations. I added a unit test
to simulate this scenario and verify the fix, as well as manual codex
testing to verify the fix.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Closes#2113.
Closes#1783.
Fixes a bug in handling the end of tool execution request stream where
no `finish_reason` is provided by the model.
## Test Plan
1. Ran existing unit tests
2. Added a dedicated test verifying correct behavior in this edge case
3. Ran the code snapshot from #2113
[//]: # (## Documentation)
# What does this PR do?
Closes#2111.
Fixes an error causing Llama Stack to just return `<tool_call>` and
complete the turn without actually executing the tool. See the issue
description for more detail.
## Test Plan
1) Ran existing unit tests
2) Added a dedicated test verifying correct behavior in this edge case
3) Ran the code snapshot from #2111
Prevent it from returning results about
'LT Wright Maverick Scout' knives. Ultimatly
we want the word "model" in the returned results
putting llm in the search term make this more likely.
Closes: #2150
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This is a combination of what was previously 3 separate PRs - #2069,
#2075, and #2083. It turns out all 3 of those are needed to land a
working function calling Responses implementation. The web search
builtin tool was already working, but this wires in support for custom
function calling.
I ended up combining all three into one PR because they all had lots of
merge conflicts, both with each other but also with #1806 that just
landed. And, because landing any of them individually would have only
left a partially working implementation merged.
The new things added here are:
* Storing of input items from previous responses and restoring of those
input items when adding previous responses to the conversation state
* Handling of multiple input item messages roles, not just "user"
messages.
* Support for custom tools passed into the Responses API to enable
function calling outside of just the builtin websearch tool.
Closes#2074Closes#2080
## Test Plan
### Unit Tests
Several new unit tests were added, and they all pass. Ran via:
```
python -m pytest -s -v tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
### Responses API Verification Tests
I ran our verification run.yaml against multiple providers to ensure we
were getting a decent pass rate. Specifically, I ensured the new custom
tool verification test passed across multiple providers and that the
multi-turn examples passed across at least some of the providers (some
providers struggle with the multi-turn workflows still).
Running the stack setup for verification testing:
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Together, passing 100% as an example:
```
pytest -s -v 'tests/verifications/openai_api/test_responses.py' --provider=together-llama-stack
```
## Documentation
We will need to start documenting the OpenAI APIs, but for now the
Responses stuff is still rapidly evolving so delaying that.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
reduces duplication and centralizes information to be easier to find for
contributors
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
While adding other tests, I came across this and wasn’t sure how useful
it is. It doesn’t seem to be exercised anywhere in CI, but I figured I’d
fix it anyway. Happy to remove it if preferred. :)
## Test Plan
Run:
```
uv run pytest tests/integration/inference --stack-config=ollama --report=test_report.md -v --text-model="llama3.2:3b" --embedding-model=all-MiniLM-L6-v2
```
Look at the produced `test_report.md`.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR fixes the behavior of the `/tool-runtime/rag-tool/query`
endpoint when invoked with an empty `vector_db_ids` parameter.
As of now, it simply returns an empty result, which leads to a
misleading error message from the server and makes it difficult and
time-consuming to detect the problem with the input parameter.
The proposed fix is to return an indicative error message in this case.
## Test Plan
Running the following script:
```
agent = Agent(
client,
model=MODEL_ID,
instructions=SYSTEM_PROMPT,
tools=[
dict(
name="builtin::rag/knowledge_search",
args={
"vector_db_ids": [],
},
)
],
)
response = agent.create_turn(
messages=[
{
"role": "user",
"content": "How to install OpenShift?",
}
],
session_id=agent.create_session(f"rag-session")
)
```
results in the following error message in the non-patched version:
```
{"type": "function", "name": "knowledge_search", "parameters": {"query": "installing OpenShift"}}400: Invalid value: Tool call result (id: 494b8020-90bb-449b-aa76-10960d6b2cc2, name: knowledge_search) does not have any content
```
and in the following one in the patched version:
```
{"type": "function", "name": "knowledge_search", "parameters": {"query": "installing OpenShift"}}400: Invalid value: No vector DBs were provided to the RAG tool. Please provide at least one DB.
```
# What does this PR do?
switch sambanova inference adaptor to LiteLLM usage to simplify
integration and solve issues with current adaptor when streaming and
tool calling, models and templates updated
## Test Plan
pytest -s -v tests/integration/inference/test_text_inference.py
--stack-config=sambanova
--text-model=sambanova/Meta-Llama-3.3-70B-Instruct
pytest -s -v tests/integration/inference/test_vision_inference.py
--stack-config=sambanova
--vision-model=sambanova/Llama-3.2-11B-Vision-Instruct
# What does this PR do?
Checks for RAGDocument of type InterleavedContent
I noticed when stepping through the code that the supported types for
`RAGDocument` included `InterleavedContent` as a content type. This type
is not checked against before putting the `doc.content` is regex matched
against. This would cause a runtime error. This change adds an explicit
check for type.
The only other part that I'm unclear on is how to handle the
`ImageContent` type since this would always just return `<image>` which
seems like an undesired behavior. Should the `InterleavedContent` type
be removed from `RAGDocument` and replaced with `URI | str`?
## Test Plan
[//]: # (## Documentation)
---------
Signed-off-by: Kevin <kpostlet@redhat.com>
# What does this PR do?
**Fixes** #1959
HuggingFace provides several loading paths that the datasets library can
use. My theory on why the test would previously fail intermittently is
because when calling `load_dataset(...)`, it may be trying several
options such as local cache, Hugging Face Hub, or a dataset script, or
other. There's one of these options that seem to work inconsistently in
the CI.
The HuggingFace datasets library relies on the `transformers` package to
load certain datasets such as `llamastack/simpleqa`, and by adding the
package, we can see the dataset is loaded consistently via the Hugging
Face Hub.
Please see PR in my fork demonstrating over 7 consecutive passes:
https://github.com/ChristianZaccaria/llama-stack/pull/1
**Some References:**
- https://github.com/huggingface/transformers/issues/8690
- https://huggingface.co/docs/datasets/en/loading
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Add fixtures for SqliteKVStore, DiskDistributionRegistry and
CachedDiskDistributionRegistry. And use them in tests that had all been
duplicating similar setups.
## Test Plan
unit tests continue to run
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
We've disabled it for a while given that this hasn't worked as well as
expected given the frequent changes of llama_stack_client and how this
requires both repos to be in sync.
## Test Plan
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
When converting OpenAI message content for the "system" and "assistant"
roles to Llama Stack inference APIs (used for some providers when
dealing with Llama models via OpenAI API requests to get proper prompt /
tool handling), we were not properly converting any non-string content.
I discovered this while running the new Responses AI verification suite
against the Fireworks provider, but instead of fixing it as part of some
ongoing work there split this out into a separate PR.
This fixes that, by using the `openai_content_to_content` helper we used
elsewhere to ensure content parts were mapped properly.
## Test Plan
I added a couple of new tests to `test_openai_compat` to reproduce this
issue and validate its fix. I ran those as below:
```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.
This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.
## Test Plan
Existing tests.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Add support for the temperature to the responses API
## Test Plan
Manually tested simple case
unit tests added for simple case and tool calls
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
When the result of a ToolCall gets passed back into vLLM for the model
to handle the tool call result (as is often the case in agentic
tool-calling workflows), we forgot to handle the case where BuiltinTool
calls are not string values but instead instances of the BuiltinTool
enum. This fixes that, properly converting those enums to string values
before trying to serialize them into an OpenAI chat completion request
to vLLM.
PR #1931 fixed a bug where we weren't passing these tool calling results
back into vLLM, but as a side-effect it created this serialization bug
when using BuiltinTools.
Closes#2070
## Test Plan
I added a new unit test to the openai_compat unit tests to cover this
scenario, ensured the new test failed before this fix, and all the
existing tests there plus the new one passed with this fix.
```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add several new pre-commit hooks to improve code quality and security:
- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml
The respective fixes have been included.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
When running a Llama Stack server and invoking the
`/v1/safety/run-shield` endpoint, the NVIDIA Guardrails endpoint in some
cases errors with a `422: Unprocessable Entity` due to malformed input.
For example, given an request body like:
```
{
"model": "test",
"messages": [
{ "role": "user", "content": "You are stupid." }
]
}
```
`convert_pydantic_to_json_value` converts the message to:
```
{ "role": "user", "content": "You are stupid.", "context": null }
```
Which causes NVIDIA Guardrails to return an error `HTTPError: 422 Client
Error: Unprocessable Entity for url:
http://nemo.test/v1/guardrail/checks`, because `context` shouldn't be
included in the body.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
I ran the Llama Stack server locally and manually verified that the
endpoint now succeeds.
```
message = {"role": "user", "content": "You are stupid."}
response = client.safety.run_shield(messages=[message], shield_id=shield_id, params={})
```
Server logs:
```
14:29:09.656 [START] /v1/safety/run-shield
INFO: 127.0.0.1:54616 - "POST /v1/safety/run-shield HTTP/1.1" 200 OK
14:29:09.918 [END] /v1/safety/run-shield [StatusCode.OK] (262.26ms
```
[//]: # (## Documentation)
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>