# What does this PR do?
Partial revert of fa68ded07c
This commit ensures users know where their new templates are generated
and how to run the newly built distro locally.
discussion on Discord:
1351652390
## Test Plan
Did a local run - let me know if we want any unit testing covering this

## Documentation
Updated "Zero to Hero" guide with new output
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
- Added new Ruff lint rules to detect ambiguous or non-ASCII characters.
- Added per-file ignores where Unicode usage is still required.
- Fixed whatever had to be fixed
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
When running a Llama Stack server and invoking the
`/v1/safety/run-shield` endpoint, the NVIDIA Guardrails endpoint in some
cases errors with a `422: Unprocessable Entity` due to malformed input.
For example, given a request body like:
```
{
"model": "test",
"messages": [
{ "role": "user", "content": "You are stupid." }
]
}
```
`convert_pydantic_to_json_value` converts the message to:
```
{ "role": "user", "content": "You are stupid.", "context": null }
```
This causes NVIDIA Guardrails to return the error `HTTPError: 422 Client
Error: Unprocessable Entity for url:
http://nemo.test/v1/guardrail/checks`, because `context` shouldn't be
included in the body.
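One way to avoid sending such null fields is to drop `None`-valued keys before the request body is serialized. The sketch below is illustrative only: `drop_none` is a hypothetical helper, not necessarily how the adapter implements the fix.

```python
from typing import Any


def drop_none(value: Any) -> Any:
    """Recursively remove dict keys whose value is None (hypothetical helper)."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


# The converted message from the description above, with the unwanted null field.
message = {"role": "user", "content": "You are stupid.", "context": None}
cleaned = drop_none(message)
print(cleaned)
```

With the null `context` key removed, the body matches the shape the client originally sent.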
## Test Plan
I ran the Llama Stack server locally and manually verified that the
endpoint now succeeds.
```
message = {"role": "user", "content": "You are stupid."}
response = client.safety.run_shield(messages=[message], shield_id=shield_id, params={})
```
Server logs:
```
14:29:09.656 [START] /v1/safety/run-shield
INFO: 127.0.0.1:54616 - "POST /v1/safety/run-shield HTTP/1.1" 200 OK
14:29:09.918 [END] /v1/safety/run-shield [StatusCode.OK] (262.26ms)
```
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
Replaced `${env.OTEL_SERVICE_NAME:\u200B}` and similar variants with
properly formatted `${env.OTEL_SERVICE_NAME:}` across all YAML templates
and TelemetryConfig. This prevents silent parsing issues and ensures
consistent environment variable resolution.
This slipped in with https://github.com/meta-llama/llama-stack/pull/2058
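Zero-width characters like this are invisible in most editors, so a small scan can help find them. The following is a minimal sketch, not part of this PR: the `find_zero_width` helper and the sample text are assumptions made for illustration.

```python
from typing import List, Tuple

# Characters that render as nothing but still end up in parsed values.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}


def find_zero_width(text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, code point) for every zero-width character found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in ZERO_WIDTH:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


# Hypothetical YAML snippet containing the problematic placeholder.
sample = "service_name: ${env.OTEL_SERVICE_NAME:\u200b}\nport: 8321"
hits = find_zero_width(sample)
print(hits)
```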
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* pull the embedding model so that it's not pulled during the distro
server startup sequence
* cache the models
* collect logs at the end of the workflow
Signed-off-by: Sébastien Han <seb@redhat.com>
Distribution Template Codegen was broken
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
* new workflow job **build-ubi9-container-distribution**
* runs on the default `ubuntu-latest` runner
* uses the existing `dev` template
* invokes `uv run llama stack build` with `.container_base =
"registry.access.redhat.com/ubi9/ubi-minimal:latest"`
* inspects the resulting image to verify its entrypoint
Closes #1994
## Test Plan
- CI now includes the `build-ubi9-container-distribution` job and will
turn green when that job passes on changes to build files
# What does this PR do?
The telemetry provider config is the only one that uses the env var
`SQLITE_DB_PATH` to point to persistent data in the respective
templates, whereas `SQLITE_STORE_DIR` is used everywhere else.
This PR modifies the `sqlite_db_path` in various telemetry configuration
files to use the environment variable `SQLITE_STORE_DIR` instead of
`SQLITE_DB_PATH`. This change ensures that _only_ the SQLITE_STORE_DIR
needs to be set to point to a different persistence location for
providers.
All references to `SQLITE_DB_PATH` have been removed.
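To illustrate why a single variable is convenient, here is a rough sketch of how a `${env.NAME:default}` placeholder could be resolved. This is not Llama Stack's actual resolver; `resolve_env`, the default directory, and the `trace_store.db` file name are assumptions for the example.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${env.NAME:default}; the default may be empty.
_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve_env(value: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Replace each placeholder with the env var's value, or its default."""
    env = os.environ if environ is None else environ
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)


# Hypothetical template value for the telemetry provider.
template = "${env.SQLITE_STORE_DIR:/tmp/store}/trace_store.db"
print(resolve_env(template, {}))
print(resolve_env(template, {"SQLITE_STORE_DIR": "/data"}))
```

Setting only `SQLITE_STORE_DIR` moves every path built this way to the new location.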
Another improvement could be to move `sqlite_db_path` to `db_path` in
the telemetry provider config, to align with the other provider
configurations. That could be done by another PR (if wanted).
# What does this PR do?
This builds on top of
https://github.com/meta-llama/llama-stack/pull/2037 to include some
additional changes that fix the integration test builds.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
In our OpenAI API verification tests, ollama was still calling tools
even when `tool_choice="none"` was passed in its chat completion
requests. Because ollama isn't respecting `tool_choice` properly, this
adjusts our provider implementation to remove the `tools` from the
request if `tool_choice="none"` is passed in so that it does not attempt
to call any of those tools.
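The adjustment described above can be sketched roughly as follows. This is a simplified illustration, assuming the request parameters are available as a plain dict; `strip_tools_if_disabled` is a hypothetical name, not the provider's actual function.

```python
from typing import Any, Dict


def strip_tools_if_disabled(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request params without `tools` when tool_choice is "none"."""
    if params.get("tool_choice") == "none":
        return {k: v for k, v in params.items() if k != "tools"}
    return dict(params)


request = {
    "model": "llama3.2:3b-instruct-fp16",
    "messages": [{"role": "user", "content": "What is the weather?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather"}}],
    "tool_choice": "none",
}
print(strip_tools_if_disabled(request))
```

Because the backend never sees the tool definitions, it has nothing to call, regardless of whether it honors `tool_choice`.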
## Test Plan
I tested this with a couple of Llama models, using both our OpenAI
completions integration tests and our verification test suites.
### OpenAI Completions / Chat Completions integration tests
These all passed before, and still do.
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" \
llama stack build --template ollama --image-type venv --run
```
```
LLAMA_STACK_CONFIG=http://localhost:8321 \
python -m pytest -v \
tests/integration/inference/test_openai_completion.py \
--text-model "llama3.2:3b-instruct-fp16"
```
### OpenAI API Verification test suite
test_chat_*_tool_choice_none OpenAI API verification tests pass now,
when they failed before.
See
https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#ollama-llama-stack
for an example of these failures from a recent nightly CI run.
```
INFERENCE_MODEL="llama3.3:70b-instruct-q3_K_M" \
llama stack build --template ollama --image-type venv --run
```
```
cat <<-EOF > tests/verifications/conf/ollama-llama-stack.yaml
base_url: http://localhost:8321/v1/openai/v1
api_key_var: OPENAI_API_KEY
models:
- llama3.3:70b-instruct-q3_K_M
model_display_names:
  llama3.3:70b-instruct-q3_K_M: Llama-3.3-70B-Instruct
test_exclusions:
  llama3.3:70b-instruct-q3_K_M:
  - test_chat_non_streaming_image
  - test_chat_streaming_image
  - test_chat_multi_turn_multiple_images
EOF
```
```
python -m pytest -s -v \
'tests/verifications/openai_api/test_chat_completion.py' \
--provider=ollama-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR updates how the `AgentType` gets set using the radio button on
the tools page of the playground. This change is needed because, with
the current implementation, the chat interface resets after every
input, preventing users from having a multi-turn conversation with
the agent.
## Test Plan
Run the Playground without these changes:
```bash
streamlit run llama_stack/distribution/ui/app.py
```
Navigate to the tools page and attempt to have a multi-turn
conversation. You should see the conversation reset after asking a
second question.
Repeat the steps above with these changes and you will see that it works
as expected when asking the agent multiple questions.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.
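As a rough illustration of the idea, the sketch below stores each response under its ID and rebuilds the conversation by following `previous_response_id` links. It uses an in-memory dict in place of a real key-value store; `InMemoryResponseStore` and its methods are hypothetical names, not the classes added in this PR.

```python
from typing import Dict, List, Optional


class InMemoryResponseStore:
    """Toy stand-in for a key-value store keyed by response ID."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def put(self, response_id: str, messages: List[dict],
            previous_response_id: Optional[str] = None) -> None:
        self._data[response_id] = {
            "messages": messages,
            "previous_response_id": previous_response_id,
        }

    def history(self, response_id: str) -> List[dict]:
        """Walk the chain of previous responses and return messages oldest first."""
        chain = []
        current: Optional[str] = response_id
        while current is not None:
            record = self._data[current]
            chain.append(record["messages"])
            current = record["previous_response_id"]
        return [msg for msgs in reversed(chain) for msg in msgs]


store = InMemoryResponseStore()
store.put("resp_1", [{"role": "user", "content": "Hi"},
                     {"role": "assistant", "content": "Hello!"}])
store.put("resp_2", [{"role": "user", "content": "And again?"}],
          previous_response_id="resp_1")
print(store.history("resp_2"))
```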
## Test Plan
I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```
```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
tests/integration/openai_responses/test_openai_responses.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:
- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage
The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints
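For context, bearer token validation starts with pulling the token out of the `Authorization` header. The following is a minimal sketch of that first step only; `extract_bearer_token` is a hypothetical helper and does not reflect the provider code in this PR.

```python
from typing import Optional


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' header value, else None."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


print(extract_bearer_token("Bearer abc123"))
print(extract_bearer_token("Basic dXNlcjpwYXNz"))
```

The extracted token would then be handed to whichever provider is configured (Kubernetes or a custom endpoint) for actual validation.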
## Test Plan
Set up a Kubernetes cluster using Kind or Minikube.
Run a server with:
```
server:
  port: 8321
  auth:
    provider_type: kubernetes
    config:
      api_server_url: http://url
      ca_cert_path: path/to/cert  # optional
```
Run:
```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```
Or replace "my-user" with your service account.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Implementation of the NeMo Datastore register and unregister APIs.
Open Issues:
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable
see: #1860
Currently I have passed `"provider_id":"nvidia"` in metadata and have
parsed that in `DatasetsRoutingTable`
(Not the best approach, but just a quick workaround to make it work for
now.)
## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items
tests/unit/providers/nvidia/test_datastore.py .. [100%]
============================================================ warnings summary ============================================================
====================================================== 2 passed, 1 warning in 0.84s ======================================================
```
cc: @dglogo, @mattf, @yanxi0830
# What does this PR do?
Add installation script for Llama Stack Meta Reference distro (Docker
only).
Closes #1374
## Test Plan
./install.sh
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
This resolves a new critical-severity vulnerability in h11. See
https://access.redhat.com/security/cve/cve-2025-43859. We should
consider releasing a new patch with this fix.
This was updated via:
```
uv add "h11>=0.16.0"
uv export --frozen --no-hashes --no-emit-project --output-file=requirements.txt
```
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
Recent changes in the repo require some additional functions in the
inference provider, which this PR adds. It also adds one additional
parameter for passing extra arguments to watsonx.ai.
---------
Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
# What does this PR do?
This addresses 2 bugs I ran into when launching a fine-tuning job with
the NVIDIA Adapter:
1. Session handling in `_make_request` helper function returns an error.
```
INFO: 127.0.0.1:55831 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error
16:11:45.643 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (270.44ms)
16:11:45.643 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post'
Traceback (most recent call last):
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 201, in endpoint
return await maybe_await(value)
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 161, in maybe_await
return await value
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 408, in supervised_fine_tune
response = await self._make_request(
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 98, in _make_request
async with self.session.request(method, url, params=params, json=json, **kwargs) as response:
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 1425, in __aenter__
self._resp: _RetType = await self._coro
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 579, in _request
handle = tm.start()
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/helpers.py", line 587, in start
return self._loop.call_at(when, self.__call__)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 724, in call_at
self._check_closed()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 510, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
```
Note: This only occurred when initializing the client like so:
```
client = LlamaStackClient(
base_url="http://0.0.0.0:8321"
)
response = client.post_training.supervised_fine_tune(...) # Returns error
```
I didn't run into this issue when using the library client:
```
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
response = client.post_training.supervised_fine_tune(...) # Works fine
```
2. The `algorithm_config` param in `supervised_fine_tune` is parsed as a
`dict` when run from unit tests, but a Pydantic model when invoked using
the Llama Stack client. So, the call fails outside of unit tests:
```
INFO: 127.0.0.1:54024 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error
21:14:02.315 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (71.18ms)
21:14:02.314 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post'
Traceback (most recent call last):
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 205, in endpoint
return await maybe_await(value)
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 164, in maybe_await
return await value
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 407, in supervised_fine_tune
"adapter_dim": algorithm_config.get("adapter_dim"),
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/pydantic/main.py", line 891, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'LoraFinetuningConfig' object has no attribute 'get'
```
The code assumes `algorithm_config` should be `dict`, so I just handle
both cases.
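Handling both cases can look roughly like the sketch below. It is illustrative only: `get_config_value` is a hypothetical helper, and `FakeLoraConfig` is a plain class standing in for a Pydantic model such as `LoraFinetuningConfig`.

```python
from typing import Any


def get_config_value(config: Any, key: str, default: Any = None) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if isinstance(config, dict):
        return config.get(key, default)
    return getattr(config, key, default)


class FakeLoraConfig:
    """Stand-in for a Pydantic model (hypothetical, for illustration only)."""

    def __init__(self, adapter_dim: int) -> None:
        self.adapter_dim = adapter_dim


print(get_config_value({"adapter_dim": 16}, "adapter_dim"))
print(get_config_value(FakeLoraConfig(adapter_dim=16), "adapter_dim"))
```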
## Test Plan
1. I ran a local Llama Stack server with the necessary env vars:
```
llama stack run llama_stack/templates/nvidia/run.yaml --port 8321 --env ...
```
And invoked `supervised_fine_tune` to confirm that neither of the errors
above occurs.
```
client = LlamaStackClient(
base_url="http://0.0.0.0:8321"
)
response = client.post_training.supervised_fine_tune(...)
```
2. I confirmed the unit tests still pass: `./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_supervised_fine_tuning.py`
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
## Test Plan
LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v
tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B
--vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model
meta-llama/Llama-4-Scout-17B-16E-Instruct
Adding the `--gpus all` flag to the `docker run` commands
for meta-reference-gpu distributions ensures models are loaded onto the GPU
instead of the CPU.
Remove docs for meta-reference-quantized-gpu
The distribution was removed in #1887
but these files were left behind.
Fixes: #1798
# What does this PR do?
Fixes the docs to add the `--gpus all` flag to `docker run`.
Closes #1798
## Test Plan
verified in docker documentation but untested
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Introduce a `.coveragerc` file to omit:
- test files (`*/tests/*`)
- provider code (`*/llama_stack/providers/*`)
- template files (`*/llama_stack/templates/*`)
- virtual environment (`.venv/*`)
This ensures coverage reports focus on core application logic (API and
CLI).
Note: I'm also opening this for discussion - we might decide to
ignore more directories and/or re-add some!
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Adds IBM watsonx.ai as an inference provider
([#1741](https://github.com/meta-llama/llama-stack/issues/1741)).
---------
Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
# What does this PR do?
Enhances the user experience of the `llama stack build` command by
adding interactive TAB completion for image type selection. This keeps
the UX consistent with other parts of the CLI that already support tab
completion, such as provider selection, and provides a more intuitive and
discoverable interface for users.
<img width="1531" alt="image"
src="https://github.com/user-attachments/assets/12161d45-451d-4820-b34d-7ea4decf810f"
/>
# What does this PR do?
This PR improves the Tools page in the LlamaStack Playground UI by
enhancing the readability of the active tool list shown in the sidebar.
- Previously, active tools were displayed in a flat JSON array with
verbose identifiers (e.g., builtin::code_interpreter:code_interpreter).
- This PR updates the logic to group tools by their toolgroup (e.g.,
builtin::websearch) and renders each tool name in a simplified,
human-readable format (e.g., web_search).
- This change improves usability when working with multiple toolgroups,
especially in configurations involving MCP tools or complex tool
identifiers.
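The grouping can be sketched as follows. This is a simplified illustration, assuming identifiers have the form `<toolgroup>:<tool_name>` as in the first example above; `group_tools` is a hypothetical helper, and the `builtin::websearch:web_search` identifier is assumed by analogy rather than taken from the UI code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_tools(identifiers: Iterable[str]) -> Dict[str, List[str]]:
    """Group verbose tool identifiers by toolgroup, keeping only the short tool name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for identifier in identifiers:
        # Split on the last ':' so the '::' inside the toolgroup stays intact.
        toolgroup, _, tool_name = identifier.rpartition(":")
        grouped[toolgroup].append(tool_name)
    return dict(grouped)


active = [
    "builtin::code_interpreter:code_interpreter",
    "builtin::websearch:web_search",  # assumed identifier form
]
print(group_tools(active))
```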
## Test Plan
- Followed the [LlamaStack UI Developer Setup
instructions](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/distribution/ui)
- Ran the Streamlit UI via: `uv run --with "[.ui]" streamlit run
llama_stack/distribution/ui/app.py`
- Selected multiple built-in toolgroups (e.g., code_interpreter,
websearch, wolfram_alpha) from the sidebar.
# What does this PR do?
Adding the `nbformat` version fixes this issue. I'm not sure exactly why
this is needed, but the version was written to the bottom of a notebook
file when I renamed it while investigating. When I opened the notebook
on GitHub afterwards, the issue was no longer present.
Closes #1837
## Test Plan
N/A
# What does this PR do?
Adds custom model registration functionality to `NVIDIAInferenceAdapter`,
which lets inference run on:
- post-training models
- non-Llama models in the API Catalogue (behind
https://integrate.api.nvidia.com and endpoints compatible with
`AsyncOpenAI`)
## Example Usage:
```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()
client.models.register(
model_id=model_name,
model_type=ModelType.llm,
provider_id="nvidia"
)
response = client.inference.chat_completion(
model_id=model_name,
messages=[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
)
```
## Test Plan
```bash
pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items
tests/unit/providers/nvidia/test_supervised_fine_tuning.py ...... [100%]
============================================================ warnings summary ============================================================
../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076
/home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 6 passed, 1 warning in 1.51s ======================================================
```
[//]: # (## Documentation)
Updated Readme.md
cc: @dglogo, @sumitb, @mattf
# What does this PR do?
This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack
eval module. The integration enables users to evaluate models via the
Llama Stack interface.
## Test Plan
1. Added unit tests and successfully ran from root of project:
`./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py`
```
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED
```
2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd)
llama stack build --template nvidia --image-type venv`
Documentation added to
`llama_stack/providers/remote/eval/nvidia/README.md`
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This expands the `test_sse` test suite and fixes some edge cases with
bugs in our SSE error handling to ensure streaming clients always get a
proper error response.
First, we handle the case where a client disconnects before we actually
start streaming the response back. Previously we only handled the case
where a client disconnected as we were streaming the response, but there
was an edge case where a client disconnecting before we streamed any
response back did not trigger our logic to cleanly handle that
disconnect.
Second, we handle the case where an error is thrown from the server
before the actual async generator gets created from the provider. This
happens in scenarios like the newly merged OpenAI API input validation,
where we eagerly raise validation errors before returning the async
generator object that streams the responses back.
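The second case can be sketched as below: the call that creates the stream sits inside the same `try` block as the iteration, so an eagerly raised error still becomes an SSE event. This is a simplified illustration, not the server's actual code; `sse_events`, `failing_stream`, and the error payload shape are assumptions.

```python
import asyncio
import json
from typing import AsyncIterator, Callable, List


async def sse_events(make_stream: Callable[[], AsyncIterator[dict]]) -> AsyncIterator[str]:
    """Yield SSE 'data:' frames, turning any error into a final error frame."""
    try:
        stream = make_stream()  # may raise before any generator exists
        async for chunk in stream:
            yield "data: " + json.dumps(chunk) + "\n\n"
    except Exception as exc:
        payload = json.dumps({"error": {"message": str(exc)}})
        yield "data: " + payload + "\n\n"


def failing_stream() -> AsyncIterator[dict]:
    # Simulates eager input validation that fails before streaming starts.
    raise ValueError("invalid input")


async def collect(make_stream: Callable[[], AsyncIterator[dict]]) -> List[str]:
    return [event async for event in sse_events(make_stream)]


events = asyncio.run(collect(failing_stream))
print(events)
```

The client receives a well-formed error frame instead of a dropped connection.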
## Test Plan
Tested via:
```
python -m pytest -s -v tests/unit/server/test_sse.py
```
Both test cases failed before, and passed afterwards. The test cases
were written based on me experimenting with actual clients that would do
bad things like randomly disconnect or send invalid input in streaming
mode and I hit these two cases, where things were misbehaving in our
error handling.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Include the tool call details with the chat when doing RAG with remote
vLLM.
Fixes: #1929
With this PR the tool call is included in the chat returned to vLLM, and the
model (meta-llama/Llama-3.1-8B-Instruct) then returns the answer as
expected.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Remove `distributions/**` from integration, external provider, and unit
tests
## Test Plan
N/A
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Update External Providers CI to not run on changes to docs, rfcs, and
scripts
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This PR addresses the content dominance problem that frequently arises
with multiple models when executing queries with the RAG tool. When the
retrieved content is too large, it disproportionately influences the
generation process, causing the model to ignore the original question
and to provide meaningless comments on the retrieved information
instead.
This situation is especially common with agentic RAG, which is the
standard way of doing RAG in Llama Stack, since directly manipulating
the prompt combining the query with the retrieved content is not
possible.
This PR appends a grounding message to the results returned by the
knowledge search tool, reminding the model about the original query and
the purpose of the inference call. This makes the problem significantly
less likely to occur.
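The approach can be sketched as follows. This is only an illustration of the idea: `with_grounding` is a hypothetical helper, and the wording of the reminder is an assumption, not the exact message added by this PR.

```python
from typing import Iterable


def with_grounding(query: str, retrieved_chunks: Iterable[str]) -> str:
    """Append a reminder of the original query after the retrieved content."""
    parts = list(retrieved_chunks)
    # Hypothetical wording; the actual grounding message may differ.
    parts.append(
        f'The above results were retrieved to help answer the query: "{query}". '
        "Use them as supporting information only when answering this query."
    )
    return "\n".join(parts)


result = with_grounding(
    "How to install OpenShift?",
    ["chunk 1: ...installation prerequisites...", "chunk 2: ...cluster sizing..."],
)
print(result)
```

Placing the reminder last keeps the original question close to the point where generation starts, even when the retrieved content is long.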
## Test Plan
Running the following script before the fix demonstrates the content
dominance problem, where the model insists on commenting on the retrieved
content and refuses to address the question.
Running the script after the fix results in getting the correct answer.
```
import os
import uuid
from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient
# the server endpoint
LLAMA_STACK_SERVER_URL = "http://localhost:8321"
# inference settings
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
SYSTEM_PROMPT = "You are a helpful assistant. "
# RAG settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512
# initialize the server connection
client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL))
# init the RAG retrieval parameters
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
vector_providers = [
    provider for provider in client.providers.list() if provider.api == "vector_io"
]
vector_provider_to_use = vector_providers[0]
# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=VECTOR_DB_EMBEDDING_MODEL,
    embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
    provider_id=vector_provider_to_use.provider_id,
)
# ingest the documents into the newly created document collection
urls = [
    ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)
queries = [
    "How to install OpenShift?",
]
# initializing the agent
agent = Agent(
    client,
    model=MODEL_ID,
    instructions=SYSTEM_PROMPT,
    # we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools
    tools=[
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # list of IDs of document collections to consider during retrieval
            },
        )
    ],
)
for prompt in queries:
    print(f"User> {prompt}")
    # create a new turn with a new session ID for each prompt
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=agent.create_session(f"rag-session_{uuid.uuid4()}")
    )
    # print the response, including tool calls output
    for log in AgentEventLogger().log(response):
        print(log.content, end='')
```
As part of the build process, we now include the generated run.yaml
(based on the provided build configuration file) in the container. We
updated the entrypoint to use this run configuration as well.
Given this simple distribution configuration:
```
# build.yaml
version: '2'
distribution_spec:
  description: Use (an external) Ollama server for running LLM inference
  providers:
    inference:
    - remote::ollama
    vector_io:
    - inline::faiss
    safety:
    - inline::llama-guard
    agents:
    - inline::meta-reference
    telemetry:
    - inline::meta-reference
    eval:
    - inline::meta-reference
    datasetio:
    - remote::huggingface
    - inline::localfs
    scoring:
    - inline::basic
    - inline::llm-as-judge
    - inline::braintrust
    tool_runtime:
    - remote::brave-search
    - remote::tavily-search
    - inline::code-interpreter
    - inline::rag-runtime
    - remote::model-context-protocol
    - remote::wolfram-alpha
  container_image: "registry.access.redhat.com/ubi9"
image_type: container
image_name: test
```
Build it:
```
llama stack build --config build.yaml
```
Run it:
```
podman run --rm \
-p 8321:8321 \
-e OLLAMA_URL=http://host.containers.internal:11434 \
--name llama-stack-server \
localhost/leseb-test:0.2.2
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The ramalama team has decided to rename their external provider to
`ramalama-stack` (more catchy!). Update the docs accordingly.
Signed-off-by: Charlie Doern <cdoern@redhat.com>