llama-stack

forked from phoenix-oss/llama-stack-mirror

Author	SHA1	Message	Date
Ashwin Bharambe	4d0bfbf984	feat: add api.llama provider, llama-guard-4 model (#2058 ) This PR adds a llama-stack inference provider for `api.llama.com`, as well as adds entries for Llama-Guard-4 and updated Prompt-Guard models.	2025-04-29 10:07:41 -07:00
Ben Browning	934446ddb4	fix: ollama still using tools with `tool_choice="none"` (#2047 ) # What does this PR do? In our OpenAI API verification tests, ollama was still calling tools even when `tool_choice="none"` was passed in its chat completion requests. Because ollama isn't respecting `tool_choice` properly, this adjusts our provider implementation to remove the `tools` from the request if `tool_choice="none"` is passed in so that it does not attempt to call any of those tools. ## Test Plan I tested this with a couple of Llama models, using both our OpenAI completions integration tests and our verification test suites. ### OpenAI Completions / Chat Completions integration tests These all passed before, and still do. ``` INFERENCE_MODEL="llama3.2:3b-instruct-fp16" \ llama stack build --template ollama --image-type venv --run ``` ``` LLAMA_STACK_CONFIG=http://localhost:8321 \ python -m pytest -v \ tests/integration/inference/test_openai_completion.py \ --text-model "llama3.2:3b-instruct-fp16" ``` ### OpenAI API Verification test suite test_chat_*_tool_choice_none OpenAI API verification tests pass now, when they failed before. See https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#ollama-llama-stack for an example of these failures from a recent nightly CI run. 
``` INFERENCE_MODEL="llama3.3:70b-instruct-q3_K_M" \ llama stack build --template ollama --image-type venv --run ``` ``` cat <<-EOF > tests/verifications/conf/ollama-llama-stack.yaml base_url: http://localhost:8321/v1/openai/v1 api_key_var: OPENAI_API_KEY models: - llama3.3:70b-instruct-q3_K_M model_display_names: llama3.3:70b-instruct-q3_K_M: Llama-3.3-70B-Instruct test_exclusions: llama3.3:70b-instruct-q3_K_M: - test_chat_non_streaming_image - test_chat_streaming_image - test_chat_multi_turn_multiple_images EOF ``` ``` python -m pytest -s -v \ 'tests/verifications/openai_api/test_chat_completion.py' \ --provider=ollama-llama-stack ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-29 10:45:28 +02:00
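The workaround described in the commit above (dropping `tools` from the request when `tool_choice="none"`) can be sketched roughly as follows. This is a minimal illustration and not the provider's actual code: the function name `strip_tools_when_disabled` and the plain-dict request shape are assumptions made for the example.

```python
from typing import Any


def strip_tools_when_disabled(params: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the request params, without `tools` if tool_choice is "none".

    Hypothetical helper: the name and the dict-based request are illustrative only.
    """
    cleaned = dict(params)  # shallow copy so the caller's dict is left untouched
    if cleaned.get("tool_choice") == "none":
        # The backend may ignore tool_choice, so do not send it any tools at all.
        cleaned.pop("tools", None)
    return cleaned


request = {
    "model": "llama3.2:3b-instruct-fp16",
    "messages": [{"role": "user", "content": "What is the weather?"}],
    "tools": [{"type": "function", "function": {"name": "get_weather"}}],
    "tool_choice": "none",
}
cleaned_request = strip_tools_when_disabled(request)
```

Copying the params instead of mutating them keeps the original request available for logging or retries.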
Kevin Postlethwait	2aca7265b3	fix: add todo for schema validation (#1991 ) # What does this PR do? Change validation to TODO same as was done [here](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/inline/eval/meta_reference/eval.py#L87) until validation can be implemented Closes #1849 ## Test Plan Signed-off-by: Kevin <kpostlet@redhat.com>	2025-04-29 09:59:35 +02:00
Michael Clifford	fe9b5ef08b	fix: tools page on playground resets agent after every interaction (#2044 ) # What does this PR do? This PR updates how the `AgentType` gets set using the radio button on the tools page of the playground. This change is needed because, with the current implementation, the chat interface resets after every input, preventing users from having a multi-turn conversation with the agent. ## Test Plan Run the Playground without these changes: ```bash streamlit run llama_stack/distribution/ui/app.py ``` Navigate to the tools page and attempt to have a multi-turn conversation. You should see the conversation reset after asking a second question. Repeat the steps above with these changes and you will see that it works as expected when asking the agent multiple questions. Signed-off-by: Michael Clifford <mcliffor@redhat.com>	2025-04-28 23:13:27 +02:00
Ben Browning	8dfce2f596	feat: OpenAI Responses API (#1989 ) # What does this PR do? This provides an initial [OpenAI Responses API](https://platform.openai.com/docs/api-reference/responses) implementation. The API is not yet complete, and this is more a proof-of-concept to show how we can store responses in our key-value stores and use them to support the Responses API concepts like `previous_response_id`. ## Test Plan I've added a new `tests/integration/openai_responses/test_openai_responses.py` as part of a test-driven development for this new API. I'm only testing this locally with the remote-vllm provider for now, but it should work with any of our inference providers since the only API it requires out of the inference provider is the `openai_chat_completion` endpoint. ``` VLLM_URL="http://localhost:8000/v1" \ INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \ llama stack build --template remote-vllm --image-type venv --run ``` ``` LLAMA_STACK_CONFIG="http://localhost:8321" \ python -m pytest -v \ tests/integration/openai_responses/test_openai_responses.py \ --text-model "meta-llama/Llama-3.2-3B-Instruct" ``` --------- Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2025-04-28 14:06:00 -07:00
Sébastien Han	79851d93aa	feat: Add Kubernetes authentication (#1778 ) # What does this PR do? This commit adds a new authentication system to the Llama Stack server with support for Kubernetes and custom authentication providers. Key changes include: - Implemented KubernetesAuthProvider for validating Kubernetes service account tokens - Implemented CustomAuthProvider for validating tokens against external endpoints - this is the same code that was already present. - Added test for Kubernetes - Updated server configuration to support authentication settings - Added documentation for authentication configuration and usage The authentication system supports: - Bearer token validation - Kubernetes service account token validation - Custom authentication endpoints ## Test Plan Setup a Kube cluster using Kind or Minikube. Run a server with: ``` server: port: 8321 auth: provider_type: kubernetes config: api_server_url: http://url ca_cert_path: path/to/cert (optional) ``` Run: ``` curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers ``` Or replace "my-user" with your service account. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-28 22:24:58 +02:00
Rashmi Pawar	e6bbf8d20b	feat: Add NVIDIA NeMo datastore (#1852 ) # What does this PR do? Implementation of the NeMo Datastore register and unregister APIs. Open Issues: - provider_id gets set to `localfs` in client.datasets.register() as it is specified in routing_tables.py: DatasetsRoutingTable see: #1860 Currently I have passed `"provider_id":"nvidia"` in metadata and have parsed that in `DatasetsRoutingTable` (Not the best approach, but just a quick workaround to make it work for now.) ## Test Plan - Unit test cases: `pytest tests/unit/providers/nvidia/test_datastore.py` ```bash ========================================================== test session starts =========================================================== platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0 rootdir: /home/ubuntu/llama-stack configfile: pyproject.toml plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0 asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function collected 2 items tests/unit/providers/nvidia/test_datastore.py .. [100%] ============================================================ warnings summary ============================================================ ====================================================== 2 passed, 1 warning in 0.84s ====================================================== ``` cc: @dglogo, @mattf, @yanxi0830	2025-04-28 09:41:59 -07:00
Sajikumar JS	6cf6791de1	fix: updated watsonx inference chat apis with new repo changes (#2033 ) # What does this PR do? Recent changes in the repo require some additional functions in the inference provider, which this PR adds. It also adds one extra parameter for passing additional arguments to watsonx.ai. --------- Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>	2025-04-26 10:17:52 -07:00
ehhuang	0266b20535	docs: update prompt_format.md for llama4 (#2035 ) torchrun --nproc_per_node=8 scripts/generate_prompt_format.py meta-llama/Llama-4-Scout-17B-16E-Instruct ~/local/checkpoints/<path>/ llama_stack.models.llama.llama4.prompts llama_stack/models/llama/llama4/prompt_format.md Co-authored-by: Eric Huang <erichuang@fb.com>	2025-04-25 15:52:15 -07:00
Jash Gulabrai	8713d67ce3	fix: Correctly parse algorithm_config when launching NVIDIA customization job; fix internal request handler (#2025 ) # What does this PR do? This addresses 2 bugs I ran into when launching a fine-tuning job with the NVIDIA Adapter: 1. Session handling in `_make_request` helper function returns an error. ``` INFO: 127.0.0.1:55831 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error 16:11:45.643 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (270.44ms) 16:11:45.643 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post' Traceback (most recent call last): File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 201, in endpoint return await maybe_await(value) File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 161, in maybe_await return await value File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 408, in supervised_fine_tune response = await self._make_request( File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 98, in _make_request async with self.session.request(method, url, params=params, json=json, **kwargs) as response: File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 1425, in __aenter__ self._resp: _RetType = await self._coro File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 579, in _request handle = tm.start() File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/helpers.py", line 587, in start return self._loop.call_at(when, self.__call__) File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 724, in call_at 
self._check_closed() File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 510, in _check_closed raise RuntimeError('Event loop is closed') RuntimeError: Event loop is closed ``` Note: This only occurred when initializing the client like so: ``` client = LlamaStackClient( base_url="http://0.0.0.0:8321" ) response = client.post_training.supervised_fine_tune(...) # Returns error ``` I didn't run into this issue when using the library client: ``` client = LlamaStackAsLibraryClient("nvidia") client.initialize() response = client.post_training.supervised_fine_tune(...) # Works fine ``` 2. The `algorithm_config` param in `supervised_fine_tune` is parsed as a `dict` when run from unit tests, but a Pydantic model when invoked using the Llama Stack client. So, the call fails outside of unit tests: ``` INFO: 127.0.0.1:54024 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error 21:14:02.315 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (71.18ms) 21:14:02.314 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post' Traceback (most recent call last): File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 205, in endpoint return await maybe_await(value) File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 164, in maybe_await return await value File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 407, in supervised_fine_tune "adapter_dim": algorithm_config.get("adapter_dim"), File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/pydantic/main.py", line 891, in __getattr__ raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}') AttributeError: 'LoraFinetuningConfig' object has no attribute 'get' ``` The code assumes `algorithm_config` 
should be `dict`, so I just handle both cases. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan 1. I ran a local Llama Stack server with the necessary env vars: ``` llama stack run llama_stack/templates/nvidia/run.yaml --port 8321 --env ... ``` And invoked `supervised_fine_tune` to confirm neither of the errors above occurs. ``` client = LlamaStackClient( base_url="http://0.0.0.0:8321" ) response = client.post_training.supervised_fine_tune(...) ``` 2. I confirmed the unit tests still pass: `./scripts/unit-tests.sh tests/unit/providers/nvidia/test_supervised_fine_tuning.py` [//]: # (## Documentation) --------- Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-25 13:21:50 -07:00
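The second bug in the commit above comes down to reading a field from a value that may be either a `dict` or a model object with attributes. A rough sketch of handling both cases is shown below. The helper name `get_config_value` is hypothetical, and a plain dataclass stands in for the Pydantic `LoraFinetuningConfig` model so the example has no third-party dependencies.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class LoraConfigStandIn:
    """Stand-in for a Pydantic config model (assumption for this example)."""

    adapter_dim: int = 16


def get_config_value(algorithm_config: Any, key: str, default: Any = None) -> Any:
    """Read `key` whether the config arrived as a dict or as an object."""
    if isinstance(algorithm_config, dict):
        return algorithm_config.get(key, default)
    # Model objects expose fields as attributes and have no `.get` method.
    return getattr(algorithm_config, key, default)


from_dict = get_config_value({"adapter_dim": 8}, "adapter_dim")
from_model = get_config_value(LoraConfigStandIn(adapter_dim=32), "adapter_dim")
missing = get_config_value(LoraConfigStandIn(), "alpha", default=0)
```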
ehhuang	1b2e116a2a	fix: tool call encoded twice (#2034 ) # What does this PR do? ## Test Plan LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B --vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct	2025-04-25 13:16:16 -07:00
Derek Higgins	0e4307de0f	docs: Fix missing --gpu all flag in Docker run commands (#2026 ) adding the --gpu all flag to Docker run commands for meta-reference-gpu distributions ensures models are loaded into GPU instead of CPU. Remove docs for meta-reference-quantized-gpu The distribution was removed in #1887 but these files were left behind. Fixes: #1798 # What does this PR do? Fixes doc to add --gpu all command to docker run [//]: # (If resolving an issue, uncomment and update the line below) Closes #1798 ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] verified in docker documentation but untested --------- Signed-off-by: Derek Higgins <derekh@redhat.com>	2025-04-25 12:17:31 -07:00
Sajikumar JS	1bb1d9b2ba	feat: Add watsonx inference adapter (#1895 ) # What does this PR do? IBM watsonx ai added as the inference [#1741 ](https://github.com/meta-llama/llama-stack/issues/1741) [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) --------- Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>	2025-04-25 11:29:21 -07:00
ehhuang	29072f40ab	feat: new system prompt for llama4 (#2031 ) Tests: LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B --vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct Co-authored-by: Eric Huang <erichuang@fb.com>	2025-04-25 11:29:08 -07:00
Ashwin Bharambe	4bbd0c0693	fix: add endpoint route debugs	2025-04-25 10:40:12 -07:00
Andy Xie	f5dae0517c	feat: Support ReAct Agent on Tools Playground (#2012 ) # What does this PR do? ReAct prompting attempts to use the Thinking, Action, Observation loop to improve the model's reasoning ability via prompt engineering. With this PR, the Streamlit playground now supports the following: 1. A selection box for choosing the agent type: normal or ReAct. 2. Streamlit logic for the ReAct agent's Thinking, Action, Observation loop, as seen in many LLM clients. 3. Improved tool-calling accuracy via ReAct prompting, e.g. using web_search. Folded ![react_output_folded png](https://github.com/user-attachments/assets/bf1bdce7-e6ef-455d-b6b0-c22a64e9d5c1) Collapsed ![react_output_collapsed](https://github.com/user-attachments/assets/cda2fc17-df0b-400d-971c-988de821f2a4) ## Test Plan Run the playground and use reasoning prompts to see for yourself. Steps to test the ReAct agent mode: 1. Set up a llama-stack server as [getting_started](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html) describes. 2. Set up your Web Search API keys under `llama_stack/distribution/ui/modules/api.py`. 3. Run the streamlit playground and try the ReAct agent, possibly with `websearch`, with the command: `streamlit run llama_stack/distribution/ui/app.py`. ## Test Process Current results are demonstrated with `llama-3.2-3b-instruct`. Results will vary with different models. You should see a clear distinction between the normal agent and the ReAct agent. Example prompts listed below: 1. Aside from the Apple Remote, what other devices can control the program Apple Remote was originally designed to interact with? 2. 
What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into? ## Example Test Results Web search on AppleTV <img width="1440" alt="normal_output_appletv" src="https://github.com/user-attachments/assets/bf6b3273-1c94-4976-8b4a-b2d82fe41330" /> <img width="1440" alt="react_output_appletv" src="https://github.com/user-attachments/assets/687f1feb-88f4-4d32-93d5-5013d0d5fe25" /> Web search on Colorado <img width="1440" alt="normal_output_colorado" src="https://github.com/user-attachments/assets/10bd3ad4-f2ad-466d-9ce0-c66fccee40c1" /> <img width="1440" alt="react_output_colorado" src="https://github.com/user-attachments/assets/39cfd82d-2be9-4e2f-9f90-a2c4840185f7" /> Web search tool + MCP Slack server <img width="1250" alt="normal_output_search_slack png" src="https://github.com/user-attachments/assets/72e88125-cdbf-4a90-bcb9-ab412c51d62d" /> <img width="1217" alt="react_output_search_slack" src="https://github.com/user-attachments/assets/8ae04efb-a4fd-49f6-9465-37dbecb6b73e" /> ![slack_screenshot](https://github.com/user-attachments/assets/bb70e669-6067-462a-bdf6-7aaac6ccbcef)	2025-04-25 17:01:51 +02:00
Roland Huß	121c73c2f5	feat(cli): add interactive tab completion for image type selection (#2027 ) # What does this PR do? Enhances the user experience in the `llama stack build` command by adding interactive TAB completion for image type selection. This ensures the UX consistency with other parts of the CLI that already support tab completion, such as provider selection, providing a more intuitive and discoverable interface for users. <img width="1531" alt="image" src="https://github.com/user-attachments/assets/12161d45-451d-4820-b34d-7ea4decf810f" />	2025-04-25 16:57:42 +02:00
Surya Prakash Pathak	59b7593609	feat: Enhance tool display in Tools sidebar by simplifying tool identifiers (#2024 ) # What does this PR do? This PR improves the Tools page in the LlamaStack Playground UI by enhancing the readability of the active tool list shown in the sidebar. - Previously, active tools were displayed in a flat JSON array with verbose identifiers (e.g., builtin::code_interpreter:code_interpreter). - This PR updates the logic to group tools by their toolgroup (e.g., builtin::websearch) and renders each tool name in a simplified, human-readable format (e.g., web_search). - This change improves usability when working with multiple toolgroups, especially in configurations involving MCP tools or complex tool identifiers. Before and After Comparison: Before ![Screenshot 2025-04-24 at 1 05 47 PM](https://github.com/user-attachments/assets/44843a79-49dc-4b4d-ab28-c6187f9bb5ba) After ![Screenshot 2025-04-24 at 1 24 08 PM](https://github.com/user-attachments/assets/ebb01006-e0a9-4664-a95a-e6f72eea6f94) [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan - Followed the [LlamaStack UI Developer Setup instructions](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/distribution/ui) - Ran the Streamlit UI via: `uv run --with "[.ui]" streamlit run llama_stack/distribution/ui/app.py` - Selected multiple built-in toolgroups (e.g., code_interpreter, websearch, wolfram_alpha) from the sidebar. [//]: # (## Documentation)	2025-04-25 10:22:22 +02:00
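The grouping described in the commit above can be sketched as follows. This is only an illustration: it assumes identifiers follow the `<toolgroup>:<tool_name>` shape seen in the example `builtin::code_interpreter:code_interpreter`, and `group_tools_by_toolgroup` is a hypothetical name, not the playground's actual function.

```python
from collections import defaultdict


def group_tools_by_toolgroup(identifiers: list[str]) -> dict[str, list[str]]:
    """Group verbose tool identifiers by toolgroup, keeping only the short tool name."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for identifier in identifiers:
        # Split on the LAST ":" so that "builtin::websearch" stays intact as the group.
        toolgroup, _, tool_name = identifier.rpartition(":")
        grouped[toolgroup].append(tool_name)
    return dict(grouped)


active_tools = group_tools_by_toolgroup(
    [
        "builtin::code_interpreter:code_interpreter",
        "builtin::websearch:web_search",
    ]
)
```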
Rashmi Pawar	ace82836c1	feat: NVIDIA allow non-llama model registration (#1859 ) # What does this PR do? Adds custom model registration functionality to NVIDIAInferenceAdapter, which lets inference run on: - post-training models - non-llama models in the API Catalogue (behind https://integrate.api.nvidia.com and endpoints compatible with AsyncOpenAI) ## Example Usage: ```python from llama_stack.apis.models import Model, ModelType from llama_stack.distribution.library_client import LlamaStackAsLibraryClient client = LlamaStackAsLibraryClient("nvidia") _ = client.initialize() client.models.register( model_id=model_name, model_type=ModelType.llm, provider_id="nvidia" ) response = client.inference.chat_completion( model_id=model_name, messages=[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a limerick about the wonders of GPU computing."}], ) ``` ## Test Plan ```bash pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py ========================================================== test session starts =========================================================== platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0 rootdir: /home/ubuntu/llama-stack configfile: pyproject.toml plugins: anyio-4.9.0 collected 6 items tests/unit/providers/nvidia/test_supervised_fine_tuning.py ...... [100%] ============================================================ warnings summary ============================================================ ../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076 /home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. 
See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/ warn( -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ====================================================== 6 passed, 1 warning in 1.51s ====================================================== ``` [//]: # (## Documentation) Updated Readme.md cc: @dglogo, @sumitb, @mattf	2025-04-24 17:13:33 -07:00
Jash Gulabrai	cc77f79f55	feat: Add NVIDIA Eval integration (#1890 ) # What does this PR do? This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack eval module. The integration enables users to evaluate models via the Llama Stack interface. ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] 1. Added unit tests and successfully ran from root of project: `./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py` ``` tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED ``` 2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd) llama stack build --template nvidia --image-type venv` Documentation added to `llama_stack/providers/remote/eval/nvidia/README.md` --------- Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-24 17:12:42 -07:00
Ben Browning	0b6cd45950	fix: Additional streaming error handling (#2007 ) # What does this PR do? This expands the `test_sse` test suite and fixes some edge cases with bugs in our SSE error handling to ensure streaming clients always get a proper error response. First, we handle the case where a client disconnects before we actually start streaming the response back. Previously we only handled the case where a client disconnected as we were streaming the response, but there was an edge case where a client disconnecting before we streamed any response back did not trigger our logic to cleanly handle that disconnect. Second, we handle the case where an error is thrown from the server before the actual async generator gets created from the provider. This happens in scenarios like the newly merged OpenAI API input validation, where we eagerly raise validation errors before returning the async generator object that streams the responses back. ## Test Plan Tested via: ``` python -m pytest -s -v tests/unit/server/test_sse.py ``` Both test cases failed before, and passed afterwards. The test cases were written based on me experimenting with actual clients that would do bad things like randomly disconnect or send invalid input in streaming mode and I hit these two cases, where things were misbehaving in our error handling. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-24 17:01:45 -07:00
Derek Higgins	c8797f1125	fix: Including tool call in chat (#1931 ) Include the tool call details with the chat when doing RAG with remote vLLM. Fixes: #1929 With this PR the tool call is included in the chat returned to vLLM, and the model (meta-llama/Llama-3.1-8B-Instruct) then returns the answer as expected. Signed-off-by: Derek Higgins <derekh@redhat.com>	2025-04-24 16:59:10 -07:00
ehhuang	7ed137e963	fix: meta ref inference (#2022 ) MAX_BATCH_SIZE=10 LLAMA_MODELS_DEBUG=1 LLAMA_STACK_PORT=5002 LLAMA_STACK_LOGGING='all=info' llama stack run meta-reference-gpu --env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct --env INFERENCE_CHECKPOINT_DIR=... LLAMA_STACK_CONFIG=http://localhost:5002/ pytest -s -v tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B --vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct Co-authored-by: Eric Huang <erichuang@fb.com>	2025-04-24 13:03:35 -07:00
Ashwin Bharambe	a5d6ab16b2	fix: meta-reference parallel utils bug, use isinstance not equality	2025-04-24 11:27:49 -07:00
Ilya Kolchinsky	e664ba91d8	fix: prevent the knowledge search tool from confusing the model with long content (#1908 ) # What does this PR do? This PR addresses the content dominance problem that frequently arises with multiple models when executing queries with the RAG tool. When the retrieved content is too large, it disproportionately influences the generation process, causing the model to ignore the original question and to provide meaningless comments on the retrieved information instead. This situation is especially common with agentic RAG, which is the standard way of doing RAG in Llama Stack, since directly manipulating the prompt combining the query with the retrieved content is not possible. This PR appends a grounding message to the results returned by the knowledge search tool, reminding the model about the original query and the purpose of the inference call. This makes the problem significantly less likely to occur. ## Test Plan Running the following script before the fix demonstrates the content dominance problem, where the model insists on commenting on the retrieved content and refuses to address the question. Running the script after the fix results in getting the correct answer. ``` import os import uuid from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient # the server endpoint LLAMA_STACK_SERVER_URL = "http://localhost:8321" # inference settings MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct" SYSTEM_PROMPT = "You are a helpful assistant. 
" # RAG settings VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2" VECTOR_DB_EMBEDDING_DIMENSION = 384 VECTOR_DB_CHUNK_SIZE = 512 # initialize the server connection client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL)) # init the RAG retrieval parameters vector_db_id = f"test_vector_db_{uuid.uuid4()}" vector_providers = [ provider for provider in client.providers.list() if provider.api == "vector_io" ] vector_provider_to_use = vector_providers[0] # define and register the document collection to be used client.vector_dbs.register( vector_db_id=vector_db_id, embedding_model=VECTOR_DB_EMBEDDING_MODEL, embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION, provider_id=vector_provider_to_use.provider_id, ) # ingest the documents into the newly created document collection urls = [ ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"), ] documents = [ RAGDocument( document_id=f"num-{i}", content=url, mime_type=url_type, metadata={}, ) for i, (url, url_type) in enumerate(urls) ] client.tool_runtime.rag_tool.insert( documents=documents, vector_db_id=vector_db_id, chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE, ) queries = [ "How to install OpenShift?", ] # initializing the agent agent = Agent( client, model=MODEL_ID, instructions=SYSTEM_PROMPT, # we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools tools=[ dict( name="builtin::rag/knowledge_search", args={ "vector_db_ids": [vector_db_id], # list of IDs of document collections to consider during retrieval }, ) ], ) for prompt in queries: print(f"User> {prompt}") # create a new turn with a new session ID for each prompt response = agent.create_turn( messages=[ { "role": "user", "content": prompt, } ], session_id=agent.create_session(f"rag-session_{uuid.uuid4()}") ) # print the response, including tool calls output for log in AgentEventLogger().log(response): print(log.content, end='') ```	2025-04-24 16:38:38 +02:00
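The grounding idea from the commit above can be sketched as a small post-processing step on the tool output. This is a simplified illustration under stated assumptions: the function name `add_grounding_message` and the wording of the reminder are invented for the example and are not taken from the actual change.

```python
def add_grounding_message(retrieved_chunks: list[str], query: str) -> list[str]:
    """Append a reminder of the original query after the retrieved content."""
    # Illustrative wording only; the real grounding text may differ.
    grounding = (
        f"The results above were retrieved to help answer this query: '{query}'. "
        "Use them as supporting information when answering that query."
    )
    # Placing the reminder last keeps the question close to where generation starts.
    return [*retrieved_chunks, grounding]


tool_result = add_grounding_message(
    ["chunk one of the installation guide", "chunk two of the installation guide"],
    "How to install OpenShift?",
)
```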
Sébastien Han	14e60e3c02	feat: include run.yaml in the container image (#2005 ) As part of the build process, we now include the generated run.yaml (based on the provided build configuration file) into the container. We updated the entrypoint to use this run configuration as well. Given this simple distribution configuration: ``` # build.yaml version: '2' distribution_spec: description: Use (an external) Ollama server for running LLM inference providers: inference: - remote::ollama vector_io: - inline::faiss safety: - inline::llama-guard agents: - inline::meta-reference telemetry: - inline::meta-reference eval: - inline::meta-reference datasetio: - remote::huggingface - inline::localfs scoring: - inline::basic - inline::llm-as-judge - inline::braintrust tool_runtime: - remote::brave-search - remote::tavily-search - inline::code-interpreter - inline::rag-runtime - remote::model-context-protocol - remote::wolfram-alpha container_image: "registry.access.redhat.com/ubi9" image_type: container image_name: test ``` Build it: ``` llama stack build --config build.yaml ``` Run it: ``` podman run --rm \ -p 8321:8321 \ -e OLLAMA_URL=http://host.containers.internal:11434 \ --name llama-stack-server \ localhost/leseb-test:0.2.2 ``` Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-24 11:29:53 +02:00
Ben Browning	fa5dfee07b	fix: Return HTTP 400 for OpenAI API validation errors (#2002 ) # What does this PR do? When clients called the Open AI API with invalid input that wasn't caught by our own Pydantic API validation but instead only caught by the backend inference provider, that backend inference provider was returning a HTTP 400 error. However, we were wrapping that into a HTTP 500 error, obfuscating the actual issue from calling clients and triggering OpenAI client retry logic. This change adjusts our existing `translate_exception` method in `server.py` to wrap `openai.BadRequestError` as HTTP 400 errors, passing through the string representation of the error message to the calling user so they can see the actual input validation error and correct it. I tried changing this in a few other places, but ultimately `translate_exception` was the only real place to handle this for both streaming and non-streaming requests across all inference providers that use the OpenAI server APIs. This also tightens up our validation a bit for the OpenAI chat completions API, to catch empty `messages` parameters, invalid `tool_choice` parameters, invalid `tools` items, or passing `tool_choice` when `tools` isn't given. Lastly, this extends our OpenAI API chat completions verifications to also check for consistent input validation across providers. Providers behind Llama Stack should automatically pass all the new tests due to the input validation added here, but some of the providers fail this test when not run behind Llama Stack due to differences in how they handle input validation and errors. 
(Closes #1951) ## Test Plan To test this, start an OpenAI API verification stack: ``` llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml ``` Then, run the new verification tests with your provider(s) of choice: ``` python -m pytest -s -v \ tests/verifications/openai_api/test_chat_completion.py \ --provider openai-llama-stack python -m pytest -s -v \ tests/verifications/openai_api/test_chat_completion.py \ --provider together-llama-stack ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-23 17:48:32 +02:00
Michael Clifford	64f747fe09	feat: add tool name to chat output in playground (#1996 ) # What does this PR do? This PR adds the name of the tool that is used by the agent on the "tools" page of the playground. See image below for an example. ![Screenshot 2025-04-18 at 3 14 18 PM](https://github.com/user-attachments/assets/04e97783-4003-4121-9446-9e0ad7209256) ## Test Plan Run the playground and navigate to the tools page. There users can see that this additional text is present when tools are invoked and absent when they are not. ``` streamlit run llama_stack/distribution/ui/app.py ``` Signed-off-by: Michael Clifford <mcliffor@redhat.com>	2025-04-23 15:57:54 +02:00
Ben Browning	dc46725f56	fix: properly handle streaming client disconnects (#2000 ) # What does this PR do? Previously, when a streaming client would disconnect before we were finished streaming the entire response, an error like the below would get raised from the `sse_generator` function in `llama_stack/distribution/server/server.py`: ``` AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'? ``` This was because we were calling `aclose` on a coroutine instead of the awaited value from that coroutine. This change fixes that, so that we save off the awaited value and then can call `aclose` on it if we encounter an `asyncio.CancelledError`, like we see when a client disconnects before we're finished streaming. The other changes in here are to add a simple set of tests for the happy path of our SSE streaming and this client disconnect path. That unfortunately requires adding one more dependency into our unit test section of pyproject.toml since `server.py` requires loading some of the telemetry code for me to test this functionality. ## Test Plan I wrote the tests in `tests/unit/server/test_sse.py` first, verified the client disconnected test failed before my change, and that it passed afterwards. ``` python -m pytest -s -v tests/unit/server/test_sse.py ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-23 15:44:28 +02:00
Ilya Kolchinsky	deee355952	fix: Added lazy initialization of the remote vLLM client to avoid issues with expired asyncio event loop (#1969 ) # What does this PR do? Closes #1968. The asynchronous client in `VLLMInferenceAdapter` is now initialized directly before first use and not in `VLLMInferenceAdapter.initialize`. This prevents issues arising due to accessing an expired event loop from a completed `asyncio.run`. ## Test Plan Ran unit tests, including `test_remote_vllm.py`. Ran the code snippet mentioned in #1968. --------- Co-authored-by: Sébastien Han <seb@redhat.com>	2025-04-23 15:33:19 +02:00
Ilya Kolchinsky	d39462d073	feat: Hide tool output under an expander in Playground UI (#2003 ) # What does this PR do? Now, tool outputs and retrieved chunks from the vector DB (i.e., everything except for the actual model reply) are hidden under an expander form when presented to the user. # Test Plan Navigate to the RAG page in the Playground UI.	2025-04-23 15:32:12 +02:00
Ben Browning	825ce39879	fix: Together provider shutdown and default to non-streaming (#2001 ) # What does this PR do? The together inference provider was throwing a stack trace every time it shut down, as it was trying to call a non-existent `close` method on the AsyncTogether client. While fixing that, I also adjusted its shutdown logic to close the OpenAI client if we've created one of those, as that client does have a `close` method. In testing that, I also realized we were defaulting to treating all requests as streaming requests instead of defaulting to non-streaming. So, this flips that default to non-streaming to match how the other providers work. ## Test Plan I tested this by ensuring the together inference provider no longer spits out a long stack trace when shutting it down and by running the OpenAI API chat completion verification suite to ensure the change in default streaming logic didn't mess anything else up. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-22 17:47:53 +02:00
Michael Clifford	e4d001c4e4	feat: cleanup sidebar formatting on tools playground (#1998 ) # What does this PR do? This PR cleans up the sidebar on the tools page of the playground in the following ways: * created a clearer hierarchy of configuration options and tool selections. * Removed the `mcp::` or `builtin::` prefixes from the tool selection buttons. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Run the playground and see the updated sidebar does not cause any new errors. ``` streamlit run llama_stack/distribution/ui/app.py ``` [//]: # (## Documentation) Signed-off-by: Michael Clifford <mcliffor@redhat.com>	2025-04-22 10:40:37 +02:00
Kevin Postlethwait	3110ad1e7c	fix: update ref to raw_errors due to new version of pydantic (#1995 ) `37da47ef8e (diff-4d7c51b1efe9043e44439a949dfd92e5827321b34082903477fd04876edb7552)` Pydantic was updated from v1 to v2 in this commit which caused this breaking change # What does this PR do? Part of #1857 This won't fix the Validation error with the example, but it will correctly supply user with a proper error rather than a 5xx code. Signed-off-by: Kevin <kpostlet@redhat.com>	2025-04-21 11:50:12 -07:00
Ben Browning	602e949a46	fix: OpenAI Completions API and Fireworks (#1997 ) # What does this PR do? We were passing a dict into the compat mixin for OpenAI Completions when using Llama models with Fireworks, and that was breaking some strong typing code that was added in openai_compat.py. We shouldn't have been converting these params to a dict in that case anyway, so this adjusts things to pass the params in as their actual original types when calling the OpenAIChatCompletionToLlamaStackMixin. ## Test Plan All of the fireworks provider verification tests were failing due to some OpenAI compatibility cleanup in #1962. The changes in that PR were good to make, and this just cleans up the fireworks provider code to stop passing in untyped dicts to some of those `openai_compat.py` methods since we have the original strongly-typed parameters we can pass in. ``` llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml ``` ``` python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=fireworks-llama-stack ``` Before this PR, all of the fireworks OpenAI verification tests were failing. Now, most of them are passing. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-21 11:49:12 -07:00
Jash Gulabrai	0d06c654d0	feat: Update NVIDIA to GA docs; remove notebook reference until ready (#1999 ) # What does this PR do? - Update NVIDIA documentation links to GA docs - Remove reference to notebooks until merged [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation) Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-18 19:13:18 -04:00
Sébastien Han	94f83382eb	feat: allow building distro with external providers (#1967 ) # What does this PR do? We can now build a distribution that includes external providers. Closes: https://github.com/meta-llama/llama-stack/issues/1948 ## Test Plan Build a distro with an external provider following the doc instructions. [//]: # (## Documentation) Added. Rendered: ![Screenshot 2025-04-18 at 11 26 39](https://github.com/user-attachments/assets/afcf3d50-8d30-48c3-8d24-06a4b3662881) Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-18 17:18:28 +02:00
Yuan Tang	c4570bcb48	docs: Add tips for debugging remote vLLM provider (#1992 ) # What does this PR do? This is helpful when debugging issues with vLLM + Llama Stack after this PR https://github.com/vllm-project/vllm/pull/15593 --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-04-18 14:47:47 +02:00
Matthew Farrellee	9845631d51	feat: update nvidia inference provider to use model_store (#1988 ) # What does this PR do? NVIDIA Inference provider was using the ModelRegistryHelper to map input model ids to provider model ids. this updates it to use the model_store. ## Test Plan `LLAMA_STACK_CONFIG=http://localhost:8321 uv run pytest -v tests/integration/inference/{test_embedding.py,test_text_inference.py,test_openai_completion.py} --embedding-model nvidia/llama-3.2-nv-embedqa-1b-v2 --text-model=meta-llama/Llama-3.1-70B-Instruct`	2025-04-18 10:16:43 +02:00
Alexey Rybak	e72b1076ca	fix(build): add UBI 9 compiler tool‑chain (#1983) # What does this PR do? Fixes the UBI 9 container build failure ( `error: command 'gcc' failed` when installing `polyleven`, `faiss`, etc.) by installing the missing compiler tool‑chain: - `python3.11-devel`, `gcc`, and `make` added to the UBI 9 `dnf install` line. ### Closes #1970 ## Test Plan - Build a distro with a UBI image	2025-04-18 09:49:10 +02:00
ehhuang	2976b5d992	fix: OAI compat endpoint for meta reference inference provider (#1962 ) Test plan: python tests/verifications/generate_report.py --providers fireworks,together,llama_meta_ref,openai Co-authored-by: Eric Huang <erichuang@fb.com>	2025-04-17 11:16:04 -07:00
Alexey Rybak	326cbba579	feat(agents): add agent naming functionality (#1922 ) # What does this PR do? Allow users to name an agent and use the name in telemetry instead of relying on randomly generated agent_ids. This improves the developer experience by making it easier to find specific agents in telemetry logs. Closes #1832 ## Test Plan - Added tests to verify the agent name is properly stored and retrieved - Ran `uv run -- pytest -v tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering` from the root of the project and made sure the tests pass - Ran `uv run -- pytest -v tests/integration/telemetry/test_telemetry.py::test_agent_query_spans` to verify existing code without agent names still works correctly ## Use Example ``` agent = Agent( llama_stack_client, model=text_model_id, name="CustomerSupportAgent", # New parameter instructions="You are a helpful customer support assistant" ) session_id = agent.create_session(f"test-session-{uuid4()}") ``` ## Implementation Notes - Agent names are optional string parameters with no additional validation - Names are not required to be unique - multiple agents can have the same name - The agent_id remains the unique identifier for an agent --------- Co-authored-by: raghotham <raghotham@gmail.com>	2025-04-17 07:02:47 -07:00
Ben Browning	5b8e75b392	fix: OpenAI spec cleanup for assistant requests (#1963) # What does this PR do? Some of our multi-turn verification tests were failing because I had accidentally marked content as a required field in the OpenAI chat completion request assistant messages, but it's actually optional. It is required for messages from other roles, but assistant is explicitly allowed to be optional. Similarly, the assistant message tool_calls field should default to None instead of an empty list. These two changes get the openai-llama-stack verification test back to 100% passing, just like it passes 100% when not behind Llama Stack. They also increase the pass rate of some of the other providers in the verification test, but don't get them to 100%. ## Test Plan I started a Llama Stack server set up to run all the verification tests (requires OPENAI_API_KEY env variable) ``` llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml ``` Then, I manually ran the verification tests to see which were failing, fixed them, and ran them again after these changes to ensure they were all passing. ``` python -m pytest -s -v tests/verifications/openai_api/test_chat_completion.py --provider=openai-llama-stack ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-17 06:56:10 -07:00
Matthew Farrellee	4205376653	chore: add meta/llama-3.3-70b-instruct as supported nvidia inference provider model (#1985 ) see https://build.nvidia.com/meta/llama-3_3-70b-instruct	2025-04-17 06:50:40 -07:00
Jash Gulabrai	2ae1d7f4e6	docs: Add NVIDIA platform distro docs (#1971) # What does this PR do? Add NVIDIA platform docs that serve as a starting point for Llama Stack users and explain all supported microservices. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation) --------- Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-17 05:54:30 -07:00
Jash Gulabrai	45e08ff417	fix: Handle case when Customizer Job status is unknown (#1965 ) # What does this PR do? This PR handles the case where a Customization Job's status is `unknown`. Since we don't map `unknown` to a valid `JobStatus`, the PostTraining provider throws an exception when fetching/listing a job. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] `./scripts/unit-tests.sh tests/unit/providers/nvidia/test_supervised_fine_tuning.py` succeeds [//]: # (## Documentation) Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-17 10:27:07 +02:00
Alexey Rybak	8f57b08f2c	fix(build): always pass path when no template/config provided (#1982 ) # What does this PR do? Fixes a crash that occurred when building a stack as a container image via the interactive wizard without supplying --template or --config. - Root cause: template_or_config was None; only the container path relies on that parameter, which later reaches subprocess.run() and triggers `TypeError: expected str, bytes or os.PathLike object, not NoneType.` - Change: in `_run_stack_build_command_from_build_config` we now fall back to the freshly‑written build‑spec file whenever both optional sources are missing. Also adds a spy‑based unit test that asserts a valid string path is passed to build_image() for container builds. ### Closes #1976 ## Test Plan - New unit test: test_build_path.py. Monkey‑patches build_image, captures the fourth argument, and verifies it is a real path - Manual smoke test: ``` llama stack build --image-type container # answer wizard prompts ``` Build proceeds into Docker without raising the previous TypeError. ## Future Work Harmonise `build_image` arguments so every image type receives the same inputs, eliminating this asymmetric special‑case.	2025-04-17 10:20:43 +02:00
Sébastien Han	6ed92e03bc	fix: print traceback on build failure (#1966 ) # What does this PR do? Build failures are hard to read, sometimes we get errors like: ``` Error building stack: 'key' ``` Which are difficult to debug without a proper trace. ## Test Plan If `llama stack build` fails you get a traceback now. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-17 09:45:21 +02:00
Michael Clifford	f12011794b	fix: Updated tools playground to allow vdb selection (#1960 ) # What does this PR do? This PR lets users select an existing vdb to use with their agent on the tools page of the playground. The drop down menu that lets users select a vdb only appears when the rag tool is selected. Without this change, there is no way for a user to specify which vdb they want their rag tool to use on the tools page. I have intentionally left the RAG options sparse here since the full RAG options are exposed on the RAG page. ## Test Plan Without these changes the RAG tool will throw the following error: `name: knowledge_search) does not have any content ` With these changes the RAG tool works as expected. Signed-off-by: Michael Clifford <mcliffor@redhat.com>	2025-04-17 09:29:40 +02:00
Jash Gulabrai	30fc66923b	fix: Add llama-3.2-1b-instruct to NVIDIA fine-tuned model list (#1975 ) # What does this PR do? Adds `meta/llama-3.2-1b-instruct` to list of models that NeMo Customizer can fine-tune. This is the model our example notebooks typically use for fine-tuning. [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation) Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>	2025-04-16 15:02:08 -07:00
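
Note on dc46725f56 (streaming client disconnects): the failure described there, calling `aclose` on a coroutine rather than on the value it resolves to, can be reproduced outside the server. The sketch below is not the code from `server.py`; `numbers`, `open_stream`, `consume`, and `run_demo` are hypothetical names, and the client disconnect is simulated by raising `asyncio.CancelledError` by hand. It only illustrates the pattern the commit message describes: save the awaited value, then call `aclose` on it when cancellation arrives.

```python
import asyncio


async def numbers(log):
    """Async generator standing in for a streaming response (hypothetical)."""
    try:
        for i in range(1000):
            yield i
            await asyncio.sleep(0)
    finally:
        # Runs when the generator is closed early via aclose().
        log.append("closed")


async def open_stream(log):
    """Hypothetical coroutine that returns the async generator once awaited."""
    return numbers(log)


async def consume(log, limit):
    coro = open_stream(log)
    # A coroutine object exposes close(), but not aclose().
    coro_has_aclose = hasattr(coro, "aclose")

    # Save the awaited value so it can be closed later.
    stream = await coro
    items = []
    try:
        async for item in stream:
            items.append(item)
            if len(items) >= limit:
                # Simulated client disconnect (assumption for this sketch).
                raise asyncio.CancelledError()
    except asyncio.CancelledError:
        # aclose() exists on the async generator, not on the coroutine.
        await stream.aclose()
    return coro_has_aclose, items


def run_demo():
    log = []
    coro_has_aclose, items = asyncio.run(consume(log, 3))
    return coro_has_aclose, items, log
```

Calling `run_demo()` drives three items through the stream, triggers the simulated disconnect, and records that the generator's cleanup ran. A real handler would normally re-raise the cancellation after cleanup; this sketch swallows it only so the result can be inspected.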

1087 commits