# What does this PR do?
In our OpenAI API verification tests, some providers were still calling
tools even when `tool_choice="none"` was passed in the chat completion
requests. Because not all providers respect `tool_choice` properly, this
adjusts our routing implementation to remove `tools` and `tool_choice`
from the request when `tool_choice="none"` is passed in, so that no
provider attempts to call any of those tools. Making the adjustment in
the router fixes this across all providers.
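A minimal sketch of the idea, assuming the chat completion parameters are available as a plain dict at the routing layer (the helper name and hook point are illustrative, not the actual router code):
```python
def strip_tools_if_disabled(params: dict) -> dict:
    # When the caller explicitly disables tool calling, drop both fields so
    # providers that ignore tool_choice cannot call the tools anyway.
    if params.get("tool_choice") == "none":
        params = {k: v for k, v in params.items() if k not in ("tools", "tool_choice")}
    return params
```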
This also cleans up the non-streaming together.ai responses for tools,
ensuring `None` is returned instead of an empty list when there are no
tool calls, to exactly match the OpenAI API responses in that case.
## Test Plan
I observed existing failures in our OpenAI API verification suite - see
https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#together-llama-stack
for the failing `test_chat_*_tool_choice_none` tests. All streaming and
non-streaming variants were failing across all 3 tested models.
After this change, all six of those failing tests pass, with no
regressions in the other tests.
I verified this via:
```
llama stack run --image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
```
python -m pytest -s -v \
'tests/verifications/openai_api/test_chat_completion.py' \
--provider=together-llama-stack
```
The entire verification suite does not pass 100% on together.ai yet, but
it's getting closer.
This also increased the pass rate for fireworks.ai, and did not regress
the groq or openai tests at all.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Switch the SambaNova inference adapter to LiteLLM to simplify the
integration and fix issues in the current adapter with streaming and
tool calling. Models and templates were updated as well.
## Test Plan
```
pytest -s -v tests/integration/inference/test_text_inference.py \
  --stack-config=sambanova \
  --text-model=sambanova/Meta-Llama-3.3-70B-Instruct

pytest -s -v tests/integration/inference/test_vision_inference.py \
  --stack-config=sambanova \
  --vision-model=sambanova/Llama-3.2-11B-Vision-Instruct
```
# What does this PR do?
Checks for RAGDocument of type InterleavedContent
I noticed when stepping through the code that the supported content
types for `RAGDocument` include `InterleavedContent`. That type is not
checked before `doc.content` is regex-matched, which causes a runtime
error. This change adds an explicit type check.
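As a minimal sketch of the kind of guard this implies (the helper name and its exact location in the RAG ingestion path are illustrative, not the actual patch):
```python
def content_to_text(content) -> str:
    # Only plain strings are safe to regex-match directly.
    if isinstance(content, str):
        return content
    # InterleavedContent (including ImageContent) would otherwise crash at
    # runtime, so fail with a clear error instead.
    raise TypeError(f"Unsupported RAGDocument content type: {type(content).__name__}")
```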
The only other part I'm unclear on is how to handle the `ImageContent`
type, since that would always just return `<image>`, which seems like
undesired behavior. Should the `InterleavedContent` type be removed from
`RAGDocument` and replaced with `URI | str`?
## Test Plan
---------
Signed-off-by: Kevin <kpostlet@redhat.com>
# What does this PR do?
Mainly tried to cover the entire `llama_stack/apis` directory; we only
have one left. Some excludes were just no-ops.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Revert a change that mistakenly forced `efficiency_config` on torchtune
provider users.
```
fix: Don't require efficiency_config for torchtune
It was enforced by mistake when
0751a960a5 merged.
Other asserts made sense in that the code was written, potentially, to
always expect a non-None value. But not efficiency_config.
```
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
For issue #[2010](https://github.com/meta-llama/llama-stack/issues/2010):
Currently, if we try to connect the Llama Stack server to a remote
Milvus instance that has TLS enabled, the connection fails because TLS
support is not implemented in the Llama Stack codebase. As a result,
users are unable to use secured Milvus deployments out of the box.
After this change, users will be able to connect to a TLS-enabled
remote::milvus instance.
If TLS is enabled:
```
vector_io:
  - provider_id: milvus
    provider_type: remote::milvus
    config:
      uri: "http://<host>:<port>"
      token: "<user>:<password>"
      secure: True
      server_pem_path: "path/to/server.pem"
```
## Test Plan
I tested this by connecting to a TLS-enabled Milvus instance, and I was
able to start the Llama Stack server.
# What does this PR do?
When converting OpenAI message content for the "system" and "assistant"
roles to Llama Stack inference APIs (used for some providers when
dealing with Llama models via OpenAI API requests to get proper prompt /
tool handling), we were not properly converting any non-string content.
I discovered this while running the new Responses API verification suite
against the Fireworks provider, but rather than fixing it as part of
that ongoing work, I split it out into a separate PR.
This fixes that, by using the `openai_content_to_content` helper we used
elsewhere to ensure content parts were mapped properly.
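For illustration, this is the shape of input that was previously mishandled (a hypothetical example, not a test case from the PR):
```python
# A system message whose content is a list of parts rather than a plain
# string. Before the fix, non-string content like this was not converted
# for the "system" and "assistant" roles; after the fix it goes through the
# openai_content_to_content mapping described above.
system_message = {
    "role": "system",
    "content": [{"type": "text", "text": "You are a helpful assistant."}],
}
```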
## Test Plan
I added a couple of new tests to `test_openai_compat` to reproduce this
issue and validate its fix. I ran those as below:
```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given that
better MCP code interpreter servers are becoming available, we should
use them instead of baking an implementation into the Stack and
expanding its vulnerability surface.
This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.
## Test Plan
Existing tests.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here were automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth noting is under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py`, where reflection code was updated
to deal with the "newer" types.
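As a rough sketch of what handling both spellings involves (illustrative only, not the actual reflection code):
```python
import collections.abc
import types
import typing

def is_union(tp) -> bool:
    # typing.Union[int, str] and the newer int | str (types.UnionType) must
    # be treated the same way by schema reflection.
    return typing.get_origin(tp) is typing.Union or isinstance(tp, types.UnionType)

def is_async_iterator(tp) -> bool:
    # typing.AsyncIterator[...] resolves to collections.abc.AsyncIterator as
    # its origin on recent Python releases.
    return typing.get_origin(tp) is collections.abc.AsyncIterator
```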
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Nullable param types, e.g. `['string', 'null']`, are not supported since
they fail type validation.
Tests:
Run inference with:
```
messages:
  - content: You are a helpful assistant that can use tools to get information.
    role: system
  - content: What's the temperature in San Francisco in celsius?
    role: user
tools:
  - function:
      description: Get current temperature for a given location.
      name: get_weather
      parameters:
        additionalProperties: false
        properties:
          location:
            description: "City and country e.g. Bogotá, Colombia"
            type: string
          unit:
            description: "Unit of temperature, default to celsius"
            type: [string, "null"]  # <= nullable type
        required:
          - location
        type: object
    type: function
```
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
Add support for the temperature parameter to the Responses API.
## Test Plan
Manually tested the simple case; unit tests added for the simple case
and tool calls.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
When the result of a ToolCall gets passed back into vLLM for the model
to handle the tool call result (as is often the case in agentic
tool-calling workflows), we forgot to handle the case where BuiltinTool
calls are not string values but instead instances of the BuiltinTool
enum. This fixes that, properly converting those enums to string values
before trying to serialize them into an OpenAI chat completion request
to vLLM.
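As a small sketch of the conversion involved (the surrounding serialization helper is not shown, and the function name here is illustrative):
```python
from enum import Enum

def tool_name_as_str(tool_name) -> str:
    # BuiltinTool call results arrive as enum members rather than strings,
    # so take the enum's value before building the OpenAI-compatible request.
    return tool_name.value if isinstance(tool_name, Enum) else tool_name
```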
PR #1931 fixed a bug where we weren't passing these tool calling results
back into vLLM, but as a side-effect it created this serialization bug
when using BuiltinTools.
Closes #2070
## Test Plan
I added a new unit test to the openai_compat unit tests to cover this
scenario, ensured the new test failed before this fix, and all the
existing tests there plus the new one passed with this fix.
```
python -m pytest -s -v tests/unit/providers/utils/inference/test_openai_compat.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Add several new pre-commit hooks to improve code quality and security:
- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml
The respective fixes have been included.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Partial revert of fa68ded07c
This commit ensures users know where their new templates are generated
and how to run the newly built distro locally.
Discussion on Discord: 1351652390
## Test Plan
Did a local run - let me know if we want any unit testing covering this

## Documentation
Updated "Zero to Hero" guide with new output
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
- Added new Ruff lint rules to detect ambiguous or non-ASCII characters.
- Added per-file ignores where Unicode usage is still required.
- Fixed whatever had to be fixed
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
When running a Llama Stack server and invoking the
`/v1/safety/run-shield` endpoint, the NVIDIA Guardrails endpoint in some
cases errors with a `422: Unprocessable Entity` due to malformed input.
For example, given a request body like:
```
{
  "model": "test",
  "messages": [
    { "role": "user", "content": "You are stupid." }
  ]
}
```
`convert_pydantic_to_json_value` converts the message to:
```
{ "role": "user", "content": "You are stupid.", "context": null }
```
Which causes NVIDIA Guardrails to return an error `HTTPError: 422 Client
Error: Unprocessable Entity for url:
http://nemo.test/v1/guardrail/checks`, because `context` shouldn't be
included in the body.
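A minimal sketch of the kind of cleanup this implies, assuming the payload is a plain dict just before it is posted to Guardrails (the helper name is illustrative):
```python
def drop_none_values(payload: dict) -> dict:
    # NVIDIA Guardrails rejects unexpected fields such as "context": null,
    # so strip keys whose values are None before sending the request.
    return {k: v for k, v in payload.items() if v is not None}
```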
## Test Plan
I ran the Llama Stack server locally and manually verified that the
endpoint now succeeds.
```
message = {"role": "user", "content": "You are stupid."}
response = client.safety.run_shield(messages=[message], shield_id=shield_id, params={})
```
Server logs:
```
14:29:09.656 [START] /v1/safety/run-shield
INFO: 127.0.0.1:54616 - "POST /v1/safety/run-shield HTTP/1.1" 200 OK
14:29:09.918 [END] /v1/safety/run-shield [StatusCode.OK] (262.26ms
```
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
Replaced `${env.OTEL_SERVICE_NAME:\u200B}` and similar variants with
properly formatted `${env.OTEL_SERVICE_NAME:}` across all YAML templates
and TelemetryConfig. This prevents silent parsing issues and ensures
consistent environment variable resolution.
This slipped in via https://github.com/meta-llama/llama-stack/pull/2058.
Signed-off-by: Sébastien Han <seb@redhat.com>
Distribution Template Codegen was broken
# What does this PR do?
## Test Plan
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
The telemetry provider config is the only one that uses the env var
`SQLITE_DB_PATH` to point to persistent data in the respective
templates, whereas `SQLITE_STORE_DIR` is used everywhere else.
This PR modifies the `sqlite_db_path` in the various telemetry
configuration files to use the environment variable `SQLITE_STORE_DIR`
instead of `SQLITE_DB_PATH`. This ensures that _only_ `SQLITE_STORE_DIR`
needs to be set to point providers at a different persistence location.
All references to `SQLITE_DB_PATH` have been removed.
Another improvement could be to move `sqlite_db_path` to `db_path` in
the telemetry provider config, to align with the other provider
configurations. That could be done by another PR (if wanted).
# What does this PR do?
In our OpenAI API verification tests, ollama was still calling tools
even when `tool_choice="none"` was passed in its chat completion
requests. Because ollama isn't respecting `tool_choice` properly, this
adjusts our provider implementation to remove the `tools` from the
request if `tool_choice="none"` is passed in so that it does not attempt
to call any of those tools.
## Test Plan
I tested this with a couple of Llama models, using both our OpenAI
completions integration tests and our verification test suites.
### OpenAI Completions / Chat Completions integration tests
These all passed before, and still do.
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" \
llama stack build --template ollama --image-type venv --run
```
```
LLAMA_STACK_CONFIG=http://localhost:8321 \
python -m pytest -v \
tests/integration/inference/test_openai_completion.py \
--text-model "llama3.2:3b-instruct-fp16"
```
### OpenAI API Verification test suite
The `test_chat_*_tool_choice_none` OpenAI API verification tests now
pass, where they previously failed.
See
https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#ollama-llama-stack
for an example of these failures from a recent nightly CI run.
```
INFERENCE_MODEL="llama3.3:70b-instruct-q3_K_M" \
llama stack build --template ollama --image-type venv --run
```
```
cat <<-EOF > tests/verifications/conf/ollama-llama-stack.yaml
base_url: http://localhost:8321/v1/openai/v1
api_key_var: OPENAI_API_KEY
models:
- llama3.3:70b-instruct-q3_K_M
model_display_names:
llama3.3:70b-instruct-q3_K_M: Llama-3.3-70B-Instruct
test_exclusions:
llama3.3:70b-instruct-q3_K_M:
- test_chat_non_streaming_image
- test_chat_streaming_image
- test_chat_multi_turn_multiple_images
EOF
```
```
python -m pytest -s -v \
'tests/verifications/openai_api/test_chat_completion.py' \
--provider=ollama-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR updates how the `AgentType` gets set using the radio button on
the tools page of the playground. This change is needed because, with
the current implementation, the chat interface resets after every input,
preventing users from having a multi-turn conversation with
the agent.
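A rough sketch of the Streamlit pattern involved, assuming the selection is kept in `st.session_state` so reruns don't wipe the conversation (the widget key and option names are illustrative, not the playground's actual code):
```python
import streamlit as st

# Persist the radio selection across Streamlit reruns so selecting or reusing
# it does not recreate the agent and reset the chat history on every input.
if "agent_type" not in st.session_state:
    st.session_state.agent_type = "Regular"

st.radio("Agent Type", ["Regular", "ReAct"], key="agent_type")
```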
## Test Plan
Run the Playground without these changes:
```bash
streamlit run llama_stack/distribution/ui/app.py
```
Navigate to the tools page and attempt to have a multi-turn
conversation. You should see the conversation reset after asking a
second question.
Repeat the steps above with these changes and you will see that it works
as expected when asking the agent multiple questions.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.
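For context, a hypothetical client-side use of the `previous_response_id` concept against a Llama Stack deployment (the base URL, API key, and model name are assumptions):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

first = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="Say hello.",
)
# Chain a follow-up onto the stored response instead of resending history.
followup = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="Now say it in French.",
    previous_response_id=first.id,
)
print(followup.output_text)
```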
## Test Plan
I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```
```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
tests/integration/openai_responses/test_openai_responses.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:
- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage
The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints
## Test Plan
Set up a Kube cluster using Kind or Minikube.
Run a server with:
```
server:
  port: 8321
  auth:
    provider_type: kubernetes
    config:
      api_server_url: http://url
      ca_cert_path: path/to/cert (optional)
```
Run:
```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```
Or replace "my-user" with your service account.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Implementation of the NeMo Datastore register and unregister APIs.
Open Issues:
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable
see: #1860
Currently I have passed `"provider_id":"nvidia"` in metadata and have
parsed that in `DatasetsRoutingTable`
(Not the best approach, but just a quick workaround to make it work for
now.)
## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items
tests/unit/providers/nvidia/test_datastore.py .. [100%]
============================================================ warnings summary ============================================================
====================================================== 2 passed, 1 warning in 0.84s ======================================================
```
cc: @dglogo, @mattf, @yanxi0830
# What does this PR do?
Recent changes in the repo require some additional functions in the
inference provider, which this PR adds. It also adds one additional
parameter to pass extra arguments to watsonx.ai.
## Test Plan
---------
Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
# What does this PR do?
This addresses 2 bugs I ran into when launching a fine-tuning job with
the NVIDIA Adapter:
1. Session handling in the `_make_request` helper function returns an error.
```
INFO: 127.0.0.1:55831 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error
16:11:45.643 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (270.44ms)
16:11:45.643 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post'
Traceback (most recent call last):
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 201, in endpoint
return await maybe_await(value)
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 161, in maybe_await
return await value
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 408, in supervised_fine_tune
response = await self._make_request(
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 98, in _make_request
async with self.session.request(method, url, params=params, json=json, **kwargs) as response:
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 1425, in __aenter__
self._resp: _RetType = await self._coro
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/client.py", line 579, in _request
handle = tm.start()
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/aiohttp/helpers.py", line 587, in start
return self._loop.call_at(when, self.__call__)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 724, in call_at
self._check_closed()
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 510, in _check_closed
raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
```
Note: This only occurred when initializing the client like so:
```
client = LlamaStackClient(
base_url="http://0.0.0.0:8321"
)
response = client.post_training.supervised_fine_tune(...) # Returns error
```
I didn't run into this issue when using the library client:
```
client = LlamaStackAsLibraryClient("nvidia")
client.initialize()
response = client.post_training.supervised_fine_tune(...) # Works fine
```
2. The `algorithm_config` param in `supervised_fine_tune` is parsed as a
`dict` when run from unit tests, but a Pydantic model when invoked using
the Llama Stack client. So, the call fails outside of unit tests:
```
INFO: 127.0.0.1:54024 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 500 Internal Server Error
21:14:02.315 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (71.18ms)
21:14:02.314 [ERROR] Error executing endpoint route='/v1/post-training/supervised-fine-tune' method='post'
Traceback (most recent call last):
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 205, in endpoint
return await maybe_await(value)
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/distribution/server/server.py", line 164, in maybe_await
return await value
File "/Users/jgulabrai/Projects/forks/llama-stack/llama_stack/providers/remote/post_training/nvidia/post_training.py", line 407, in supervised_fine_tune
"adapter_dim": algorithm_config.get("adapter_dim"),
File "/Users/jgulabrai/Projects/forks/llama-stack/.venv/lib/python3.10/site-packages/pydantic/main.py", line 891, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'LoraFinetuningConfig' object has no attribute 'get'
```
The code assumes `algorithm_config` should be a `dict`, so I just handle
both cases.
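As a minimal sketch of handling both shapes (the field name comes from the traceback above; the surrounding code and helper name are assumptions):
```python
def get_adapter_dim(algorithm_config):
    # algorithm_config may be a plain dict (unit tests) or a Pydantic model
    # such as LoraFinetuningConfig (Llama Stack client); read the field
    # either way.
    if isinstance(algorithm_config, dict):
        return algorithm_config.get("adapter_dim")
    return getattr(algorithm_config, "adapter_dim", None)
```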
## Test Plan
1. I ran a local Llama Stack server with the necessary env vars:
```
llama stack run llama_stack/templates/nvidia/run.yaml --port 8321 --env ...
```
And invoked `supervised_fine_tune` to confirm neither of the errors
above occur.
```
client = LlamaStackClient(
base_url="http://0.0.0.0:8321"
)
response = client.post_training.supervised_fine_tune(...)
```
2. I confirmed the unit tests still pass: `./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_supervised_fine_tuning.py`
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:5002 pytest -s -v \
  tests/integration/inference --safety-shield meta-llama/Llama-Guard-3-8B \
  --vision-model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Adding the `--gpu all` flag to Docker run commands for
meta-reference-gpu distributions ensures models are loaded into the GPU
instead of the CPU.
Remove docs for meta-reference-quantized-gpu.
The distribution was removed in #1887, but these files were left behind.
Fixes: #1798
# What does this PR do?
Fixes docs to add the `--gpu all` flag to the `docker run` command.
Closes #1798
## Test Plan
verified in docker documentation but untested
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
IBM watsonx.ai added as an inference provider
([#1741](https://github.com/meta-llama/llama-stack/issues/1741)).
---------
Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
# What does this PR do?
Enhances the user experience in the `llama stack build` command by
adding interactive TAB completion for image type selection. This ensures
UX consistency with other parts of the CLI that already support tab
completion, such as provider selection, providing a more intuitive and
discoverable interface for users.
<img width="1531" alt="image"
src="https://github.com/user-attachments/assets/12161d45-451d-4820-b34d-7ea4decf810f"
/>
# What does this PR do?
This PR improves the Tools page in the LlamaStack Playground UI by
enhancing the readability of the active tool list shown in the sidebar.
- Previously, active tools were displayed in a flat JSON array with
verbose identifiers (e.g., builtin::code_interpreter:code_interpreter).
- This PR updates the logic to group tools by their toolgroup (e.g.,
builtin::websearch) and renders each tool name in a simplified,
human-readable format (e.g., web_search).
- This change improves usability when working with multiple toolgroups,
especially in configurations involving MCP tools or complex tool
identifiers.
Before and After Comparison:
**Before**

**After**

## Test Plan
- Followed the [LlamaStack UI Developer Setup
instructions](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/distribution/ui)
- Ran the Streamlit UI via: `uv run --with "[.ui]" streamlit run
llama_stack/distribution/ui/app.py`
- Selected multiple built-in toolgroups (e.g., code_interpreter,
websearch, wolfram_alpha) from the sidebar.
# What does this PR do?
Adds custom model registration functionality to `NVIDIAInferenceAdapter`,
which lets inference happen on:
- post-training models
- non-Llama models in the API Catalogue (behind
https://integrate.api.nvidia.com, with endpoints compatible with
AsyncOpenAI)
## Example Usage:
```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient

client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()

client.models.register(
    model_id=model_name,
    model_type=ModelType.llm,
    provider_id="nvidia",
)

response = client.inference.chat_completion(
    model_id=model_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a limerick about the wonders of GPU computing."},
    ],
)
```
## Test Plan
```bash
pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items
tests/unit/providers/nvidia/test_supervised_fine_tuning.py ...... [100%]
============================================================ warnings summary ============================================================
../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076
/home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 6 passed, 1 warning in 1.51s ======================================================
```
Updated README.md
cc: @dglogo, @sumitb, @mattf
# What does this PR do?
This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack
eval module. The integration enables users to evaluate models via the
Llama Stack interface.
## Test Plan
1. Added unit tests and successfully ran from root of project:
`./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py`
```
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED
```
2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd)
llama stack build --template nvidia --image-type venv`
Documentation added to
`llama_stack/providers/remote/eval/nvidia/README.md`
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This expands the `test_sse` test suite and fixes some edge cases with
bugs in our SSE error handling to ensure streaming clients always get a
proper error response.
First, we handle the case where a client disconnects before we actually
start streaming the response back. Previously we only handled the case
where a client disconnected as we were streaming the response, but there
was an edge case where a client disconnecting before we streamed any
response back did not trigger our logic to cleanly handle that
disconnect.
Second, we handle the case where an error is thrown from the server
before the actual async generator gets created from the provider. This
happens in scenarios like the newly merged OpenAI API input validation,
where we eagerly raise validation errors before returning the async
generator object that streams the responses back.
## Test Plan
Tested via:
```
python -m pytest -s -v tests/unit/server/test_sse.py
```
Both test cases failed before, and passed afterwards. The test cases
were written based on me experimenting with actual clients that would do
bad things like randomly disconnect or send invalid input in streaming
mode and I hit these two cases, where things were misbehaving in our
error handling.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Include the tool call details with the chat when doing RAG with remote
vLLM.
Fixes: #1929
With this PR, the tool call is included in the chat returned to vLLM,
and the model (meta-llama/Llama-3.1-8B-Instruct) then returns the answer
as expected.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This PR addresses the content dominance problem that frequently arises
with multiple models when executing queries with the RAG tool. When the
retrieved content is too large, it disproportionately influences the
generation process, causing the model to ignore the original question
and to provide meaningless comments on the retrieved information
instead.
This situation is especially common with agentic RAG, which is the
standard way of doing RAG in Llama Stack, since directly manipulating
the prompt combining the query with the retrieved content is not
possible.
This PR appends a grounding message to the results returned by the
knowledge search tool, reminding the model about the original query and
the purpose of the inference call. This makes the problem significantly
less likely to occur.
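A hedged sketch of the grounding-message idea described above; the exact wording and the insertion point in the knowledge_search tool are assumptions:
```python
def append_grounding_message(results: list[str], query: str) -> list[str]:
    # Remind the model of the original query so large retrieved content does
    # not dominate the generation.
    grounding = (
        f'The above results were retrieved for the query: "{query}". '
        "Use them to answer the original question rather than commenting on the content."
    )
    return [*results, grounding]
```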
## Test Plan
Running the following script before the fix demonstrates the content
dominance problem, where the model insists on commenting on the
retrieved content and refuses to address the question.
Running the script after the fix results in the correct answer.
```
import os
import uuid

from llama_stack_client import Agent, AgentEventLogger, RAGDocument, LlamaStackClient

# the server endpoint
LLAMA_STACK_SERVER_URL = "http://localhost:8321"

# inference settings
MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"
SYSTEM_PROMPT = "You are a helpful assistant. "

# RAG settings
VECTOR_DB_EMBEDDING_MODEL = "all-MiniLM-L6-v2"
VECTOR_DB_EMBEDDING_DIMENSION = 384
VECTOR_DB_CHUNK_SIZE = 512

# initialize the server connection
client = LlamaStackClient(base_url=os.environ.get("LLAMA_STACK_ENDPOINT", LLAMA_STACK_SERVER_URL))

# init the RAG retrieval parameters
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
vector_providers = [
    provider for provider in client.providers.list() if provider.api == "vector_io"
]
vector_provider_to_use = vector_providers[0]

# define and register the document collection to be used
client.vector_dbs.register(
    vector_db_id=vector_db_id,
    embedding_model=VECTOR_DB_EMBEDDING_MODEL,
    embedding_dimension=VECTOR_DB_EMBEDDING_DIMENSION,
    provider_id=vector_provider_to_use.provider_id,
)

# ingest the documents into the newly created document collection
urls = [
    ("https://www.openshift.guide/openshift-guide-screen.pdf", "application/pdf"),
]
documents = [
    RAGDocument(
        document_id=f"num-{i}",
        content=url,
        mime_type=url_type,
        metadata={},
    )
    for i, (url, url_type) in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=VECTOR_DB_CHUNK_SIZE,
)

queries = [
    "How to install OpenShift?",
]

# initializing the agent
agent = Agent(
    client,
    model=MODEL_ID,
    instructions=SYSTEM_PROMPT,
    # we make our agent aware of the RAG tool by including builtin::rag/knowledge_search in the list of tools
    tools=[
        dict(
            name="builtin::rag/knowledge_search",
            args={
                "vector_db_ids": [vector_db_id],  # list of IDs of document collections to consider during retrieval
            },
        )
    ],
)

for prompt in queries:
    print(f"User> {prompt}")

    # create a new turn with a new session ID for each prompt
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=agent.create_session(f"rag-session_{uuid.uuid4()}"),
    )

    # print the response, including tool calls output
    for log in AgentEventLogger().log(response):
        print(log.content, end="")
```
As part of the build process, we now include the generated run.yaml
(based on the provided build configuration file) in the container. We
updated the entrypoint to use this run configuration as well.
Given this simple distribution configuration:
```
# build.yaml
version: '2'
distribution_spec:
  description: Use (an external) Ollama server for running LLM inference
  providers:
    inference:
      - remote::ollama
    vector_io:
      - inline::faiss
    safety:
      - inline::llama-guard
    agents:
      - inline::meta-reference
    telemetry:
      - inline::meta-reference
    eval:
      - inline::meta-reference
    datasetio:
      - remote::huggingface
      - inline::localfs
    scoring:
      - inline::basic
      - inline::llm-as-judge
      - inline::braintrust
    tool_runtime:
      - remote::brave-search
      - remote::tavily-search
      - inline::code-interpreter
      - inline::rag-runtime
      - remote::model-context-protocol
      - remote::wolfram-alpha
  container_image: "registry.access.redhat.com/ubi9"
image_type: container
image_name: test
```
Build it:
```
llama stack build --config build.yaml
```
Run it:
```
podman run --rm \
-p 8321:8321 \
-e OLLAMA_URL=http://host.containers.internal:11434 \
--name llama-stack-server \
localhost/leseb-test:0.2.2
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
When clients called the Open AI API with invalid input that wasn't
caught by our own Pydantic API validation but instead only caught by the
backend inference provider, that backend inference provider was
returning a HTTP 400 error. However, we were wrapping that into a HTTP
500 error, obfuscating the actual issue from calling clients and
triggering OpenAI client retry logic.
This change adjusts our existing `translate_exception` method in
`server.py` to wrap `openai.BadRequestError` as HTTP 400 errors, passing
through the string representation of the error message to the calling
user so they can see the actual input validation error and correct it. I
tried changing this in a few other places, but ultimately
`translate_exception` was the only real place to handle this for both
streaming and non-streaming requests across all inference providers that
use the OpenAI server APIs.
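As a rough sketch of the behavior described, assuming exceptions are mapped to HTTP errors in one place (the names and framework types here are illustrative, not the exact `server.py` code):
```python
import openai
from fastapi import HTTPException

def translate_exception(exc: Exception) -> HTTPException:
    # Surface backend input-validation failures as 400s with their message,
    # instead of wrapping them into opaque 500s that trigger client retries.
    if isinstance(exc, openai.BadRequestError):
        return HTTPException(status_code=400, detail=str(exc))
    return HTTPException(status_code=500, detail="Internal server error")
```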
This also tightens up our validation a bit for the OpenAI chat
completions API, to catch empty `messages` parameters, invalid
`tool_choice` parameters, invalid `tools` items, or passing
`tool_choice` when `tools` isn't given.
Lastly, this extends our OpenAI API chat completions verifications to
also check for consistent input validation across providers. Providers
behind Llama Stack should automatically pass all the new tests due to
the input validation added here, but some of the providers fail this
test when not run behind Llama Stack due to differences in how they
handle input validation and errors.
(Closes #1951)
## Test Plan
To test this, start an OpenAI API verification stack:
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Then, run the new verification tests with your provider(s) of choice:
```
python -m pytest -s -v \
tests/verifications/openai_api/test_chat_completion.py \
--provider openai-llama-stack
python -m pytest -s -v \
tests/verifications/openai_api/test_chat_completion.py \
--provider together-llama-stack
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds the name of the tool that is used by the agent on the
"tools" page of the playground. See image below for an example.

## Test Plan
Run the playground and navigate to the tools page. There users can see
that this additional text is present when tools are invoked and absent
when they are not.
```
streamlit run llama_stack/distribution/ui/app.py
```
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
Previously, when a streaming client would disconnect before we were
finished streaming the entire response, an error like the below would
get raised from the `sse_generator` function in
`llama_stack/distribution/server/server.py`:
```
AttributeError: 'coroutine' object has no attribute 'aclose'. Did you mean: 'close'?
```
This was because we were calling `aclose` on a coroutine instead of the
awaited value from that coroutine. This change fixes that, so that we
save off the awaited value and then can call `aclose` on it if we
encounter an `asyncio.CancelledError`, like we see when a client
disconnects before we're finished streaming.
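For illustration, a minimal, self-contained example of the coroutine-vs-generator distinction behind the bug (not the server's actual code):
```python
import asyncio

async def make_stream():
    # A coroutine that, once awaited, returns an async generator.
    async def gen():
        for i in range(3):
            yield i
            await asyncio.sleep(0.1)
    return gen()

async def main():
    stream = await make_stream()  # the awaited value is the async generator
    try:
        async for item in stream:
            print(item)
    except asyncio.CancelledError:
        # aclose() exists on the async generator, not on the coroutine that
        # produced it -- calling it on the coroutine raises AttributeError.
        await stream.aclose()
        raise

asyncio.run(main())
```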
The other changes in here are to add a simple set of tests for the happy
path of our SSE streaming and this client disconnect path.
That unfortunately requires adding one more dependency into our unit
test section of pyproject.toml since `server.py` requires loading some
of the telemetry code for me to test this functionality.
## Test Plan
I wrote the tests in `tests/unit/server/test_sse.py` first, verified the
client disconnected test failed before my change, and that it passed
afterwards.
```
python -m pytest -s -v tests/unit/server/test_sse.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>