# What does this PR do?
Adds `meta/llama-3.2-1b-instruct` to the list of models that NeMo Customizer
can fine-tune. This is the model our example notebooks typically use for
fine-tuning.
## Test Plan
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
This PR persists the theme preference across page navigation.
Currently, the detected default theme is used, but if a user switches away
from **_the default theme_** and navigates to a new page, the theme will
switch back to the default.
This PR resolves that issue.
## Test Plan
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Fixes: #1955
Since 0.2.0, vLLM receives an empty list (vs. `None` in 0.1.9 and earlier)
when no tools are configured, which causes the issue described in #1955.
This patch avoids sending the `tools` param to vLLM altogether instead of
sending an empty list.
It also adds a small unit test to avoid regressions.
The OpenAI
[specification](https://platform.openai.com/docs/api-reference/chat/create)
does not explicitly state that the list cannot be empty, but I found this
out through experimentation and it might depend on the actual remote vLLM.
In any case, since this parameter is optional, it is best to skip it
altogether when no tools are configured.
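As an illustration of the approach (a minimal sketch with assumed names, not the actual provider code), the `tools` key is only attached when at least one tool is configured, so neither `None` nor an empty list is ever sent:

```python
# Minimal sketch (assumed names): only attach "tools" when at least one tool
# is configured, so vLLM never receives an empty list.
def build_request_params(model: str, messages: list[dict], tools: list[dict] | None = None) -> dict:
    params: dict = {"model": model, "messages": messages}
    if tools:  # skips both None and []
        params["tools"] = tools
    return params


# An empty tool list results in no "tools" key at all:
assert "tools" not in build_request_params("m", [{"role": "user", "content": "hi"}], [])
```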
Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
# What does this PR do?
This PR adds a `max_tokens` slider to the playground tools page. I have
found that in some instances the Llama Stack server throws a 500 error
if the `max_tokens` value is not explicitly set in the agent's
`sampling_params`. This PR reuses the `max_tokens` slider implementation
from the chat page and includes it on the tools page.
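As a rough illustration (assuming the playground page is a Streamlit app; the widget wiring and names here are illustrative, not the actual page code):

```python
# Hypothetical sketch: expose max_tokens as a slider and pass it explicitly,
# so the server never sees an unset value in sampling_params.
import streamlit as st

max_tokens = st.slider("Max Tokens", min_value=1, max_value=4096, value=512)

sampling_params = {
    "strategy": {"type": "greedy"},
    "max_tokens": max_tokens,  # always set explicitly
}
```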
## Test Plan
1. Attempting to call a tool without these changes results in a `500:
Internal server error: An unexpected error occurred`.
2. Attempting to call a tool with these changes results in the expected
output.
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
This PR adds instructions for setting up a vLLM remote endpoint for the
vllm-remote Llama Stack distribution.
## Test Plan
* Verified with manual tests of the configured vllm-remote distribution
against a vLLM endpoint running on a system with an Intel GPU
* Also verified with CI pytests (see command line below). Tests pass in the
same capacity as they do on the A10 Nvidia setup (some tests do fail, which
appears to be a known issue with the vllm-remote Llama Stack distribution)
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=http://localhost:5001 \
--text-model=meta-llama/Llama-3.2-3B-Instruct
```
CC: @ashwinb
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
# What does this PR do?
Allow users to specify only the providers they want in the `llama stack
build` command. If a user wants a non-interactive build but doesn't want
to use a template, `--providers` lets them specify something like
`--providers inference=remote::ollama` for a distro with just Ollama.
## Test Plan
`llama stack build --providers inference=remote::ollama --image-type
venv`
<img width="1084" alt="Screenshot 2025-03-20 at 9 34 14 AM"
src="https://github.com/user-attachments/assets/502b5fa2-edab-4267-a595-4f987204a6a9"
/>
`llama stack run --image-type venv
/Users/charliedoern/projects/Documents/llama-stack/venv-run.yaml`
<img width="1149" alt="Screenshot 2025-03-20 at 9 35 19 AM"
src="https://github.com/user-attachments/assets/433765f3-6b7f-4383-9241-dad085b69228"
/>
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
## Test Plan
```
(myenv) ➜  llama-stack python tests/verifications/generate_report.py --providers fireworks,together,openai --run-tests
```
f27f617629/tests/verifications/REPORT.md
# What does this PR do?
This PR improves the server's request routing logic by ensuring built-in
FastAPI paths such as `/docs`, `/redoc`, `/openapi.json`,
`/favicon.ico`, and `/static` bypass the custom `TracingMiddleware`.
This prevents unnecessary tracing logic for documentation and static
file requests, ensuring better performance and cleaner logs.
Additionally, it adds proper metadata (`title`, `description`, and
`version`) to the FastAPI application initialization and updates the
requirements document accordingly.
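The bypass idea, as a minimal sketch (assumed names, not the actual middleware code):

```python
# Sketch only: requests to FastAPI's built-in paths skip tracing and go
# straight to the app; everything else would be traced as before.
from starlette.types import ASGIApp, Receive, Scope, Send

BYPASS_PREFIXES = ("/docs", "/redoc", "/openapi.json", "/favicon.ico", "/static")


class TracingMiddleware:
    def __init__(self, app: ASGIApp):
        self.app = app

    async def __call__(self, scope: Scope, receive: Receive, send: Send):
        if scope["type"] == "http" and scope["path"].startswith(BYPASS_PREFIXES):
            # No tracing for docs/static requests.
            return await self.app(scope, receive, send)
        # ... start a trace span here for all other requests ...
        return await self.app(scope, receive, send)
```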
Closes #1822
---
## Test Plan
- Ran the server locally with `uvicorn` using the provided `run.yaml`
config
- Verified that:
- FastAPI docs (`/docs`, `/redoc`) load correctly without triggering the
custom tracing middleware
- All other routes still go through the middleware and trace logic
- Application metadata appears as expected in the OpenAPI docs
To reproduce:
1. Start the server with `python server.py --template <template-name>`
2. Navigate to `/docs` and `/redoc`
3. Confirm that no extra trace headers are added for those routes
4. Confirm other API endpoints behave as expected and include
`x-trace-id` in the response headers
---
Froze the requirements file to include many of the other libraries that
have been added in the past few releases to make install easier.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Ollama's CLI supports running models via commands such as `ollama run
llama3.2`. This syntax does not work with the `INFERENCE_MODEL` Llama Stack
variable, as specifying a tag such as `latest` is currently required.
This commit checks whether the `latest` tag of the model is available in
Ollama and uses that model when a user passes a model name without a tag.
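The fallback logic, as an illustrative sketch (function and variable names assumed, not the provider's actual code):

```python
# Sketch: if the requested model has no tag but "<model>:latest" exists in
# Ollama, use the ":latest" variant instead of failing.
def resolve_model_id(requested: str, available: list[str]) -> str:
    if requested in available:
        return requested
    latest = f"{requested}:latest"
    if ":" not in requested and latest in available:
        # e.g. INFERENCE_MODEL=llama3.2 resolves to llama3.2:latest
        return latest
    raise ValueError(
        f"Model '{requested}' is not available in Ollama. Available models: {', '.join(available)}"
    )
```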
## Test Plan
Behavior pre-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO 2025-04-08 13:42:42,842 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking
connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 502, in <module>
main()
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/server/server.py", line 401, in main
impls = asyncio.run(construct_stack(config))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 222, in construct_stack
await register_resources(run_config, impls)
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/stack.py", line 99, in register_resources
await method(**obj.model_dump())
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 294, in register_model
registered_model = await self.register_object(model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 228, in register_object
registered_obj = await register_object_with_provider(obj, p)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 77, in register_object_with_provider
return await p.register_model(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
result = await method(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/nathan/ai/llama-stack/repos/llama-stack/llama_stack/providers/remote/inference/ollama/ollama.py", line 315, in register_model
raise ValueError(
ValueError: Model 'llama3.2' is not available in Ollama. Available models: llama3.2:latest
++ error_handler 108
++ echo 'Error occurred in script at line: 108'
Error occurred in script at line: 108
++ exit 1
```
Behavior post-code change
```bash
$ INFERENCE_MODEL=llama3.2 llama stack build --template ollama --image-type venv --run
...
INFO 2025-04-08 13:58:17,365 llama_stack.providers.remote.inference.ollama.ollama:80 inference: checking
connectivity to Ollama at `http://beanlab1.bss.redhat.com:11434`...
WARNING 2025-04-08 13:58:18,190 llama_stack.providers.remote.inference.ollama.ollama:317 inference: Imprecise provider
resource id was used but 'latest' is available in Ollama - using 'llama3.2:latest'
INFO 2025-04-08 13:58:18,191 llama_stack.providers.remote.inference.ollama.ollama:308 inference: Pulling embedding
model `all-minilm:latest` if necessary...
INFO 2025-04-08 13:58:18,799 __main__:478 server: Listening on ['::', '0.0.0.0']:8321
INFO: Started server process [28378]
INFO: Waiting for application startup.
INFO 2025-04-08 13:58:18,803 __main__:148 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
...
```
## Documentation
Did not document this anywhere, but happy to do so if there is an
appropriate place.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Now a separate thread is started to execute training jobs. Training
requests return a job ID before the job completes, which fixes API
timeouts for any jobs that take longer than a minute.
Note: the scheduler code is meant to be spun out in the future into a
common provider service that can be reused for different APIs and
providers. It is also expected to back the /jobs API proposed here:
https://github.com/meta-llama/llama-stack/discussions/1238
Hence its somewhat generalized form, which is expected to simplify its
adoption elsewhere in the future.
Note: this patch doesn't attempt to implement missing APIs (e.g. cancel
or job removal). This work will belong to follow-up PRs.
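To make the threading model concrete, here is a very small sketch of the idea (names assumed; not the provider's actual scheduler code):

```python
# Sketch: kick the work off on a background thread and hand the caller a
# job ID immediately, instead of blocking until training finishes.
import threading
import uuid


class Scheduler:
    def __init__(self):
        self.jobs: dict[str, str] = {}

    def schedule(self, fn, *args) -> str:
        job_id = f"job-{uuid.uuid4()}"
        self.jobs[job_id] = "running"

        def _run():
            fn(*args)
            self.jobs[job_id] = "completed"

        threading.Thread(target=_run, daemon=True).start()
        return job_id  # returned before the training job finishes
```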
## Test Plan
Added unit tests for the scheduler module. For the API coverage, did
manual testing and was able to run a training cycle on GPU. The initial
call returned job ID before the training completed, as (now) expected.
Artifacts are returned as expected.
```
JobArtifactsResponse(checkpoints=[{'identifier': 'meta-llama/Llama-3.2-3B-Instruct-sft-0', 'created_at': '2025-03-07T22:45:19.892714', 'epoch': 0, 'post_training_job_id': 'test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50', 'path': '/home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0', 'training_metrics': None}], job_uuid='test-job2ee77104-2fd3-4a4e-84cf-f83f8b8f1f50')
```
The integration test is currently disabled for the provider. I will look
into how it can be enabled in a different PR / issue context.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
TL;DR: Changes needed to get the OpenAI API verification tests passing at
100% when run against Llama Stack with the `together`, `fireworks`, and
`openai` providers. And `groq` is better than before, at 88% passing.
This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.
As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.
The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
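As a rough sketch of that recursive clean-up (not the actual helper; Pydantic v2 `model_dump` is assumed):

```python
# Sketch: drop None values and convert Pydantic models to plain dicts at any
# nesting depth, so cleaned params are safe to pass to an OpenAI-style client.
from typing import Any

from pydantic import BaseModel


def clean_params(value: Any) -> Any:
    if isinstance(value, BaseModel):
        return clean_params(value.model_dump(exclude_none=True))
    if isinstance(value, dict):
        return {k: clean_params(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [clean_params(v) for v in value]
    return value
```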
With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.
And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.
## Test Plan
### OpenAI API Verification Tests
I ran the OpenAI API verification tests as below and 100% of the tests
passed.
Start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's no template
set up to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.
First, ensure you have the necessary API key environment variables set:
```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```
Then, run a Llama Stack server that serves up all these providers:
```
llama stack run \
--image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.
```
python tests/verifications/generate_report.py \
--run-tests \
--provider \
together \
fireworks \
groq \
openai \
together-llama-stack \
fireworks-llama-stack \
groq-llama-stack \
openai-llama-stack
```
You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.
### OpenAI Completion Integration Tests with vLLM:
I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on several of the
providers, since I had to adjust the method signature of
`openai_chat_completion` a bit and thus had to touch many of these
providers to match. Here are the tests I ran, all passing:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### OpenAI Completion Integration Tests with ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
### OpenAI Completion Integration Tests with together.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```
### OpenAI Completion Integration Tests with fireworks.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from
5.4.0 to 5.4.1.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/astral-sh/setup-uv/releases">astral-sh/setup-uv's
releases</a>.</em></p>
<blockquote>
<h2>v5.4.1 🌈 Add support for pep440 version specifiers</h2>
<h2>Changes</h2>
<p>With this release you can also use <a
href="https://peps.python.org/pep-0440/#version-specifiers">pep440
version specifiers</a> as <code>required-version</code> in the files
<code>uv.toml</code>, <code>pyproject.toml</code>, and in the
<code>version</code> input:</p>
<pre lang="yaml"><code>- name: Install a pep440-specifier-satisfying
version of uv
uses: astral-sh/setup-uv@v5
with:
version: ">=0.4.25,<0.5"
</code></pre>
<h2>🐛 Bug fixes</h2>
<ul>
<li>Add support for pep440 version identifiers <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/353">#353</a>)</li>
</ul>
<h2>🧰 Maintenance</h2>
<ul>
<li>chore: update known checksums for 0.6.10 @<a
href="https://github.com/apps/github-actions">github-actions[bot]</a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/345">#345</a>)</li>
</ul>
<h2>📚 Documentation</h2>
<ul>
<li>Add pep440 to docs header <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/355">#355</a>)</li>
<li>Fix glob syntax link <a
href="https://github.com/flying-sheep"><code>@flying-sheep</code></a>
(<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/349">#349</a>)</li>
<li>Add link to supported glob patterns <a
href="https://github.com/eifinger"><code>@eifinger</code></a> (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/348">#348</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="0c5e2b8115"><code>0c5e2b8</code></a>
Add pep440 to docs header (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/355">#355</a>)</li>
<li><a
href="794ea9455c"><code>794ea94</code></a>
Add support for pep440 version identifiers (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/353">#353</a>)</li>
<li><a
href="2d49baf2b6"><code>2d49baf</code></a>
chore: update known checksums for 0.6.10 (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/345">#345</a>)</li>
<li><a
href="4fa25599ce"><code>4fa2559</code></a>
Fix glob syntax link (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/349">#349</a>)</li>
<li><a
href="224dce1d79"><code>224dce1</code></a>
Add link to supported glob patterns (<a
href="https://redirect.github.com/astral-sh/setup-uv/issues/348">#348</a>)</li>
<li>See full diff in <a
href="22695119d7...0c5e2b8115">compare
view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
</details>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
# What does this PR do?
Currently, the instructions for Llama 4 take up quite some space before
people can see the overview and other sections about Llama Stack. Moving
them to a collapsed section makes the page less verbose.
workflow -
0. Checkout
1. Install uv
2. Install Ollama
3. Pull Ollama image
4. Start Ollama in background
5. Set Up Environment and Install Dependencies
6. Wait for Ollama to start
7. Start Llama Stack server in background
8. Wait for Llama Stack server to be ready
9. Run Integration Tests
changes -
(4) starts loading the Ollama model; it does not start Ollama itself. The
model will be loaded when used, so this step is removed.
(6) is handled in (2), so this step is removed.
(2) is renamed to reflect its dual purpose.
# What does this PR do?
This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`
The motivation is evaluations targeting a local inference engine
(like meta-reference or vLLM), where batch APIs provide a substantial
amount of acceleration.
Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.
So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.
## Test Plan
Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144
LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```
Then run the batch inference test case.
# What does this PR do?
The current help text for `llama stack build` and `llama stack run` says
that if no argument is passed to `--image-name`, the active Conda
environment will be used.
In reality, the active environment is used whether it comes from Conda,
virtualenv, etc.
## Test Plan
N/A
## Documentation
N/A
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Today, `supervised_fine_tune` itself and the `TrainingConfig` class have a
bunch of required fields that a provider implementation might not need.
For example, if a provider wants to handle hyperparameters in its
configuration, as well as any type of dataset retrieval, optimizer, or
LoRA config, a user would still need to pass in a virtually empty
`DataConfig`, `OptimizerConfig`, and `AlgorithmConfig` in some cases.
Many of these fields are intended to work specifically with Llama models
and knobs intended for customizing inline.
Adding remote post_training providers will require loosening these
arguments, or else forcing users to pass in empty objects to satisfy the
Pydantic models.
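As a rough illustration of the kind of loosening described (field names are assumed for illustration, not a proposal for the exact schema):

```python
# Sketch: sub-configs become optional so remote providers that manage these
# knobs in their own configuration don't force callers to pass empty objects.
from pydantic import BaseModel


class DataConfig(BaseModel):
    dataset_id: str | None = None


class OptimizerConfig(BaseModel):
    lr: float | None = None


class TrainingConfig(BaseModel):
    n_epochs: int = 1
    data_config: DataConfig | None = None  # previously required
    optimizer_config: OptimizerConfig | None = None  # previously required
```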
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Small docs update, plus an update for `start-stack.sh` to fix a missing
color and an if-statement logic error.
# What does this PR do?
1. Makes a small change to start-stack.sh to resolve this error:
```cmd
/home/aireilly/.local/lib/python3.13/site-packages/llama_stack/distribution/start_stack.sh: line 76: [: missing ]'
```
2. Adds a missing $GREEN colour to start-stack.sh
3. Updated `docs/source/getting_started/detailed_tutorial.md` with some
small changes and corrections.
## Test Plan
Procedures described in
`docs/source/getting_started/detailed_tutorial.md` were verified on
Linux Fedora 41.
# What does this PR do?
## Test Plan
---------
Co-authored-by: Francisco Arceo <farceo@redhat.com>
# What does this PR do?
This stubs in some OpenAI server-side compatibility with three new
endpoints:
- /v1/openai/v1/models
- /v1/openai/v1/completions
- /v1/openai/v1/chat/completions
This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .
The two "v1" instances in there isn't awesome, but the thinking is that
Llama Stack's API is v1 and then our OpenAI compatibility layer is
compatible with OpenAI V1. And, some OpenAI clients implicitly assume
the URL ends with "v1", so this gives maximum compatibility.
The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.
The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)
The goal is to support this for every inference provider, proxying
directly to the provider's OpenAI endpoint for OpenAI-compatible
providers. For providers that don't have an OpenAI-compatible API, we'll
add a mixin to translate incoming OpenAI requests to Llama Stack
inference requests and translate the Llama Stack inference responses
back to OpenAI responses.
This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.
## Test Plan
### vLLM
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
## Documentation
Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.
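For example, a minimal Python client call might look like this (the model name is illustrative):

```python
# Point a standard OpenAI client at Llama Stack's OpenAI-compatible base URL.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```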
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Incorporating some feedback into the docs.
- **`docs/source/getting_started/index.md`:**
- Demo actually does RAG now
- Simplified the installation command for dependencies.
- Updated demo script examples to align with the latest API changes.
- Replaced manual document manipulation with `RAGDocument` for clarity
and maintainability.
- Introduced new logic for model and embedding selection using the Llama
Stack Client SDK.
- Enhanced examples to showcase proper agent initialization and logging.
- **`docs/source/getting_started/detailed_tutorial.md`:**
- Updated the section for listing models to include proper code
formatting with `bash`.
- Removed and reorganized the "Run the Demos" section for clarity.
- Adjusted tab-item structures and added new instructions for demo
scripts.
- **`docs/_static/css/my_theme.css`:**
- Updated heading styles to include `h2`, `h3`, and `h4` for consistent
font weight.
- Added a new style for `pre` tags to wrap text and break long words;
this is particularly useful for rendering long output from generation.
## Test Plan
Tested locally. Screenshot for reference:
<img width="1250" alt="Screenshot 2025-04-10 at 10 12 12 PM"
src="https://github.com/user-attachments/assets/ce1c8986-e072-4c6f-a697-ed0d8fb75b34"
/>
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This PR adds unit tests for the NVIDIA Safety provider implementation.
## Test Plan
1. Ran `./scripts/unit-tests.sh
tests/unit/providers/nvidia/test_safety.py` from the root of the
project. Verified tests pass.
```
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_init_nemo_guardrails Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_init_nemo_guardrails_invalid_temperature Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_register_shield_with_valid_id Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_register_shield_without_id Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_run_shield_allowed Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_run_shield_blocked Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_run_shield_http_error Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
tests/unit/providers/nvidia/test_safety.py::TestNVIDIASafetyAdapter::test_run_shield_not_found Initializing NVIDIASafetyAdapter(http://nemo.test)...
PASSED
```
---------
Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
# What does this PR do?
The supervised_fine_tune method in NvidiaPostTrainingAdapter had some
extra args that aren't part of the post_training protocol, and these
extra args were causing FastAPI to throw an error when attempting to
stand up an endpoint that used this provider.
(Closes #1938)
## Test Plan
Before this change, bringing up a stack with the `nvidia` template
failed. Afterwards, it passes. I'm testing this like:
```
INFERENCE_MODEL="meta/llama-3.1-8b-instruct" \
llama stack build --template nvidia --image-type venv --run
```
I also ensured the nvidia/test_supervised_fine_tuning.py tests still
pass via:
```
python -m pytest \
tests/unit/providers/nvidia/test_supervised_fine_tuning.py
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR makes it possible to switch between agentic and non-agentic RAG
when running the respective Playground page.
When non-agentic RAG is selected, user queries are answered by directly
querying the vector DB, augmenting the prompt, and sending the extended
prompt to the model via Inference API.
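As a rough sketch of the non-agentic path (the two callables are stand-ins for the vector DB query and the Inference API call, not the actual Playground code):

```python
# Sketch: retrieve chunks for the query, build an augmented prompt, and send
# it directly to the model, bypassing the agent entirely.
from typing import Callable


def answer_without_agent(
    query: str,
    retrieve: Callable[[str], list[str]],  # vector DB lookup stand-in
    generate: Callable[[str], str],        # Inference API call stand-in
) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = f"Use only this context to answer.\n\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```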
## Test Plan
- Launch the Playground and go to the RAG page;
- Select the vector DB ID;
- Adjust other configuration parameters if necessary;
- Set the radio button to Agent-based RAG;
- Send a message to the chat;
- The query will be answered by an agent using the knowledge search tool
as indicated by the output;
- Click the 'Clear Chat' button to make it possible to switch modes;
- Send a message to the chat again;
- This time, the query will be answered by the model directly as can be
deduced from the reply.
# What does this PR do?
Closes https://github.com/meta-llama/llama-stack/issues/1586
This issue arises when loading an mcp_endpoint from run.yaml. The issue
does not manifest for MCP servers added via a running distro server; the
existing tests only cover the case of adding to a running server.
The code for loading run.yaml strips type information from mcp_endpoint,
passing `{"uri": ...}` instead of `URL(uri=...)` along to the resource
provider registration.
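The shape of a fix is roughly the following (a hedged sketch, not the literal patch; the `URL` import path is taken from the diff in the test plan below):

```python
# Sketch: coerce a plain dict loaded from run.yaml back into a URL before
# handing it to provider registration.
from llama_stack.apis.common.content_types import URL


def coerce_mcp_endpoint(endpoint):
    if isinstance(endpoint, dict):
        return URL(uri=endpoint["uri"])
    return endpoint
```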
## Test Plan
1. run an mcp server
2. add an mcp tool config to the dev.py, e.g.
```
diff --git a/llama_stack/templates/dev/dev.py b/llama_stack/templates/dev/dev.py
index 69924acb..e0dc7189 100644
--- a/llama_stack/templates/dev/dev.py
+++ b/llama_stack/templates/dev/dev.py
@@ -6,6 +6,8 @@
from typing import List, Tuple
+from llama_stack.apis.common.content_types import URL
+
from llama_stack.apis.models.models import ModelType
from llama_stack.distribution.datatypes import (
ModelInput,
@@ -154,6 +156,11 @@ def get_distribution_template() -> DistributionTemplate:
toolgroup_id="builtin::code_interpreter",
provider_id="code-interpreter",
),
+ ToolGroupInput(
+ toolgroup_id="mcp::filesystem",
+ provider_id="model-context-protocol",
+ mcp_endpoint=URL(uri="http://localhost:8002/sse"),
+ ),
]
embedding_model = ModelInput(
model_id="all-MiniLM-L6-v2",
```
3. run distro_codegen.py
4. llama stack build --template dev --run
Before this PR, `llama stack run` would fail with `AttributeError:
'dict' object has no attribute 'uri'`; after it, the command succeeds.
# What does this PR do?
Fixes a couple of errors in PVC/Secret setup and adds context for
expected Hugging Face token
## Test Plan
# What does this PR do?
## Test Plan
```
(myenv) ➜  llama-stack python tests/verifications/generate_report.py --providers fireworks,together,openai --run-tests
```
# What does this PR do?
Another doc enhancement for
https://github.com/meta-llama/llama-stack/issues/1818
Summary of changes:
- `docs/source/distributions/configuration.md`
- Updated dropdown title to include a more user-friendly description.
- `docs/_static/css/my_theme.css`
- Added styling for `<h3>` elements to set a normal font weight.
- `docs/source/distributions/starting_llama_stack_server.md`
- Changed section headers from bold text to proper markdown headers
(e.g., `##`).
- Improved descriptions for starting Llama Stack server using different
methods (library, container, conda, Kubernetes).
- Enhanced clarity and structure by converting instructions into
markdown headers and improved formatting.
- `docs/source/getting_started/index.md`
- Major restructuring of the "Quick Start" guide:
- Added new introductory section for Llama Stack and its capabilities.
- Reorganized steps into clearer subsections with proper markdown
headers.
- Replaced dropdowns with tabbed content for OS-specific instructions.
- Added detailed steps for setting up and running the Llama Stack server
and client.
- Introduced new sections for running basic inference and building
agents.
- Enhanced readability and visual structure with emojis, admonitions,
and examples.
- `docs/source/providers/index.md`
- Updated the list of LLM inference providers to include "Ollama."
- Expanded the list of vector databases to include "SQLite-Vec."
## Test Plan
Renders locally, included screenshot.
## Documentation
For https://github.com/meta-llama/llama-stack/issues/1818
<img width="1332" alt="Screenshot 2025-04-09 at 11 07 12 AM"
src="https://github.com/user-attachments/assets/c106efb9-076c-4059-a4e0-a30fa738585b"
/>
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
While building the "experimental-post-training" distribution, we
encountered a version conflict between torchao with inference requiring
version 0.5.0 and training currently depending on version 0.8.0.
Resolves this error:
```
× No solution found when resolving dependencies:
╰─▶ Because you require torchao==0.5.0 and torchao==0.8.0, we can conclude that your requirements are unsatisfiable.
ERROR 2025-04-10 10:41:22,597 llama_stack.distribution.build:128 uncategorized: Failed to build target test with
return code 1
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR fixes two issues with the RAG page of the Playground UI:
1. When the user modifies a configurable setting via a widget (e.g.,
system prompt, temperature, etc.), the agent is not recreated. Thus, the
change has no effect and the user gets no indication of that.
2. After the first issue is fixed, it becomes possible to recreate the
agent mid-conversation or even mid-generation. To mitigate this, widgets
related to agent configuration are now disabled when a conversation is
in progress (i.e., when the chat is non-empty). They are automatically
enabled again when the user resets the chat history.
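An illustrative Streamlit snippet of the gating idea (widget names assumed, not the actual page code):

```python
# Sketch: configuration widgets are disabled while the chat history is
# non-empty, and become editable again once the user clears the chat.
import streamlit as st

chat_in_progress = len(st.session_state.get("messages", [])) > 0

temperature = st.sidebar.slider(
    "Temperature", 0.0, 1.0, 0.7, disabled=chat_in_progress
)

if st.sidebar.button("Clear Chat"):
    st.session_state["messages"] = []  # widgets re-enable on the next rerun
```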
## Test Plan
- Launch the Playground and go to the RAG page;
- Select the vector DB ID;
- Send a message to the agent via the chat;
- The widgets in charge of the agent parameters will become disabled at
this point;
- Send a second message asking the model about the content of the first
message;
- The reply will indicate that the two messages were sent over the same
session, that is, the agent was not recreated;
- Click the 'Clear Chat' button;
- All widgets will be enabled and a new agent will be created (which can
be validated by sending another message).
# What does this PR do?
- Providers and their models now live in config.yaml
- Better distinguish different cases within a test
- Add a model key to surface the provider's model_id
- Include an example command to rerun a single test case
## Test Plan
<img width="1173" alt="image"
src="https://github.com/user-attachments/assets/b414baf0-c768-451f-8c3b-c2905cf36fac"
/>
Mirrors https://github.com/meta-llama/llama-models/pull/324 with some
cleanup.
```
with-proxy pip install -e .
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct
export QUANTIZATION_TYPE=int4_mixed
with-proxy llama stack build --run --template meta-reference-gpu
```
# What does this PR do?
## Test Plan
# What does this PR do?
* Manage UI deps in pyproject
* Use a new "ui" dep group to pull the deps with "uv"
* Simplify the run command
* Bump versions in requirements.txt
Signed-off-by: Sébastien Han <seb@redhat.com>