# What does this PR do?
This PR adds instructions for setting up a vLLM remote endpoint for the remote-vllm Llama
Stack distribution.
## Test Plan
* Verified with manual tests of the configured remote-vllm distribution against a vLLM
endpoint running on a system with an Intel GPU
* Also verified with CI pytests (see command line below). The tests pass to the
same extent as they do on the A10 NVIDIA setup (a few tests fail, which
appears to be a known issue with the remote-vllm Llama Stack
distribution)
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=http://localhost:5001 \
--text-model=meta-llama/Llama-3.2-3B-Instruct
```
CC: @ashwinb
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
# What does this PR do?
TL;DR: Changes needed to get the OpenAI API verification tests passing at
100% when run against Llama Stack with the `together`, `fireworks`, and
`openai` providers. `groq` is also better than before, at 88% passing.
This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.
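For illustration, here is a minimal sketch of what such `image_url` content-part models can look like (the class names are made up; only the field shapes follow the OpenAI spec):
```python
from typing import Literal, Optional
from pydantic import BaseModel

class ImageURL(BaseModel):
    url: str  # http(s) URL or data: URI of the image
    detail: Optional[Literal["low", "high", "auto"]] = None

class ImageURLContentPart(BaseModel):
    type: Literal["image_url"] = "image_url"
    image_url: ImageURL
```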
As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.
The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
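As a rough illustration (not the actual implementation), the recursive cleanup can look something like this:
```python
from pydantic import BaseModel

def clean_params(value):
    """Recursively drop None values and dump Pydantic models to plain dicts."""
    if isinstance(value, BaseModel):
        return clean_params(value.model_dump(exclude_none=True))
    if isinstance(value, dict):
        return {k: clean_params(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [clean_params(v) for v in value]
    return value
```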
With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.
And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.
## Test Plan
### OpenAI API Verification Tests
I ran the OpenAI API verification tests as below and 100% of the tests
passed.
First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There isn't a template
set up to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` for it.
Ensure you have the necessary API key environment variables set:
```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```
Then, run a Llama Stack server that serves up all these providers:
```
llama stack run \
--image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.
```
python tests/verifications/generate_report.py \
--run-tests \
--provider \
together \
fireworks \
groq \
openai \
together-llama-stack \
fireworks-llama-stack \
groq-llama-stack \
openai-llama-stack
```
You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.
### OpenAI Completion Integration Tests with vLLM:
I also ran the smaller `test_openai_completion.py` test suite (which is
not yet merged with the verification tests) on several of the
providers, since I had to adjust the method signature of
`openai_chat_completion` a bit and thus had to touch many of these
providers to match. Here are the tests I ran, all passing:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### OpenAI Completion Integration Tests with ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
### OpenAI Completion Integration Tests with together.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```
### OpenAI Completion Integration Tests with fireworks.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
The current help text for `llama stack build` and `llama stack run` says that if
no argument is passed to `--image-name`, the active Conda
environment will be used.
In reality, the active environment is used whether it comes from conda,
virtualenv, etc.
## Test Plan
N/A
## Documentation
N/A
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
## Test Plan
---------
Co-authored-by: Francisco Arceo <farceo@redhat.com>
# What does this PR do?
Fixes a couple of errors in PVC/Secret setup and adds context for
expected Hugging Face token
## Test Plan
# What does this PR do?
Another doc enhancement for
https://github.com/meta-llama/llama-stack/issues/1818
Summary of changes:
- `docs/source/distributions/configuration.md`
- Updated dropdown title to include a more user-friendly description.
- `docs/_static/css/my_theme.css`
- Added styling for `<h3>` elements to set a normal font weight.
- `docs/source/distributions/starting_llama_stack_server.md`
- Changed section headers from bold text to proper markdown headers
(e.g., `##`).
- Improved descriptions for starting Llama Stack server using different
methods (library, container, conda, Kubernetes).
- Enhanced clarity and structure by converting instructions into
markdown headers and improved formatting.
- `docs/source/getting_started/index.md`
- Major restructuring of the "Quick Start" guide:
- Added new introductory section for Llama Stack and its capabilities.
- Reorganized steps into clearer subsections with proper markdown
headers.
- Replaced dropdowns with tabbed content for OS-specific instructions.
- Added detailed steps for setting up and running the Llama Stack server
and client.
- Introduced new sections for running basic inference and building
agents.
- Enhanced readability and visual structure with emojis, admonitions,
and examples.
- `docs/source/providers/index.md`
- Updated the list of LLM inference providers to include "Ollama."
- Expanded the list of vector databases to include "SQLite-Vec."
Let me know if you need further details!
## Test Plan
Renders locally, included screenshot.
## Documentation
For https://github.com/meta-llama/llama-stack/issues/1818
<img width="1332" alt="Screenshot 2025-04-09 at 11 07 12 AM"
src="https://github.com/user-attachments/assets/c106efb9-076c-4059-a4e0-a30fa738585b"
/>
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Add content on using an AMD GPU for the vLLM server. Split the original
section into two sub-chapters:
1. AMD vLLM server
2. NVIDIA vLLM server (original)
# What does this PR do?
## Test Plan
---------
Signed-off-by: Alex He <alehe@amd.com>
# What does this PR do?
## Test Plan
```
export MODEL=accounts/fireworks/models/llama4-scout-instruct-basic
LLAMA_STACK_CONFIG=verification pytest -s -v tests/integration/inference \
  --vision-model $MODEL --text-model $MODEL
```
# What does this PR do?
The goal of this PR is to make the pages easier to navigate by surfacing
the child pages on the navbar, updating some of the copy, and moving some
of the files around.
Some changes:
1. Clarifying Titles
2. Restructuring "Distributions" more formally in its own page to be
consistent with Providers and adding some clarity to the child pages to
surface them and make them easier to navigate
3. Updated sphinx config to not collapse navigation by default
4. Updated copyright year to be calculated dynamically
5. Moved `docs/source/distributions/index.md` ->
`docs/source/distributions/starting_llama_stack_server.md`
Another for https://github.com/meta-llama/llama-stack/issues/1815
## Test Plan
Tested locally and the pages build (screenshots below).
## Documentation
### Before:
(screenshot attached to the PR)
### After:
(screenshot attached to the PR)
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This is the second attempt to switch to system packages by default. Now
with a hack to detect conda environment - in which case conda image-type
is used.
Note: Conda will only be used when --image-name is unset *and*
CONDA_DEFAULT_ENV is set. This means that users without conda will
correctly fall back to using system packages when no --image-* arguments
are passed at all.
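A simplified sketch of that fallback logic (illustrative only; the function name is made up and this is not the actual CLI code):
```python
import os

def resolve_image_type(image_name: str | None, image_type: str | None) -> str:
    if image_type is not None:
        return image_type  # an explicit --image-type always wins
    if image_name is None and os.environ.get("CONDA_DEFAULT_ENV"):
        return "conda"  # hack: an active conda environment was detected
    return "system"  # default: environment (system) packages
```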
## Test Plan
Uses virtualenv:
```
$ llama stack build --template ollama --image-type venv
$ llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using virtual environment: /home/ec2-user/src/llama-stack/schedule/.local
[...]
```
Uses system packages (virtualenv already initialized):
```
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
INFO 2025-03-27 20:46:22,882 llama_stack.cli.stack.run:142 server: No image type or image name provided. Assuming environment packages.
[...]
```
Attempt to run from environment packages without necessary packages
installed:
```
$ python -m venv barebones
$ . ./barebones/bin/activate
$ pip install -e . # to install llama command
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
ModuleNotFoundError: No module named 'fastapi'
```
^ failed as expected because the environment doesn't have necessary
packages installed.
Now install some packages in the new environment:
```
$ pip install fastapi opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp aiosqlite ollama openai datasets faiss-cpu mcp autoevals
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
Now see if setting CONDA_DEFAULT_ENV will change what happens by
default:
```
$ export CONDA_DEFAULT_ENV=base
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using conda environment: base
Conda environment base does not exist.
[...]
```
---------
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
- hide distro doc (docker needs to be thoroughly tested).
## Test Plan
- docs
# What does this PR do?
* Fix the location of `run.yaml` relative to the cloned Llama Stack
repository
* Drop `-it` from `docker run` commands, as it's not needed when running
services
## Test Plan
* Verified running the llama stack following updated instruction
CC: @ashwinb
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
# What does this PR do?
This PR adds support for NVIDIA's NeMo Customizer API to the Llama Stack
post-training module. The integration enables users to fine-tune models
using NVIDIA's cloud-based customization service through a consistent
Llama Stack interface.
## Test Plan
Things pending under this PR:
- [x] Integration of the fine-tuned model (new checkpoint) for inference with
the NVIDIA LLM distribution
- [x] Distribution integration of the API
- [x] Add test cases for the customizer (in progress)
- [x] Documentation
```
LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/post_training/test_supervised_fine_tuning.py
============================================================================================================================================================================ test session starts =============================================================================================================================================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0 -- /home/ubuntu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.0', 'Platform': 'Linux-6.8.0-1021-gcp-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'nbval': '0.11.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3'}}
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: nbval-0.11.0, metadata-3.1.1, anyio-4.8.0, html-4.1.1, asyncio-0.25.3
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_post_training_provider_registration[txt=8B] PASSED [ 50%]
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_list_training_jobs[txt=8B] PASSED [100%]
======================================================================================================================================================================== 2 passed, 1 warning in 0.10s ========================================================================================================================================================================
```
cc: @mattf @dglogo @sumitb
---------
Co-authored-by: Ubuntu <ubuntu@llama-stack-customizer-dev-inst-2tx95fyisatvlic4we8hidx5tfj.us-central1-a.c.brevdevprod.internal>
# What does this PR do?
* Removes the use of `huggingface-cli`
* Simplifies HF cache mount path
* Simplifies vLLM server startup command
* Separates PVC/secret creation from deployment/service
* Fixes a typo: "pod" should be "deployment"
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
Web updates to point to latest releases for Mobile SDK
- point to `latest-release` branch for mobile sdk repos to minimize the
number of change points on the site.
- updates to some instructions
# What does this PR do?
- Fix issue w/ passthrough provider
## Test Plan
llama stack run
# What does this PR do?
Remove Llama-3.2-1B-Instruct for Fireworks, as it no longer appears to
be hosted on the website.
## Test Plan
python distro_codegen.py
# What does this PR do?
Setting `$LLAMA_STACK_LOG_FILE` will pipe the logs to a file as well as
stdout. This is done by using a logging `FileHandler`.
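A minimal sketch of that behavior (illustrative, not the exact code):
```python
import logging
import os
import sys

# Always log to stdout; mirror to a file when LLAMA_STACK_LOG_FILE is set.
handlers: list[logging.Handler] = [logging.StreamHandler(sys.stdout)]
if log_file := os.environ.get("LLAMA_STACK_LOG_FILE"):
    handlers.append(logging.FileHandler(log_file))
logging.basicConfig(level=logging.INFO, handlers=handlers)
```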
Signed-off-by: Charlie Doern <cdoern@redhat.com>
This is unfortunate because `sqlite-vec` seems promising. But its pip
package is not quite complete: it does not have a binary for arm64 (or
maybe it even lacks 64-bit builds altogether?), which results in the arm64
container failing with
```
File "/usr/local/lib/python3.10/site-packages/sqlite_vec/init.py", line 17, in load
conn.load_extension(loadable_path())
sqlite3.OperationalError: /usr/local/lib/python3.10/site-packages/sqlite_vec/vec0.so: wrong ELF class: ELFCLASS32
```
To get around this, I tried to install from source via `uv pip install
sqlite-vec --no-binary=sqlite-vec`; however, it even lacks a source
distribution, which makes that impossible.
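For reference, a quick way to confirm the ELF class of the bundled extension (using the path from the error above):
```python
# EI_CLASS is byte 4 of the ELF header: 1 == ELFCLASS32, 2 == ELFCLASS64.
path = "/usr/local/lib/python3.10/site-packages/sqlite_vec/vec0.so"
with open(path, "rb") as f:
    header = f.read(5)
assert header[:4] == b"\x7fELF", "not an ELF file"
print("64-bit" if header[4] == 2 else "32-bit")
```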
## Test Plan
Build the container locally using:
```bash
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type container
```
Run the container as:
```
podman run --privileged -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.containers.internal:11434 \
-v ~/local/llama-stack:/app/llama-stack-source \
localhost/distribution-ollama:dev --port $LLAMA_STACK_PORT
```
Verify the container starts up correctly. Without this patch, it would
encounter the ELFCLASS32 error.
# What does this PR do?
This should have been changed in this PR:
https://github.com/meta-llama/llama-stack/pull/1190/files#diff-53e3f35ced54ee5e57dc8b0d3b04770ed84f2f6434c6f492f42569b3c2810ecd
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
Using `formatter_class=argparse.ArgumentDefaultsHelpFormatter` displays
`(default: DEFAULT_VALUE)` for each flag. Add this formatter class to
`build` and `run` to show users default values like `conda`, `8321`,
etc.
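A self-contained example of the effect (standard library only; the flag shown is just for illustration):
```python
import argparse

parser = argparse.ArgumentParser(
    prog="llama stack run",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
    "--port", type=int, default=8321, help="Port to run the server on."
)
parser.print_help()  # the --port help line now ends with "(default: 8321)"
```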
## Test Plan
ran locally with following output:
before:
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
[--image-type {conda,container,venv}]
config
Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
positional arguments:
config Path to config file to use for the run
options:
-h, --help show this help message and exit
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. Defaults to 8321
--image-name IMAGE_NAME
Name of the image to run. Defaults to the current conda environment
--disable-ipv6 Disable IPv6 support
--env KEY=VALUE Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times.
--tls-keyfile TLS_KEYFILE
Path to TLS key file for HTTPS
--tls-certfile TLS_CERTFILE
Path to TLS certificate file for HTTPS
--image-type {conda,container,venv}
Image Type used during the build. This can be either conda or container or venv.
```
after:
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
[--image-type {conda,container,venv}]
config
Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
positional arguments:
config Path to config file to use for the run
options:
-h, --help show this help message and exit
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. (default: 8321)
--image-name IMAGE_NAME
Name of the image to run. Defaults to the current conda environment (default: None)
--disable-ipv6 Disable IPv6 support (default: False)
--env KEY=VALUE Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: [])
--tls-keyfile TLS_KEYFILE
Path to TLS key file for HTTPS (default: None)
--tls-certfile TLS_CERTFILE
Path to TLS certificate file for HTTPS (default: None)
--image-type {conda,container,venv}
Image Type used during the build. This can be either conda or container or venv. (default: conda)
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The client output has changed, so the documented output is incorrect:
458e20702b/src/llama_stack_client/lib/cli/models/models.py (L52)
Per the previous discussion in
https://github.com/meta-llama/llama-stack/pull/1348#pullrequestreview-2654971315,
there is no need to maintain the output, so remove it.
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
Currently, logcat is not documented for build && run. Add documentation
in `building_distro.md`.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
All of the tests from `llama_stack/providers/tests/` are now moved to
`tests/integration`.
I converted the `tools`, `scoring`, and `datasetio` tests to use the API.
However, `eval` and `post_training` proved to be a bit challenging, so I
am leaving those. I think `post_training` should be relatively
straightforward too.
As part of this, I noticed that `wolfram_alpha` tool wasn't added to
some of our commonly used distros so I added it. I am going to remove a
lot of code duplication from distros next so while this looks like a
one-off right now, it will go away and be there uniformly for all
distros.
# What does this PR do?
## Test Plan
---------
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
It would be better to mention env var usage in the help text.
```
before:
$ llama stack run --help
--port PORT Port to run the server on. Defaults to 8321
after
$ llama stack run --help
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. Defaults to 8321
```
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
21ec67356c/distributions
The link was missing the `s`.
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
Since the `--downloaded` option has been released, update the related
documents.
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
55eb257459/llama_stack/cli/stack/run.py (L22)
55eb257459/llama_stack/cli/stack/_build.py (L103)
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
Each model known to the system has two identifiers:
- the `provider_resource_id` (what the provider calls it) -- e.g.,
`accounts/fireworks/models/llama-v3p1-8b-instruct`
- the `identifier` (`model_id`) under which it is registered and gets
routed to the appropriate provider.
We have so far used the HuggingFace repo alias as the standardized
identifier you can use to refer to the model. So in the above example,
we'd use `meta-llama/Llama-3.1-8B-Instruct` as the name under which it
gets registered. This makes it convenient for users to refer to these
models across providers.
However, we forgot to also register the _actual_ provider model ID. You
should, of course, be able to route via the `provider_resource_id` as well.
This change fixes this (somewhat grave) omission.
*Note*: this change is additive -- more aliases work now compared to
before.
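Concretely, using the fireworks example above, both of the names below should now resolve to the same underlying provider model (the mapping is illustrative):
```python
# Either alias works as the model identifier after this change.
aliases = {
    "meta-llama/Llama-3.1-8B-Instruct":
        "accounts/fireworks/models/llama-v3p1-8b-instruct",
    "accounts/fireworks/models/llama-v3p1-8b-instruct":
        "accounts/fireworks/models/llama-v3p1-8b-instruct",
}
```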
## Test Plan
Run the following for distro=(ollama fireworks together)
```
LLAMA_STACK_CONFIG=$distro \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=meta-llama/Llama-3.1-8B-Instruct --vision-inference-model=""
```
Groq has never supported raw completions anyhow, so this makes it easier
to switch it to LiteLLM. Our whole test suite passes.
I also updated all the openai-compat providers so they work with API
keys passed from headers via `provider_data`.
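As a sketch, a client can now supply a provider API key per request through the provider-data header (the header name and key field here are assumptions; check the provider's documentation):
```python
import json

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(
    base_url="http://localhost:8321",
    # Assumed header name and payload shape for passing provider_data.
    default_headers={
        "X-LlamaStack-Provider-Data": json.dumps({"groq_api_key": "gsk_..."}),
    },
)
```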
## Test Plan
```bash
LLAMA_STACK_CONFIG=groq \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=groq/llama-3.3-70b-versatile --vision-inference-model=""
```
Also tested (openai, anthropic, gemini) providers. No regressions.
# What does this PR do?
Model context protocol (MCP) allows for remote tools to be connected
with Agents. The current Ollama provider does not support it. This PR
adds necessary code changes to ensure that the integration between
Ollama backend and MCP works.
This PR is an extension of #816 for Ollama.
## Test Plan
1. Run llama-stack server with the command:
```
llama stack build --template ollama --image-type conda
llama stack run ./templates/ollama/run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
```
2. Run the sample client agent with MCP tool:
```
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.types.shared_params.url import URL
from llama_stack_client import LlamaStackClient
from termcolor import cprint

## Start the local MCP server
# git clone https://github.com/modelcontextprotocol/python-sdk
# Follow instructions to get the env ready
# cd examples/servers/simple-tool
# uv run mcp-simple-tool --transport sse --port 8000

# Connect to the llama stack server
base_url = "http://localhost:8321"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
client = LlamaStackClient(base_url=base_url)

# Register MCP tools
client.toolgroups.register(
    toolgroup_id="mcp::filesystem",
    provider_id="model-context-protocol",
    mcp_endpoint=URL(uri="http://localhost:8000/sse"),
)

# Define an agent with the MCP toolgroup
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    toolgroups=["mcp::filesystem"],
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Fetch content from https://www.google.com and print the response",
]

# Run a session with the agent
session_id = agent.create_session("test-session")
for prompt in user_prompts:
    cprint(f"User> {prompt}", "green")
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()
```
## Documentation
The file docs/source/distributions/self_hosted_distro/ollama.md is
updated to indicate the MCP tool runtime availability.
Signed-off-by: Shreyanand <shanand@redhat.com>
# What does this PR do?
Create a distribution template using Groq as inference provider.
Link to issue: https://github.com/meta-llama/llama-stack/issues/958
## Test Plan
Run `python llama_stack/scripts/distro_codegen.py` to generate run.yaml
and build.yaml
Test the newly created template by running
`llama stack build --template <template-name>`
`llama stack run <template-name>`
Embedding models are tiny and can be pulled on-demand. Let's do that so
the user doesn't have to do "yet another thing" to get themselves set
up.
Thanks @hardikjshah for the suggestion.
Also fixed a build dependency miss (TODO: distro_codegen needs to
actually check that the build template contains all providers mentioned
for the run.yaml file)
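The pull-if-missing behavior amounts to something like this sketch (illustrative; it shells out to the `ollama` CLI rather than showing the provider's internal code):
```python
import subprocess

def ensure_embedding_model(model: str = "all-minilm:latest") -> None:
    """Pull the (tiny) embedding model only if it isn't installed yet."""
    installed = subprocess.run(
        ["ollama", "list"], capture_output=True, text=True, check=True
    ).stdout
    if model not in installed:
        print(f"Pulling embedding model `{model}`")
        subprocess.run(["ollama", "pull", model], check=True)
```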
## Test Plan
First run `ollama rm all-minilm:latest`.
Run `llama stack build --template ollama && llama stack run ollama --env
INFERENCE_MODEL=llama3.2:3b-instruct-fp16`. See that it outputs a
"Pulling embedding model `all-minilm:latest`" output and the stack
starts up correctly. Verify that `ollama list` shows the model is
correctly downloaded.
# What does this PR do?
add /v1/inference/embeddings implementation to NVIDIA provider
**open topics** -
- *asymmetric models*. NeMo Retriever includes asymmetric models, which
are models that embed differently depending on whether the input is destined
for storage or lookup against storage. the /v1/inference/embeddings api
does not allow the user to indicate the type of embedding to perform.
see https://github.com/meta-llama/llama-stack/issues/934
- *truncation*. embedding models typically have a limited context
window, e.g. 1024 tokens is common though newer models have 8k windows.
when the input is larger than this window the endpoint cannot perform
its designed function. two options: 0. return an error so the user can
reduce the input size and retry; 1. perform truncation for the user and
proceed (common strategies are left or right truncation). many users
encounter context window size limits and will struggle to write reliable
programs. this struggle is especially acute without access to the
model's tokenizer. the /v1/inference/embeddings api does not allow the
user to delegate truncation policy. see
https://github.com/meta-llama/llama-stack/issues/933
- *dimensions*. "Matryoshka" embedding models are available. they allow
users to control the number of embedding dimensions the model produces.
this is a critical feature for managing storage constraints. embeddings
of 1024 dimensions that achieve 95% recall for an application may not be
worth the storage cost if 512 dimensions can achieve 93% recall.
controlling embedding dimensions allows applications to determine their
recall and storage tradeoffs. the /v1/inference/embeddings api does not
allow the user to control the output dimensions. see
https://github.com/meta-llama/llama-stack/issues/932 (a hypothetical
sketch of these missing knobs follows this list)
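to make the three gaps concrete, here is a hypothetical call showing the knobs the current api lacks (none of these parameters exist today; the names are invented for illustration):
```python
# Hypothetical parameters only -- not part of the current
# /v1/inference/embeddings API (see issues #934, #933, #932).
response = client.inference.embeddings(
    model_id="baai/bge-m3",
    contents=["the quick brown fox"],
    # input_type="query",   # asymmetric models: "query" vs "passage"
    # truncation="right",   # delegate truncation policy to the server
    # dimensions=512,       # Matryoshka models: trade recall for storage
)
```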
## Test Plan
- `llama stack run llama_stack/templates/nvidia/run.yaml`
- `LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v
tests/client-sdk/inference/test_embedding.py --embedding-model
baai/bge-m3`
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>