llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-28 00:01:59 +00:00

Author	SHA1	Message	Date
Varsha	cfee63bd0d	feat: Add search_mode support to OpenAI vector store API (#2500) # What does this PR do? Add search_mode parameter (vector/keyword/hybrid) to openai_search_vector_store method. Fixes OpenAPI code generation by using str instead of Literal type. Closes: #2459 ## Test Plan Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>	2025-06-24 20:38:47 -04:00
ehhuang	d3b60507d7	feat: support auth attributes in inference/responses stores (#2389) # What does this PR do? Inference/Response stores now store user attributes when inserting, and respect them when fetching. ## Test Plan pytest tests/unit/utils/test_sqlstore.py	2025-06-20 10:24:45 -07:00
Ben Browning	f394c7f2d9	feat: Add missing Vector Store Files API surface (#2468) # What does this PR do? This adds the ability to list, retrieve, update, and delete Vector Store Files. It implements these new APIs for the faiss and sqlite-vec providers, since those are the two that also have the rest of the vector store files implementation. Closes #2445 ## Test Plan ### test_openai_vector_stores Integration Tests There are a number of new integration tests added, which I ran for each provider as outlined below. faiss (from ollama distro): ``` INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \ llama stack run llama_stack/templates/ollama/run.yaml LLAMA_STACK_CONFIG=http://localhost:8321 \ pytest -sv tests/integration/vector_io/test_openai_vector_stores.py \ --embedding-model=all-MiniLM-L6-v2 ``` sqlite-vec (from starter distro): ``` llama stack run llama_stack/templates/starter/run.yaml LLAMA_STACK_CONFIG=http://localhost:8321 \ pytest -sv tests/integration/vector_io/test_openai_vector_stores.py \ --embedding-model=all-MiniLM-L6-v2 ``` ### file_search verification tests I also ensured the file_search verification tests continue to work, both for faiss and sqlite-vec. faiss (ollama distro): ``` INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \ llama stack run llama_stack/templates/ollama/run.yaml pytest -sv tests/verifications/openai_api/test_responses.py \ -k'file_search' \ --base-url=http://localhost:8321/v1/openai/v1 \ --model=meta-llama/Llama-3.2-3B-Instruct ``` sqlite-vec (starter distro): ``` llama stack run llama_stack/templates/starter/run.yaml pytest -sv tests/verifications/openai_api/test_responses.py \ -k'file_search' \ --base-url=http://localhost:8321/v1/openai/v1 \ --model=together/meta-llama/Llama-3.2-3B-Instruct-Turbo ``` --------- Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-06-19 11:08:24 -04:00
Sébastien Han	fa1d986f72	fix: remove asyncio.TimeoutError since Python update (#2476) # What does this PR do? Since we now support Python starting from 3.11, this is not needed anymore. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-06-19 18:52:41 +05:30
Charlie Doern	d12f195f56	feat: drop python 3.10 support (#2469) # What does this PR do? dropped python3.10, updated pyproject and dependencies, and also removed some blocks of code with special handling for enum.StrEnum Closes #2458 Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-06-19 12:07:14 +05:30
ehhuang	db2cd9e8f3	feat: support filters in file search (#2472) # What does this PR do? Move to use vector_stores.search for file search tool in Responses, which supports filters. closes #2435 ## Test Plan Added e2e test with filters. myenv ❯ llama stack run llama_stack/templates/fireworks/run.yaml pytest -sv tests/verifications/openai_api/test_responses.py \ -k 'file_search and filters' \ --base-url=http://localhost:8321/v1/openai/v1 \ --model=meta-llama/Llama-3.3-70B-Instruct	2025-06-18 21:50:55 -07:00
Sumit Jaiswal	90d03552d4	feat: To add health check for faiss inline vector_io provider (#2319) # What does this PR do? To add health check for faiss inline vector_io provider. I tried adding `async def health(self) -> HealthResponse:` like in inference provider, but it didn't work for `inline->vector_io->faiss` provider. And via debug logs, I understood the critical issue, that the health responses are being stored with the API name as the key, not as a nested dictionary with provider IDs. This means that all providers of the same API type (e.g., "vector_io") will share the same health response, and only the last one processed will be visible in the API response. I've created a patch file that fixes this issue by: - Storing the original get_providers_health method - Creating a patched version that correctly maps health responses to providers - Applying the patch to the `ProviderImpl` class Not an expert, so please let me know, if there can be any other workaround using which I can get the health status updated directly from `faiss.py`. ## Test Plan Added unit tests to test the provider patch implementation in the PR. Adding a screenshot with the FAISS inline vector_io health status as "OK" ![faiss_health_check](https://github.com/user-attachments/assets/d769e762-890c-41ea-a596-5e90951f79a4)	2025-06-18 17:56:25 +02:00
Hardik Shah	822307e6d5	fix: Do not throw when listing vector stores (#2460) When trying to `list` vector_stores, if we cannot retrieve one, log an error and return all the ones that are valid. ### Test Plan ``` pytest -sv --stack-config=http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2 ``` Also tested for `--stack-config fireworks`	2025-06-17 11:19:43 -07:00
Hardik Shah	985d0b156c	feat: Add `suffix` to openai_completions (#2449) Code completion apps need "fill in the middle" capabilities. Added option of `suffix` to `openai_completion` to enable this. Updated ollama provider to showcase the same. ### Test Plan ``` pytest -sv --stack-config="inference=ollama" tests/integration/inference/test_openai_completion.py --text-model qwen2.5-coder:1.5b -k test_openai_completion_non_streaming_suffix ``` ### OpenAI Sample script ``` from openai import OpenAI client = OpenAI(base_url="http://localhost:8321/v1/openai/v1") response = client.completions.create( model="qwen2.5-coder:1.5b", prompt="The capital of ", suffix="is Paris.", max_tokens=10, ) print(response.choices[0].text) ``` ### Output ``` France is ____. To answer this question, we ```	2025-06-13 16:06:06 -07:00
Ben Browning	941f505eb0	feat: File search tool for Responses API (#2426) # What does this PR do? This is an initial working prototype of wiring up the `file_search` builtin tool for the Responses API to our existing rag knowledge search tool. This is me seeing what I could pull together on top of the bits we already have merged. This may not be the ideal way to implement this, and things like how I shuffle the vector store ids from the original response API tool request to the actual tool execution feel a bit hacky (grep for `tool_kwargs["vector_db_ids"]` in `_execute_tool_call` to see what I mean). ## Test Plan I stubbed in some new tests to exercise this using text and pdf documents. Note that this is currently under tests/verification only because it sometimes flakes with tool calling of the small Llama-3.2-3B model we run in CI (and that I use as an example below). We'd want to make the test a bit more robust in some way if we moved this over to tests/integration and ran it in CI. ### OpenAI SaaS (to verify test correctness) ``` pytest -sv tests/verifications/openai_api/test_responses.py \ -k 'file_search' \ --base-url=https://api.openai.com/v1 \ --model=gpt-4o ``` ### Fireworks with faiss vector store ``` llama stack run llama_stack/templates/fireworks/run.yaml pytest -sv tests/verifications/openai_api/test_responses.py \ -k 'file_search' \ --base-url=http://localhost:8321/v1/openai/v1 \ --model=meta-llama/Llama-3.3-70B-Instruct ``` ### Ollama with faiss vector store This sometimes flakes on Ollama because the quantized small model doesn't always choose to call the tool to answer the user's question. But, it often works. 
``` ollama run llama3.2:3b INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \ llama stack run ./llama_stack/templates/ollama/run.yaml \ --image-type venv \ --env OLLAMA_URL="http://0.0.0.0:11434" pytest -sv tests/verifications/openai_api/test_responses.py \ -k'file_search' \ --base-url=http://localhost:8321/v1/openai/v1 \ --model=meta-llama/Llama-3.2-3B-Instruct ``` ### OpenAI provider with sqlite-vec vector store ``` llama stack run ./llama_stack/templates/starter/run.yaml --image-type venv pytest -sv tests/verifications/openai_api/test_responses.py \ -k 'file_search' \ --base-url=http://localhost:8321/v1/openai/v1 \ --model=openai/gpt-4o-mini ``` ### Ensure existing vector store integration tests still pass ``` ollama run llama3.2:3b INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \ llama stack run ./llama_stack/templates/ollama/run.yaml \ --image-type venv \ --env OLLAMA_URL="http://0.0.0.0:11434" LLAMA_STACK_CONFIG=http://localhost:8321 \ pytest -sv tests/integration/vector_io \ --text-model "meta-llama/Llama-3.2-3B-Instruct" \ --embedding-model=all-MiniLM-L6-v2 ``` --------- Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-06-13 14:32:48 -04:00
Hardik Shah	0bc1747ed8	feat: update search for vector_stores (#2441) Updated the `search` functionality return response to match openai. ## Test Plan ``` pytest -sv --stack-config=http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2 ```	2025-06-12 15:34:22 -07:00
Hardik Shah	de37a04c3e	fix: set appropriate defaults for params (#2434) Setting defaults to be `| None`, else they get marked as required params in the OpenAPI spec.	2025-06-11 17:30:34 -07:00
Hardik Shah	d55100d9b7	feat: OpenAIVectorIOMixin for vector_stores common logic (#2427) Extracts common OpenAI vector-store code into its own mixin so that all providers can share the same core logic. This also makes it easy for Llama Stack to support both vector-stores and Llama Stack APIs in the interim so that both share the same underlying vector-dbs. Each provider contains storage specific logic to `create / edit / delete / list` vector dbs while the plumbing logic is standardized in the common code. Ensured that this works well with both faiss and sqlite-vec. ### Test Plan ``` llama stack run starter pytest -sv --stack-config http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2 ```	2025-06-11 15:40:57 -07:00
Rohan Awhad	4e37b49cdc	fix: #1867 InferenceRouter has no attribute formatter (#2422) # What does this PR do? Closes #1867 [Steps to reproduce the bug](https://github.com/meta-llama/llama-stack/issues/1867#issuecomment-2956819381) The change was designed to minimize code changes. Open to option of skipping `metrics` field entirely when `telemetry` is disabled. ## Test Plan 1. Build llama-stack remote-vllm container ```bash llama stack build --template remote-vllm --image-type container ``` 2. Create a small run.yaml ```yaml version: '2' image_name: remote-vllm apis: - inference providers: inference: - provider_id: vllm-inference provider_type: remote::vllm config: url: ${env.VLLM_URL:http://localhost:8000/v1} max_tokens: ${env.VLLM_MAX_TOKENS:4096} api_token: ${env.VLLM_API_TOKEN:fake} tls_verify: ${env.VLLM_TLS_VERIFY:true} metadata_store: type: sqlite db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/remote-vllm}/registry.db inference_store: type: sqlite db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/remote-vllm}/inference_store.db models: - metadata: {} model_id: ${env.INFERENCE_MODEL} provider_id: vllm-inference model_type: llm shields: [] vector_dbs: [] datasets: [] scoring_fns: [] benchmarks: [] server: port: 8321 ``` 3. Run the llama-stack server ```bash export VLLM_URL="http://localhost:8000/v1" export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack run run.yaml ``` 4. Then perform a curl ```bash curl -X 'POST' \ 'http://localhost:8321/v1/inference/completion' \ -H 'accept: application/json' \ -H 'Content-Type: application/json' \ -d '{ "model_id": "meta-llama/Llama-3.2-3B-Instruct", "content": "string", "sampling_params": { "strategy": { "type": "greedy" }, "max_tokens": 10, "repetition_penalty": 1, "stop": [ "string" ] }, "stream": false, "logprobs": { "top_k": 0 } }' ``` 5. You should receive a 200 response with metric values set to 0, similar to one below: ``` { "metrics": [ { "metric": "prompt_tokens", "value": 0, "unit": null }, { "metric": "completion_tokens", "value": 0, "unit": null }, { "metric": "total_tokens", "value": 0, "unit": null } ], [...] } ``` Co-authored-by: Rohan Awhad <rawhad@redhat.com>	2025-06-11 18:14:41 +02:00
Hardik Shah	5ac43268e8	feat: Add OpenAI compat /v1/vector_store APIs (#2423) Adding OpenAI compat `/v1/vector-store` apis. This PR implements the `faiss` provider with followup PRs coming up for other providers. Added routes to create, update, delete, list vector stores. Also added route to search a vector store. Inserting into vector stores is missing and will be a follow up diff. ### Test Plan - Added new integration test for testing the faiss provider ``` pytest -sv --stack-config http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2 ```	2025-06-10 13:07:39 -07:00
Sumit Jaiswal	33ecefd284	feat: To add health status check for remote VLLM (#2303 ) # What does this PR do? To add health status check for remote VLLM ## Test Plan PR includes the unit test to test the added health check implementation feature.	2025-06-06 15:33:12 -04:00
grs	7c1998db25	feat: fine grained access control policy (#2264 ) This allows a set of rules to be defined for determining access to resources. The rules are (loosely) based on the cedar policy format. A rule defines a list of actions either to permit or to forbid. It may specify a principal or a resource that must match for the rule to take effect. It may also specify a condition, either a 'when' or an 'unless', with additional constraints as to where the rule applies. A list of rules is held for each type to be protected and tried in order to find a match. If a match is found, the request is permitted or forbidden depending on the type of rule. If no match is found, the request is denied. If no rules are specified for a given type, a rule that allows any action as long as the resource attributes match the user attributes is added (i.e. the previous behaviour is the default). Some examples in yaml: ``` model: - permit: principal: user-1 actions: [create, read, delete] comment: user-1 has full access to all models - permit: principal: user-2 actions: [read] resource: model-1 comment: user-2 has read access to model-1 only - permit: actions: [read] when: user_in: resource.namespaces comment: any user has read access to models with matching attributes vector_db: - forbid: actions: [create, read, delete] unless: user_in: role::admin comment: only user with admin role can use vector_db resources ``` --------- Signed-off-by: Gordon Sim <gsim@redhat.com>	2025-06-03 14:51:12 -07:00
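The evaluation order described in the commit above (rules tried in order, the first match decides, no match means deny) can be sketched roughly as follows. This is an illustration only, not the Llama Stack implementation: the `Rule` dataclass, its field names, and `is_allowed` are hypothetical, and the `when`/`unless` conditions are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    # Hypothetical shape, loosely mirroring the YAML examples in the commit message.
    effect: str                      # "permit" or "forbid"
    actions: List[str]
    principal: Optional[str] = None  # None matches any principal
    resource: Optional[str] = None   # None matches any resource


def is_allowed(rules, principal, action, resource):
    """Try rules in order; the first matching rule decides. No match means deny."""
    for rule in rules:
        if action not in rule.actions:
            continue
        if rule.principal is not None and rule.principal != principal:
            continue
        if rule.resource is not None and rule.resource != resource:
            continue
        return rule.effect == "permit"
    return False  # default deny when nothing matched


# Two of the "model" rules from the example, expressed with the hypothetical Rule type.
model_rules = [
    Rule("permit", ["create", "read", "delete"], principal="user-1"),
    Rule("permit", ["read"], principal="user-2", resource="model-1"),
]
```

Because the first match wins, rule order matters: a broad `forbid` placed before a narrow `permit` would shadow it.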
Hardik Shah	b21050935e	feat: New OpenAI compat embeddings API (#2314 ) # What does this PR do? Adds a new endpoint that is compatible with OpenAI for embeddings api. `/openai/v1/embeddings` Added providers for OpenAI, LiteLLM and SentenceTransformer. ## Test Plan ``` LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004 ```	2025-05-31 22:11:47 -07:00
Ashwin Bharambe	ce33d02443	fix(tools): do not index tools, only index toolgroups (#2261 ) When registering an MCP endpoint, we cannot list tools (like we used to) since the MCP endpoint may be behind an auth wall. Registration can happen much sooner (via run.yaml). Instead, we do listing only when the _user_ actually calls listing. Furthermore, we cache the list in-memory in the server. Currently, the cache is not invalidated -- we may want to periodically re-list for MCP servers. Note that users must call `list_tools` before calling `invoke_tool` -- we use this critically. This will enable us to list MCP servers in run.yaml ## Test Plan Existing tests, updated tests accordingly.	2025-05-25 13:27:52 -07:00
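The behaviour described in the commit above (register toolgroups eagerly, list tools lazily, cache the listing in memory without invalidation) might look roughly like this. It is a sketch under assumptions: `ToolGroupIndex`, `find_toolgroup_for`, and the injected `list_tools_fn` callback are hypothetical names, not the Llama Stack classes.

```python
class ToolGroupIndex:
    """Hypothetical sketch: index toolgroups only; list their tools on demand."""

    def __init__(self, list_tools_fn):
        # list_tools_fn stands in for a (possibly authenticated) call to an MCP endpoint.
        self._list_tools_fn = list_tools_fn
        self._toolgroups = set()
        self._tool_cache = {}  # toolgroup_id -> list of tool names; never invalidated here

    def register_toolgroup(self, toolgroup_id):
        # Registration makes no call to the endpoint, so it can happen early.
        self._toolgroups.add(toolgroup_id)

    def list_tools(self, toolgroup_id):
        if toolgroup_id not in self._toolgroups:
            raise KeyError(f"unknown toolgroup: {toolgroup_id}")
        if toolgroup_id not in self._tool_cache:
            self._tool_cache[toolgroup_id] = list(self._list_tools_fn(toolgroup_id))
        return self._tool_cache[toolgroup_id]

    def find_toolgroup_for(self, tool_name):
        # Only tools seen by an earlier list_tools() call can be resolved.
        for toolgroup_id, tools in self._tool_cache.items():
            if tool_name in tools:
                return toolgroup_id
        raise LookupError(f"tool {tool_name!r} not found; call list_tools first")
```

The trade-off is visible in `find_toolgroup_for`: a tool can only be resolved after its toolgroup has been listed at least once, which matches the ordering requirement stated in the commit message.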
Ashwin Bharambe	298721c238	chore: split routing_tables into individual files (#2259 )	2025-05-24 23:15:05 -07:00
Ashwin Bharambe	eedf21f19c	chore: split routers into individual files (inference, tool, vector_io, eval_scoring) (#2258 )	2025-05-24 22:59:07 -07:00
Ashwin Bharambe	ae7272d8ff	chore: split routers into individual files (datasets) (#2249 )	2025-05-24 22:11:43 -07:00
Ashwin Bharambe	a2160dc0af	chore: split routers into individual files (safety) Reviewers: bbrowning, leseb, ehhuang, terrytangyuan, raghotham, yanxi0830, hardikjshah Reviewed By: raghotham Pull Request: https://github.com/meta-llama/llama-stack/pull/2248	2025-05-24 22:00:32 -07:00
ehhuang	549812f51e	feat: implement get chat completions APIs (#2200 ) # What does this PR do? * Provide sqlite implementation of the APIs introduced in https://github.com/meta-llama/llama-stack/pull/2145. * Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py and the first Sqlite implementation * Pagination support will be added in a future PR. ## Test Plan Unit test on sql store: <img width="1005" alt="image" src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29" /> Integration test: ``` INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run ``` ``` LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai' ```	2025-05-21 22:21:52 -07:00
Ihar Hrachyshka	db21eab713	fix: catch TimeoutError in place of asyncio.TimeoutError (#2131 ) # What does this PR do? As per docs [1], since python 3.11 wait_for() raises TimeoutError. Since we currently support python 3.10+, we have to catch both. [1]: https://docs.python.org/3.12/library/asyncio-task.html#asyncio.wait_for ## Test Plan No explicit testing; just code hardening to reflect docs. Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-05-12 11:49:59 +02:00
Ben Browning	40e71758d9	fix: inference providers still using tools with `tool_choice="none"` (#2048 ) # What does this PR do? In our OpenAI API verification tests, some providers were still calling tools even when `tool_choice="none"` was passed in the chat completion requests. Because they aren't all respecting `tool_choice` properly, this adjusts our routing implementation to remove the `tools` and `tool_choice` from the request if `tool_choice="none"` is passed in so that it does not attempt to call any of those tools. Adjusting this in the router fixes this across all providers. This also cleans up the non-streaming together.ai responses for tools, ensuring it returns `None` instead of an empty list when there are no tool calls, to exactly match the OpenAI API responses in that case. ## Test Plan I observed existing failures in our OpenAI API verification suite - see https://github.com/bbrowning/llama-stack-tests/blob/main/openai-api-verification/2025-04-27.md#together-llama-stack for the failing `test_chat_*_tool_choice_none` tests. All streaming and non-streaming variants were failing across all 3 tested models. After this change, all of those 6 failing tests are now passing with no regression in the other tests. I verified this via: ``` llama stack run --image-type venv \ tests/verifications/openai-api-verification-run.yaml ``` ``` python -m pytest -s -v \ 'tests/verifications/openai_api/test_chat_completion.py' \ --provider=together-llama-stack ``` The entire verification suite is not 100% on together.ai yet, but it's getting closer. This also increased the pass rate for fireworks.ai, and did not regress the groq or openai tests at all. Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-05-07 14:34:47 +02:00
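The router-level adjustment described in the commit above can be illustrated with a small sketch. The function name `strip_tools_if_none` and the plain-dict request shape are assumptions made for illustration; the actual router works on its own request types.

```python
def strip_tools_if_none(params: dict) -> dict:
    """Return a copy of the request; drop `tools` and `tool_choice` when tool_choice == "none"."""
    if params.get("tool_choice") == "none":
        return {k: v for k, v in params.items() if k not in ("tools", "tool_choice")}
    return dict(params)
```

Doing this once in the router, before the request reaches any provider, means providers that ignore `tool_choice` never see the tools at all.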
Ihar Hrachyshka	9e6561a1ec	chore: enable pyupgrade fixes (#1806 ) # What does this PR do? The goal of this PR is code base modernization. Schema reflection code needed a minor adjustment to handle UnionTypes and collections.abc.AsyncIterator. (Both are preferred for latest Python releases.) Note to reviewers: almost all changes here are automatically generated by pyupgrade. Some additional unused imports were cleaned up. The only change worthy of note can be found under `docs/openapi_generator` and `llama_stack/strong_typing/schema.py` where reflection code was updated to deal with "newer" types. Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-05-01 14:23:50 -07:00
Rashmi Pawar	e6bbf8d20b	feat: Add NVIDIA NeMo datastore (#1852 ) # What does this PR do? Implementation of NeMo Datastore register, unregister API. Open Issues: - provider_id gets set to `localfs` in client.datasets.register() as it is specified in routing_tables.py: DatasetsRoutingTable see: #1860 Currently I have passed `"provider_id":"nvidia"` in metadata and have parsed that in `DatasetsRoutingTable` (Not the best approach, but just a quick workaround to make it work for now.) ## Test Plan - Unit test cases: `pytest tests/unit/providers/nvidia/test_datastore.py` ```bash ========================================================== test session starts =========================================================== platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0 rootdir: /home/ubuntu/llama-stack configfile: pyproject.toml plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0 asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function collected 2 items tests/unit/providers/nvidia/test_datastore.py .. [100%] ============================================================ warnings summary ============================================================ ====================================================== 2 passed, 1 warning in 0.84s ====================================================== ``` cc: @dglogo, @mattf, @yanxi0830	2025-04-28 09:41:59 -07:00
Ben Browning	fa5dfee07b	fix: Return HTTP 400 for OpenAI API validation errors (#2002 ) # What does this PR do? When clients called the OpenAI API with invalid input that wasn't caught by our own Pydantic API validation but instead only caught by the backend inference provider, that backend inference provider was returning an HTTP 400 error. However, we were wrapping that into an HTTP 500 error, obfuscating the actual issue from calling clients and triggering OpenAI client retry logic. This change adjusts our existing `translate_exception` method in `server.py` to wrap `openai.BadRequestError` as HTTP 400 errors, passing through the string representation of the error message to the calling user so they can see the actual input validation error and correct it. I tried changing this in a few other places, but ultimately `translate_exception` was the only real place to handle this for both streaming and non-streaming requests across all inference providers that use the OpenAI server APIs. This also tightens up our validation a bit for the OpenAI chat completions API, to catch empty `messages` parameters, invalid `tool_choice` parameters, invalid `tools` items, or passing `tool_choice` when `tools` isn't given. Lastly, this extends our OpenAI API chat completions verifications to also check for consistent input validation across providers. Providers behind Llama Stack should automatically pass all the new tests due to the input validation added here, but some of the providers fail this test when not run behind Llama Stack due to differences in how they handle input validation and errors. 
(Closes #1951) ## Test Plan To test this, start an OpenAI API verification stack: ``` llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml ``` Then, run the new verification tests with your provider(s) of choice: ``` python -m pytest -s -v \ tests/verifications/openai_api/test_chat_completion.py \ --provider openai-llama-stack python -m pytest -s -v \ tests/verifications/openai_api/test_chat_completion.py \ --provider together-llama-stack ``` Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-23 17:48:32 +02:00
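The status-code mapping described in the commit above can be sketched as follows. This is a simplified illustration, not the code in `server.py`: the local `BadRequestError` class is a stand-in for `openai.BadRequestError` so the sketch runs without the `openai` package, and `translate_exception_sketch` returning a `(status, detail)` tuple is a hypothetical simplification.

```python
class BadRequestError(Exception):
    """Stand-in for openai.BadRequestError (assumption, to keep the sketch self-contained)."""


def translate_exception_sketch(exc: Exception):
    """Map an exception to an (HTTP status, detail) pair."""
    if isinstance(exc, BadRequestError):
        # Pass the provider's validation message through so the caller can fix the input.
        return 400, str(exc)
    # Anything unrecognised stays a generic server error.
    return 500, "Internal server error"
```

Returning 400 rather than 500 also matters for client behaviour: as the commit message notes, the 500 responses were triggering OpenAI client retry logic on requests that could never succeed.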
Ben Browning	7641a5cd0b	fix: 100% OpenAI API verification for together and fireworks (#1946 ) # What does this PR do? TLDR: Changes needed to get 100% passing tests for OpenAI API verification tests when run against Llama Stack with the `together`, `fireworks`, and `openai` providers. And `groq` is better than before, at 88% passing. This cleans up the OpenAI API support for image message types (specifically `image_url` types) and handling of the `response_format` chat completion parameter. Both of these required a few more Pydantic model definitions in our Inference API, just to move from the not-quite-right stubs I had in place to something fleshed out to match the actual OpenAI API specs. As part of testing this, I also found and fixed a bug in the litellm implementation of openai_completion and openai_chat_completion, so the providers based on those should actually be working now. The method `prepare_openai_completion_params` in `llama_stack/providers/utils/inference/openai_compat.py` was improved to actually recursively clean up input parameters, including handling of lists, dicts, and dumping of Pydantic models to dicts. These changes were required to get to 100% passing tests on the OpenAI API verification against the `openai` provider. With the above, the together.ai provider was passing as well as it is without Llama Stack. But, since we have Llama Stack in the middle, I took the opportunity to clean up the together.ai provider so that it now also passes the OpenAI API spec tests we have at 100%. That means together.ai is now passing our verification test better when using an OpenAI client talking to Llama Stack than it is when hitting together.ai directly, without Llama Stack in the middle. And, another round of work for Fireworks to improve translation of incoming OpenAI chat completion requests to Llama Stack chat completion requests gets the fireworks provider passing at 100%. 
The server-side fireworks.ai tool calling support with OpenAI chat completions and Llama 4 models isn't great yet, but by pointing the OpenAI clients at Llama Stack's API we can clean things up and get everything working as expected for Llama 4 models. ## Test Plan ### OpenAI API Verification Tests I ran the OpenAI API verification tests as below and 100% of the tests passed. First, start a Llama Stack server that runs the `openai` provider with the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template setup to do this out of the box, so I added a `tests/verifications/openai-api-verification-run.yaml` to do this. First, ensure you have the necessary API key environment variables set: ``` export TOGETHER_API_KEY="..." export FIREWORKS_API_KEY="..." export OPENAI_API_KEY="..." ``` Then, run a Llama Stack server that serves up all these providers: ``` llama stack run \ --image-type venv \ tests/verifications/openai-api-verification-run.yaml ``` Finally, generate a new verification report against all these providers, both with and without the Llama Stack server in the middle. ``` python tests/verifications/generate_report.py \ --run-tests \ --provider \ together \ fireworks \ groq \ openai \ together-llama-stack \ fireworks-llama-stack \ groq-llama-stack \ openai-llama-stack ``` You'll see that most of the configurations with Llama Stack in the middle now pass at 100%, even though some of them do not pass at 100% when hitting the backend provider's API directly with an OpenAI client. ### OpenAI Completion Integration Tests with vLLM: I also ran the smaller `test_openai_completion.py` test suite (that's not yet merged with the verification tests) on multiple of the providers, since I had to adjust the method signature of openai_chat_completion a bit and thus had to touch lots of these providers to match. 
Here's the tests I ran there, all passing: ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run ``` in another terminal ``` LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct" ``` ### OpenAI Completion Integration Tests with ollama ``` INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run ``` in another terminal ``` LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0" ``` ### OpenAI Completion Integration Tests with together.ai ``` INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run ``` in another terminal ``` LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo" ``` ### OpenAI Completion Integration Tests with fireworks.ai ``` INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run ``` in another terminal ``` LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct" ``` --------- Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-14 08:56:29 -07:00
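The commit above says `prepare_openai_completion_params` now cleans input parameters recursively, handling lists, dicts, and Pydantic models. A rough sketch of that kind of recursion is shown below. It is not the Llama Stack function: `clean_params` is a hypothetical name, treating "clean up" as "drop `None`-valued dict entries" is an assumption, and Pydantic models are detected by duck-typing on `model_dump` so the sketch needs no third-party import.

```python
def clean_params(value):
    """Recursively convert model-like objects to dicts and drop None-valued dict entries."""
    if hasattr(value, "model_dump"):
        # Pydantic v2 models expose model_dump(); any object with that method is accepted here.
        return clean_params(value.model_dump())
    if isinstance(value, dict):
        return {k: clean_params(v) for k, v in value.items() if v is not None}
    if isinstance(value, (list, tuple)):
        return [clean_params(v) for v in value]
    return value
```

The recursion matters because optional fields tend to hide inside nested structures such as `messages` entries or `response_format`, where a top-level-only filter would miss them.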
Sébastien Han	69554158fa	feat: add health to all providers through providers endpoint (#1418 ) The `/v1/providers` now reports the health status of each provider when implemented. ``` curl -L http://127.0.0.1:8321/v1/providers\|jq % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 4072 100 4072 0 0 246k 0 --:--:-- --:--:-- --:--:-- 248k { "data": [ { "api": "inference", "provider_id": "ollama", "provider_type": "remote::ollama", "config": { "url": "http://localhost:11434" }, "health": { "status": "OK" } }, { "api": "vector_io", "provider_id": "faiss", "provider_type": "inline::faiss", "config": { "kvstore": { "type": "sqlite", "namespace": null, "db_path": "/Users/leseb/.llama/distributions/ollama/faiss_store.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "safety", "provider_id": "llama-guard", "provider_type": "inline::llama-guard", "config": { "excluded_categories": [] }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "agents", "provider_id": "meta-reference", "provider_type": "inline::meta-reference", "config": { "persistence_store": { "type": "sqlite", "namespace": null, "db_path": "/Users/leseb/.llama/distributions/ollama/agents_store.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "telemetry", "provider_id": "meta-reference", "provider_type": "inline::meta-reference", "config": { "service_name": "llama-stack", "sinks": "console,sqlite", "sqlite_db_path": "/Users/leseb/.llama/distributions/ollama/trace_store.db" }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "eval", "provider_id": "meta-reference", "provider_type": "inline::meta-reference", "config": { "kvstore": { "type": "sqlite", "namespace": null, "db_path": 
"/Users/leseb/.llama/distributions/ollama/meta_reference_eval.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "datasetio", "provider_id": "huggingface", "provider_type": "remote::huggingface", "config": { "kvstore": { "type": "sqlite", "namespace": null, "db_path": "/Users/leseb/.llama/distributions/ollama/huggingface_datasetio.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "datasetio", "provider_id": "localfs", "provider_type": "inline::localfs", "config": { "kvstore": { "type": "sqlite", "namespace": null, "db_path": "/Users/leseb/.llama/distributions/ollama/localfs_datasetio.db" } }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "scoring", "provider_id": "basic", "provider_type": "inline::basic", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "scoring", "provider_id": "llm-as-judge", "provider_type": "inline::llm-as-judge", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "scoring", "provider_id": "braintrust", "provider_type": "inline::braintrust", "config": { "openai_api_key": "******" }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "brave-search", "provider_type": "remote::brave-search", "config": { "api_key": "****", "max_results": 3 }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "tavily-search", "provider_type": "remote::tavily-search", "config": { "api_key": "****", "max_results": 3 }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", 
"provider_id": "code-interpreter", "provider_type": "inline::code-interpreter", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "rag-runtime", "provider_type": "inline::rag-runtime", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "model-context-protocol", "provider_type": "remote::model-context-protocol", "config": {}, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } }, { "api": "tool_runtime", "provider_id": "wolfram-alpha", "provider_type": "remote::wolfram-alpha", "config": { "api_key": "******" }, "health": { "status": "Not Implemented", "message": "Provider does not implement health check" } } ] } ``` Per providers too: ``` curl -L http://127.0.0.1:8321/v1/providers/ollama {"api":"inference","provider_id":"ollama","provider_type":"remote::ollama","config":{"url":"http://localhost:11434"},"health":{"status":"OK"}} ``` Signed-off-by: Sébastien Han <seb@redhat.com>	2025-04-14 11:59:36 +02:00
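A response shaped like the `/v1/providers` output above can be summarised by health status with a few lines of Python. This is a sketch only: `group_by_health` is a hypothetical helper, and it assumes the JSON has already been parsed into a dict with the `data`, `api`, `provider_id`, and `health.status` fields shown in the sample.

```python
from collections import defaultdict


def group_by_health(providers_response: dict) -> dict:
    """Group "api/provider_id" labels by their reported health status."""
    groups = defaultdict(list)
    for provider in providers_response.get("data", []):
        # Providers without a health entry are bucketed as "unknown" (assumption).
        status = provider.get("health", {}).get("status", "unknown")
        groups[status].append(f'{provider["api"]}/{provider["provider_id"]}')
    return dict(groups)
```

Applied to the sample output, only `inference/ollama` would land under `"OK"`; every other provider reports `"Not Implemented"`.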
Ashwin Bharambe	f34f22f8c7	feat: add batch inference API to llama stack inference (#1945 ) # What does this PR do? This PR adds two methods to the Inference API: - `batch_completion` - `batch_chat_completion` The motivation is for evaluations targeting a local inference engine (like meta-reference or vllm) where batch APIs provide for a substantial amount of acceleration. Why did I not add this to `Api.batch_inference` though? That just resulted in a _lot_ more book-keeping given the structure of Llama Stack. Had I done that, I would have needed to create a notion of a "batch model" resource, setup routing based on that, etc. This does not sound ideal. So what's the future of the batch inference API? I am not sure. Maybe we can keep it for true _asynchronous_ execution. So you can submit requests, and it can return a Job instance, etc. ## Test Plan Run meta-reference-gpu using: ```bash export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000 export MODEL_PARALLEL_SIZE=4 export MAX_BATCH_SIZE=32 export MAX_SEQ_LEN=6144 LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu ``` Then run the batch inference test case.	2025-04-12 11:41:12 -07:00
Ben Browning	2b2db5fbda	feat: OpenAI-Compatible models, completions, chat/completions (#1894 ) # What does this PR do? This stubs in some OpenAI server-side compatibility with three new endpoints: /v1/openai/v1/models /v1/openai/v1/completions /v1/openai/v1/chat/completions This gives common inference apps using OpenAI clients the ability to talk to Llama Stack using an endpoint like http://localhost:8321/v1/openai/v1 . The two "v1" instances in there aren't awesome, but the thinking is that Llama Stack's API is v1 and then our OpenAI compatibility layer is compatible with OpenAI V1. And, some OpenAI clients implicitly assume the URL ends with "v1", so this gives maximum compatibility. The openai models endpoint is implemented in the routing layer, and just returns all the models Llama Stack knows about. The following providers should be working with the new OpenAI completions and chat/completions API: * remote::anthropic (untested) * remote::cerebras-openai-compat (untested) * remote::fireworks (tested) * remote::fireworks-openai-compat (untested) * remote::gemini (untested) * remote::groq-openai-compat (untested) * remote::nvidia (tested) * remote::ollama (tested) * remote::openai (untested) * remote::passthrough (untested) * remote::sambanova-openai-compat (untested) * remote::together (tested) * remote::together-openai-compat (untested) * remote::vllm (tested) The goal is to support this for every inference provider - proxying directly to the provider's OpenAI endpoint for OpenAI-compatible providers. For providers that don't have an OpenAI-compatible API, we'll add a mixin to translate incoming OpenAI requests to Llama Stack inference requests and translate the Llama Stack inference responses to OpenAI responses. This is related to #1817 but is a bit larger in scope than just chat completions, as I have real use-cases that need the older completions API as well. 
## Test Plan ### vLLM ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct" ``` ### ollama ``` INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0" ``` ## Documentation Run a Llama Stack distribution that uses one of the providers mentioned in the list above. Then, use your favorite OpenAI client to send completion or chat completion requests with the base_url set to http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the host and port of your Llama Stack server, if different. --------- Signed-off-by: Ben Browning <bbrownin@redhat.com>	2025-04-11 13:14:17 -07:00
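The base URL and the three routes named in the commit above compose as shown in this small sketch. The helper names `openai_compat_base_url` and `route_url` are hypothetical; the host, port, and paths are the ones given in the commit message.

```python
def openai_compat_base_url(host: str = "localhost", port: int = 8321) -> str:
    """Build the OpenAI-compatible base URL: Llama Stack's /v1 prefix plus the OpenAI /v1 suffix."""
    return f"http://{host}:{port}/v1/openai/v1"


def route_url(base_url: str, route: str) -> str:
    """Join a base URL and a route without doubling or dropping slashes."""
    return f"{base_url.rstrip('/')}/{route.lstrip('/')}"


BASE_URL = openai_compat_base_url()
ROUTES = {name: route_url(BASE_URL, name) for name in ("models", "completions", "chat/completions")}
```

With the `openai` Python package, the same base URL would typically be passed as the `base_url` argument when constructing the client; whether an API key is required depends on how the Llama Stack server is configured.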
Paolo Dettori	22814299b0	fix: solve unregister_toolgroup error (#1608 ) # What does this PR do? Fixes issue #1537 that causes "500 Internal Server Error" when unregistering a toolgroup # (Closes #1537 ) ## Test Plan ```console $ pytest -s -v tests/integration/tool_runtime/test_registration.py --stack-config=ollama --env INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" INFO 2025-03-14 21:15:03,999 tests.integration.conftest:41 tests: Setting DISABLE_CODE_SANDBOX=1 for macOS /opt/homebrew/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset. The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session" warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET)) ===================================================== test session starts ===================================================== platform darwin -- Python 3.10.16, pytest-8.3.5, pluggy-1.5.0 -- /opt/homebrew/opt/python@3.10/bin/python3.10 cachedir: .pytest_cache rootdir: /Users/paolo/Projects/aiplatform/llama-stack configfile: pyproject.toml plugins: asyncio-0.25.3, anyio-4.8.0 asyncio: mode=strict, asyncio_default_fixture_loop_scope=None collected 1 item tests/integration/tool_runtime/test_registration.py::test_register_and_unregister_toolgroup[None-None-None-None-None] INFO 2025-03-14 21:15:04,478 llama_stack.providers.remote.inference.ollama.ollama:75 inference: checking connectivity to Ollama at `http://localhost:11434`... INFO 2025-03-14 21:15:05,350 llama_stack.providers.remote.inference.ollama.ollama:294 inference: Pulling embedding model `all-minilm:latest` if necessary... 
INFO: Started server process [78391] INFO: Waiting for application startup. INFO: Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit) INFO: 127.0.0.1:57424 - "GET /sse HTTP/1.1" 200 OK INFO: 127.0.0.1:57434 - "GET /sse HTTP/1.1" 200 OK INFO 2025-03-14 21:15:16,129 mcp.client.sse:51 uncategorized: Connecting to SSE endpoint: http://localhost:8000/sse INFO: 127.0.0.1:57445 - "GET /sse HTTP/1.1" 200 OK INFO 2025-03-14 21:15:16,146 mcp.client.sse:71 uncategorized: Received endpoint URL: http://localhost:8000/messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b INFO 2025-03-14 21:15:16,147 mcp.client.sse:140 uncategorized: Starting post writer with endpoint URL: http://localhost:8000/messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted INFO 2025-03-14 21:15:16,155 mcp.server.lowlevel.server:535 uncategorized: Processing request of type ListToolsRequest PASSED =============================================== 1 passed, 4 warnings in 12.17s ================================================ ``` --------- Signed-off-by: Paolo Dettori <dettori@us.ibm.com>	2025-04-09 10:56:07 +02:00
Ihar Hrachyshka	0a895c70d1	fix(api): don't return list for runtime tools (#1686 ) # What does this PR do? Don't return list for runtime tools. Instead return Response object for pagination and consistency with other APIs. --------- Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-04-01 09:53:11 +02:00
Sébastien Han	2ffa2b77ed	refactor: extract pagination logic into shared helper function (#1770 ) # What does this PR do? Move pagination logic from LocalFS and HuggingFace implementations into a common helper function to ensure consistent pagination behavior across providers. This reduces code duplication and centralizes pagination logic in one place. ## Test Plan Run this script: ``` from llama_stack_client import LlamaStackClient # Initialize the client client = LlamaStackClient(base_url="http://localhost:8321") # Register a dataset response = client.datasets.register( purpose="eval/messages-answer", # or "eval/question-answer" or "post-training/messages" source={"type": "uri", "uri": "huggingface://datasets/llamastack/simpleqa?split=train"}, dataset_id="my_dataset", # optional, will be auto-generated if not provided metadata={"description": "My evaluation dataset"}, # optional ) # Verify the dataset was registered by listing all datasets datasets = client.datasets.list() print(f"Registered datasets: {[d.identifier for d in datasets]}") # You can then access the data using the datasetio API # rows = client.datasets.iterrows(dataset_id="my_dataset", start_index=1, limit=2) rows = client.datasets.iterrows(dataset_id="my_dataset") print(f"Data: {rows.data}") ``` And play with `start_index` and `limit`. [//]: # (## Documentation) Signed-off-by: Sébastien Han <seb@redhat.com>	2025-03-31 13:08:29 -07:00
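A shared pagination helper of the kind this commit describes could look roughly like the sketch below. The function name `paginate`, its return shape, and the rule that `limit=None` means "all remaining rows" are assumptions for illustration; only the `start_index` and `limit` parameter names come from the commit message.

```python
from typing import Any, List, Optional, Tuple


def paginate(
    rows: List[Any], start_index: int = 0, limit: Optional[int] = None
) -> Tuple[List[Any], bool]:
    """Return one page of rows plus a flag saying whether more rows follow."""
    if start_index < 0:
        raise ValueError("start_index must be >= 0")
    if limit is not None and limit < 0:
        raise ValueError("limit must be >= 0 or None")

    # Assumption: limit=None returns everything from start_index onward.
    if limit is None:
        end = len(rows)
    else:
        end = min(start_index + limit, len(rows))

    page = rows[start_index:end]
    has_more = end < len(rows)
    return page, has_more


if __name__ == "__main__":
    data = ["a", "b", "c", "d", "e"]
    print(paginate(data, start_index=1, limit=2))
    print(paginate(data, start_index=3))
```

Keeping the slicing rules in one function means each provider only has to load its rows and delegate, which is the consistency benefit the commit message points to.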
Xi Yan	baf68c665c	fix: fix jobs api literal return type (#1757 ) # What does this PR do? - We cannot directly return a literal type > Note: this is not final jobs API change [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan <img width="837" alt="image" src="https://github.com/user-attachments/assets/18a17561-35f9-443d-987d-54afdd6ff40c" /> [//]: # (## Documentation)	2025-03-21 14:04:21 -07:00
Ashwin Bharambe	03b5c61bfc	feat: make sure agent sessions are under access control (#1737 ) This builds on top of #1703. Agent sessions are now properly access controlled. ## Test Plan Added unit tests	2025-03-21 07:31:16 -07:00
Ashwin Bharambe	01a25d9744	feat(server): add attribute based access control for resources (#1703 ) This PR introduces a way to implement Attribute Based Access Control (ABAC) for the Llama Stack server. The rough design is: - https://github.com/meta-llama/llama-stack/pull/1626 added a way for the Llama Stack server to query an authenticator - We build upon that and expect "access attributes" as part of the response. These attributes indicate the scopes available for the request. - We use these attributes to perform access control for registered resources as well as for constructing the default access control policies for newly created resources. - By default, if you support authentication but don't return access attributes, we will add a unique namespace pointing to the API_KEY. That way, all resources by default will be scoped to API_KEYs. An important aspect of this design is that Llama Stack stays out of the business of credential management or the CRUD for attributes. How you manage your namespaces or projects is entirely up to you. The design only implements access control checks for the metadata / book-keeping information that the Stack tracks. ### Limitations - Currently, read vs. write vs. admin permissions aren't made explicit, but this can be easily extended by adding appropriate attributes to the `AccessAttributes` data structure. - This design does not apply to agent instances since they are not considered resources the Stack knows about. Agent instances are completely within the scope of the Agents API provider. ### Test Plan Added unit tests, existing integration tests	2025-03-19 21:28:52 -07:00
ehhuang	1902e5754c	fix: toolgroups unregister (#1704 ) # What does this PR do? FAILED tests/integration/tools/test_tools.py::test_toolsgroups_unregister[None] - AttributeError: 'coroutine' object has no attribute 'data' ## Test Plan LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/tools/test_tools.py --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1704). * #1705 * __->__ #1704	2025-03-19 13:43:51 -07:00
Botao Chen	ab777ef5cd	fix: fix open-benchmark template (#1695 ) ## What does this PR do? The open-benchmark template has been broken since the datasets API refactor, for two reasons: - provider_id and provider_resource_id are no longer needed - the type in run.yaml will be resolved as dict. This PR fixes both issues. ## Test Spun up a Llama Stack server successfully with llama stack run `llama_stack/templates/open-benchmark/run.yaml`	2025-03-19 11:27:11 -07:00
Sébastien Han	c029fbcd13	fix: return 4xx for non-existent resources in GET requests (#1635 ) # What does this PR do? - Removed Optional return types for GET methods - Raised ValueError when requested resource is not found - Ensures proper 4xx response for missing resources - Updated the API generator to check for wrong signatures ``` $ uv run --with ".[dev]" ./docs/openapi_generator/run_openapi_generator.sh Validating API method return types... API Method Return Type Validation Errors: Method ScoringFunctions.get_scoring_function returns Optional type ``` Closes: https://github.com/meta-llama/llama-stack/issues/1630 ## Test Plan Run the server then: ``` curl http://127.0.0.1:8321/v1/models/foo {"detail":"Invalid value: Model 'foo' not found"} ``` Server log: ``` INFO: 127.0.0.1:52307 - "GET /v1/models/foo HTTP/1.1" 400 Bad Request 09:51:42.654 [END] /v1/models/foo [StatusCode.OK] (134.65ms) 09:51:42.651 [ERROR] Error executing endpoint route='/v1/models/{model_id:path}' method='get' Traceback (most recent call last): File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 193, in endpoint return await maybe_await(value) File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 156, in maybe_await return await value File "/Users/leseb/Documents/AI/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper result = await method(self, *args, **kwargs) File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 217, in get_model raise ValueError(f"Model '{model_id}' not found") ValueError: Model 'foo' not found ``` Signed-off-by: Sébastien Han <seb@redhat.com>	2025-03-18 14:06:53 -07:00
Xi Yan	5287b437ae	feat(api): (1/n) datasets api clean up (#1573 ) ## PR Stack - https://github.com/meta-llama/llama-stack/pull/1573 - https://github.com/meta-llama/llama-stack/pull/1625 - https://github.com/meta-llama/llama-stack/pull/1656 - https://github.com/meta-llama/llama-stack/pull/1657 - https://github.com/meta-llama/llama-stack/pull/1658 - https://github.com/meta-llama/llama-stack/pull/1659 - https://github.com/meta-llama/llama-stack/pull/1660 Client SDK - https://github.com/meta-llama/llama-stack-client-python/pull/203 CI - `1391130488` <img width="1042" alt="image" src="https://github.com/user-attachments/assets/69636067-376d-436b-9204-896e2dd490ca" /> -- the test_rag_agent_with_attachments is flaky and not related to this PR ## Doc <img width="789" alt="image" src="https://github.com/user-attachments/assets/b88390f3-73d6-4483-b09a-a192064e32d9" /> ## Client Usage ```python client.datasets.register( source={ "type": "uri", "uri": "lsfs://mydata.jsonl", }, schema="jsonl_messages", # optional dataset_id="my_first_train_data" ) # quick prototype debugging client.datasets.register( data_reference={ "type": "rows", "rows": [ {"messages": [...]}, ], }, schema="jsonl_messages", ) ``` ## Test Plan - CI: `1387805545` ``` LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py ``` ``` LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring/test_scoring.py ``` ``` pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb ```	2025-03-17 16:55:45 -07:00
Dinesh Yeduguru	99bbe0e70b	feat: Add new compact MetricInResponse type (#1593 ) # What does this PR do? This change adds a compact type to include metrics in response as opposed to the full MetricEvent which is relevant for internal logging purposes. ## Test Plan ``` LLAMA_STACK_CONFIG=~/.llama/distributions/fireworks/fireworks-run.yaml pytest -s -v agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml curl --request POST \ --url http://localhost:8321/v1/inference/chat-completion \ --header 'content-type: application/json' \ --data '{ "model_id": "meta-llama/Llama-3.1-70B-Instruct", "messages": [ { "role": "user", "content": { "type": "text", "text": "where do humans live" } } ], "stream": false }' { "metrics": [ { "metric": "prompt_tokens", "value": 10, "unit": null }, { "metric": "completion_tokens", "value": 522, "unit": null }, { "metric": "total_tokens", "value": 532, "unit": null } ], "completion_message": { "role": "assistant", "content": "Humans live in various parts of the world...............", "stop_reason": "out_of_tokens", "tool_calls": [] }, "logprobs": null } ```	2025-03-12 15:45:44 -07:00
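To illustrate the difference between the compact metric and the full event, here is a hedged sketch that reduces a full metric event (the shape shown in the #1552 entry, with `trace_id`, `span_id`, and so on) to the three fields that appear in the response above. The type name comes from the commit title; the dataclass layout and the `to_compact` helper are assumptions, not the project's actual definitions.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class MetricInResponse:
    """Compact metric: only what a client needs to read from a response."""

    metric: str
    value: Union[int, float]
    unit: Optional[str] = None


def to_compact(event: dict) -> MetricInResponse:
    """Drop tracing fields (trace_id, span_id, timestamp, attributes)."""
    return MetricInResponse(
        metric=event["metric"],
        value=event["value"],
        unit=event.get("unit"),
    )


if __name__ == "__main__":
    full_event = {
        "trace_id": "kCZwO3tyQC-FuAGb",
        "span_id": "bsP_5a5O",
        "timestamp": "2025-03-11T16:47:38.549084Z",
        "attributes": {"model_id": "meta-llama/Llama-3.1-70B-Instruct"},
        "type": "metric",
        "metric": "prompt_tokens",
        "value": 10,
        "unit": "tokens",
    }
    print(asdict(to_compact(full_event)))
```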
ehhuang	1311faf3f5	fix: logging (#1598 ) Summary: Test Plan:	2025-03-12 14:57:31 -07:00
Dinesh Yeduguru	58d08d100e	feat: Add back inference metrics and preserve context variables across asyncio boundary (#1552 ) # What does this PR do? This PR adds back the changes in #1300 which were reverted in #1476 . It also adds logic to preserve context variables across the asyncio boundary. This is needed with the library client because the async generator logic yields control to code outside the event loop and, on resuming, does not have the same context as before, so the context vars must be preserved explicitly. Addresses #1477 ## Test Plan ``` curl --request POST \ --url http://localhost:8321/v1/inference/chat-completion \ --header 'content-type: application/json' \ --data '{ "model_id": "meta-llama/Llama-3.1-70B-Instruct", "messages": [ { "role": "user", "content": { "type": "text", "text": "where do humans live" } } ], "stream": false }' | jq . { "metrics": [ { "trace_id": "kCZwO3tyQC-FuAGb", "span_id": "bsP_5a5O", "timestamp": "2025-03-11T16:47:38.549084Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "prompt_tokens", "value": 10, "unit": "tokens" }, { "trace_id": "kCZwO3tyQC-FuAGb", "span_id": "bsP_5a5O", "timestamp": "2025-03-11T16:47:38.549449Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "completion_tokens", "value": 369, "unit": "tokens" }, { "trace_id": "kCZwO3tyQC-FuAGb", "span_id": "bsP_5a5O", "timestamp": "2025-03-11T16:47:38.549457Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "total_tokens", "value": 379, "unit": "tokens" } ], "completion_message": { "role": "assistant", "content": "Humans live on the planet Earth, specifically on its landmasses and in its oceans. Here's a breakdown of where humans live:\n\n1. 
Continents: Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica ( temporary residents, mostly scientists and researchers)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. Countries: There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. Cities and towns: Many humans live in urban areas, such as cities and towns, which are often located near coastlines, rivers, or other bodies of water.\n4. Rural areas: Some humans live in rural areas, such as villages, farms, and countryside.\n5. Islands: Humans inhabit many islands around the world, including those in the Pacific, Indian, and Atlantic Oceans.\n6. Mountains and highlands: Humans live in mountainous regions, such as the Himalayas, the Andes, and the Rocky Mountains.\n7. Deserts: Some humans live in desert regions, such as the Sahara, the Mojave, and the Atacama.\n8. Coastal areas: Many humans live in coastal areas, such as beaches, ports, and coastal cities.\n9. Underwater habitats: A few humans live in underwater habitats, such as research stations and submarines.\n10. Space: A small number of humans have lived in space, including astronauts on the International Space Station and those who have visited the Moon.\n\nOverall, humans can be found living in almost every environment on Earth, from the frozen tundra to the hottest deserts, and from the highest mountains to the deepest oceans.", "stop_reason": "end_of_turn", "tool_calls": [] }, "logprobs": null } ``` Original repro no longer showing any error: ``` LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml python -m examples.agents.e2e_loop_with_client_tools localhost 8321 ``` client logs: https://gist.github.com/dineshyv/047c7e87b18a5792aa660e311ea53166 server logs: https://gist.github.com/dineshyv/97a2174099619e9916c7c490be26e559	2025-03-12 12:01:03 -07:00
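The context-variable problem described in #1552 can be reproduced in a few lines. When synchronous code drives an async generator one step at a time, each step runs in a fresh task whose context is copied from the caller, so values set when the generator was created are not visible. The sketch below shows one possible workaround, re-applying saved values before every step. All names (`request_id`, `preserve_context`, `drive`, `start`) are hypothetical, and this is not necessarily how the PR implements it.

```python
import asyncio
import contextvars

request_id = contextvars.ContextVar("request_id", default=None)
_DONE = object()  # sentinel marking generator exhaustion


async def produce():
    """Async generator that reads the context variable on every step."""
    for i in range(2):
        yield (i, request_id.get())


def preserve_context(agen, variables):
    """Wrap agen so the current values of `variables` are restored per step."""
    saved = {var: var.get() for var in variables}

    async def wrapper():
        while True:
            for var, value in saved.items():
                var.set(value)  # re-apply inside the task running this step
            try:
                item = await agen.__anext__()
            except StopAsyncIteration:
                return
            yield item

    return wrapper()


async def _next_or_done(agen):
    try:
        return await agen.__anext__()
    except StopAsyncIteration:
        return _DONE


def drive(agen):
    """Consume an async generator from sync code, one task per step."""
    loop = asyncio.new_event_loop()
    items = []
    try:
        while True:
            item = loop.run_until_complete(_next_or_done(agen))
            if item is _DONE:
                break
            items.append(item)
    finally:
        loop.close()
    return items


def start(wrap: bool):
    """Create the generator inside a context where request_id is set."""

    def make():
        request_id.set("req-1")
        gen = produce()
        return preserve_context(gen, [request_id]) if wrap else gen

    return contextvars.copy_context().run(make)


if __name__ == "__main__":
    print(drive(start(wrap=False)))  # value is lost: each step sees None
    print(drive(start(wrap=True)))   # value is restored: each step sees "req-1"
```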
Sébastien Han	7cf1e24c4e	feat(logging): implement category-based logging (#1362 ) # What does this PR do? This commit introduces a new logging system that allows loggers to be assigned a category while retaining the logger name based on the file name. The log format includes both the logger name and the category, producing output like: ``` INFO 2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by tavily-search ``` Key features include: - Category-based logging: Loggers can be assigned a category (e.g., "core", "server") when programming. The logger can be loaded like this: `logger = get_logger(name=__name__, category="server")` - Environment variable control: Log levels can be configured per-category using the `LLAMA_STACK_LOGGING` environment variable. For example: `LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for the "server" and "core" categories. - `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all categories and third-party libraries. This provides fine-grained control over logging levels while maintaining a clean and informative log format. 
The formatter uses the rich library, which provides nice colors and better stack traces, like so: ``` ERROR 2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146> exception=UnboundLocalError("local variable 'loop' referenced before assignment")> ╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮ │ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown │ │ │ │ 175 │ │ except asyncio.CancelledError: │ │ 176 │ │ │ pass │ │ 177 │ │ finally: │ │ ❱ 178 │ │ │ loop.stop() │ │ 179 │ │ │ 180 │ loop = asyncio.get_running_loop() │ │ 181 │ loop.create_task(shutdown()) │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ UnboundLocalError: local variable 'loop' referenced before assignment ``` Co-authored-by: Ashwin Bharambe <@ashwinb> Signed-off-by: Sébastien Han <seb@redhat.com> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml INFO 2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml INFO 2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration: INFO 2025-03-03 21:55:35,928 __main__:380 [server]: apis: - agents ``` [//]: # (## Documentation) --------- Signed-off-by: Sébastien Han <seb@redhat.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2025-03-07 11:34:30 -08:00
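The `LLAMA_STACK_LOGGING` format described in this commit (`category=level` pairs separated by semicolons, with case-insensitive level names) can be parsed along the lines of the sketch below. The function name `parse_category_levels` and the error handling are assumptions; the project's own parser may differ.

```python
import logging

# Level names accepted by this sketch, matched case-insensitively.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def parse_category_levels(spec: str) -> dict:
    """Parse 'server=DEBUG;core=debug' into {'server': 10, 'core': 10}."""
    levels = {}
    for pair in spec.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty segments such as a trailing ';'
        category, sep, level_name = pair.partition("=")
        if not sep:
            raise ValueError(f"expected category=level, got {pair!r}")
        level = _LEVELS.get(level_name.strip().upper())
        if level is None:
            raise ValueError(f"unknown log level {level_name!r}")
        levels[category.strip().lower()] = level
    return levels


if __name__ == "__main__":
    print(parse_category_levels("server=DEBUG;core=debug"))
    print(parse_category_levels("all=debug"))
```

A caller would typically read the spec from `os.environ.get("LLAMA_STACK_LOGGING", "")` and apply each level to the loggers registered under that category.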
Dinesh Yeduguru	60e7f3d705	fix: Revert "feat: record token usage for inference API (#1300 )" (#1476 ) This reverts commit `b8535417e0`. Test plan: LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml python -m examples.agents.e2e_loop_with_client_tools localhost 8321	2025-03-07 10:16:47 -08:00
Sébastien Han	803bf0e029	fix: solve ruff B008 warnings (#1444 ) # What does this PR do? The commit addresses the Ruff warning B008 by refactoring the code to avoid calling SamplingParams() directly in function argument defaults. Instead, it either uses Field(default_factory=SamplingParams) for Pydantic models or sets the default to None and instantiates SamplingParams inside the function body when the argument is None. Signed-off-by: Sébastien Han <seb@redhat.com>	2025-03-06 16:48:35 -08:00
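Both remedies named in this commit can be shown in miniature. The sketch below uses a stand-in `SamplingParams` dataclass and the standard library's `dataclasses.field(default_factory=...)` in place of Pydantic's `Field(default_factory=...)`, so that it runs without third-party packages; the `Request` and `generate` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SamplingParams:
    """Stand-in for the real SamplingParams type."""

    temperature: float = 0.0
    stop: List[str] = field(default_factory=list)


# Flagged by Ruff B008: the default is built once, at definition time,
# and then shared by every call that omits the argument.
#
#     def generate(prompt, sampling_params=SamplingParams()): ...


@dataclass
class Request:
    """Model-style fix: a factory builds a fresh default per instance."""

    prompt: str
    sampling_params: SamplingParams = field(default_factory=SamplingParams)


def generate(
    prompt: str, sampling_params: Optional[SamplingParams] = None
) -> SamplingParams:
    """Function-style fix: default to None, instantiate in the body."""
    if sampling_params is None:
        sampling_params = SamplingParams()
    return sampling_params


if __name__ == "__main__":
    first, second = generate("a"), generate("b")
    first.stop.append("</s>")
    print(second.stop)  # the second call is unaffected by the mutation
```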
Ihar Hrachyshka	4d4be03176	fix: don't import from llama_models (#1436 ) # What does this PR do? Some imports were not switched to in-tree copy of the modules. This is a follow-up to: https://github.com/meta-llama/llama-stack/pull/1344 Closes #1435 ## Test Plan Manually started the server... [//]: # (## Documentation) Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-03-05 15:30:38 -08:00


125 commits