mirror of https://github.com/meta-llama/llama-stack.git, synced 2025-10-22 16:23:08 +00:00

411 commits

| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|  | 2c43285e22 | feat(stores)!: use backend storage references instead of configs (#3697) **This PR changes configurations in a backward incompatible way.**
Run configs today repeat full SQLite/Postgres snippets everywhere a
store is needed, which means duplicated credentials, extra connection
pools, and lots of drift between files. This PR introduces named storage
backends so the stack and providers can share a single catalog and
reference those backends by name.
## Key Changes
- Add `storage.backends` to `StackRunConfig`, register each KV/SQL
backend once at startup, and validate that references point to the right
family.
- Move server stores under `storage.stores` with lightweight references
(backend + namespace/table) instead of full configs.
- Update every provider/config/doc to use the new reference style;
docs/codegen now surface the simplified YAML.
## Migration
Before:
```yaml
metadata_store:
  type: sqlite
  db_path: ~/.llama/distributions/foo/registry.db
inference_store:
  type: postgres
  host: ${env.POSTGRES_HOST}
  port: ${env.POSTGRES_PORT}
  db: ${env.POSTGRES_DB}
  user: ${env.POSTGRES_USER}
  password: ${env.POSTGRES_PASSWORD}
conversations_store:
  type: postgres
  host: ${env.POSTGRES_HOST}
  port: ${env.POSTGRES_PORT}
  db: ${env.POSTGRES_DB}
  user: ${env.POSTGRES_USER}
  password: ${env.POSTGRES_PASSWORD}
```
After:
```yaml
storage:
  backends:
    kv_default:
      type: kv_sqlite
      db_path: ~/.llama/distributions/foo/kvstore.db
    sql_default:
      type: sql_postgres
      host: ${env.POSTGRES_HOST}
      port: ${env.POSTGRES_PORT}
      db: ${env.POSTGRES_DB}
      user: ${env.POSTGRES_USER}
      password: ${env.POSTGRES_PASSWORD}
  stores:
    metadata:
      backend: kv_default
      namespace: registry
    inference:
      backend: sql_default
      table_name: inference_store
      max_write_queue_size: 10000
      num_writers: 4
    conversations:
      backend: sql_default
      table_name: openai_conversations
```
Provider configs follow the same pattern—for example, a Chroma vector
adapter switches from:
```yaml
providers:
  vector_io:
  - provider_id: chromadb
    provider_type: remote::chromadb
    config:
      url: ${env.CHROMADB_URL}
      kvstore:
        type: sqlite
        db_path: ~/.llama/distributions/foo/chroma.db
```
to:
```yaml
providers:
  vector_io:
  - provider_id: chromadb
    provider_type: remote::chromadb
    config:
      url: ${env.CHROMADB_URL}
      persistence:
        backend: kv_default
        namespace: vector_io::chroma_remote
```
Once the backends are declared, everything else just points at them, so
rotating credentials or swapping to Postgres happens in one place and
the stack reuses a single connection pool. | ||
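The reference scheme described in 2c43285e22 above can be illustrated with a short sketch. This is not the code from the PR: `backend_family`, `resolve_store`, and the plain-dict config are hypothetical names chosen for illustration, and the sketch assumes only what the commit message shows, namely that a backend `type` begins with `kv_` or `sql_` and that each store names its backend.

```python
# Hypothetical sketch: resolve a store reference against the named backends
# and check that it points at the right backend family (KV vs. SQL).


def backend_family(backend_type: str) -> str:
    """Derive the family from a type such as 'kv_sqlite' or 'sql_postgres'."""
    family, _, _ = backend_type.partition("_")
    if family not in ("kv", "sql"):
        raise ValueError(f"unknown backend type: {backend_type!r}")
    return family


def resolve_store(storage: dict, store_name: str, expected_family: str) -> dict:
    """Return the backend config that a store reference points at."""
    ref = storage["stores"][store_name]
    backend_name = ref["backend"]
    if backend_name not in storage["backends"]:
        raise ValueError(f"store {store_name!r} references unknown backend {backend_name!r}")
    backend = storage["backends"][backend_name]
    family = backend_family(backend["type"])
    if family != expected_family:
        raise ValueError(
            f"store {store_name!r} needs a {expected_family} backend, "
            f"but {backend_name!r} is a {family} backend"
        )
    return backend


# Mirrors the shape of the "After" YAML, reduced to two stores.
storage = {
    "backends": {
        "kv_default": {"type": "kv_sqlite", "db_path": "kvstore.db"},
        "sql_default": {"type": "sql_postgres", "host": "localhost"},
    },
    "stores": {
        "metadata": {"backend": "kv_default", "namespace": "registry"},
        "inference": {"backend": "sql_default", "table_name": "inference_store"},
    },
}

metadata_backend = resolve_store(storage, "metadata", "kv")
inference_backend = resolve_store(storage, "inference", "sql")
```

Because every store goes through the same lookup, changing `sql_default` in one place changes the backend for every store that references it, which is the single-point-of-change property the commit message describes.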
|  | 165b8b07f4 | docs: Documentation update for NVIDIA Inference Provider (#3840) # What does this PR do?
- Fix examples in the NVIDIA inference documentation to align with current API requirements.
## Test Plan
N/A | ||
|  | add8cd801b | feat(gemini): Support gemini-embedding-001 and fix models/ prefix in metadata keys (#3813) # Add support for Google Gemini `gemini-embedding-001` embedding model and correctly register the model type
MR message created with the assistance of Claude-4.5-sonnet
This resolves https://github.com/llamastack/llama-stack/issues/3755
## What does this PR do?
This PR adds support for the `gemini-embedding-001` Google embedding model to the llama-stack Gemini provider. This model produces 3072-dimensional embeddings, compared with 768 dimensions for the existing `text-embedding-004` model. Older embedding models (such as text-embedding-004) will be deprecated soon according to Google ([Link](https://developers.googleblog.com/en/gemini-embedding-available-gemini-api/))
## Problem
The Gemini provider only supported the `text-embedding-004` embedding model. The newer `gemini-embedding-001` model, which provides higher-dimensional embeddings for improved semantic representation, was not available through llama-stack.
## Solution
This PR consists of three commits: the first adds the model, the second fixes model registration, and the third enables embedding generation.
### Commit 1: Initial addition of gemini-embedding-001
Added metadata for `gemini-embedding-001` to the `embedding_model_metadata` dictionary:
```python
embedding_model_metadata: dict[str, dict[str, int]] = {
    "text-embedding-004": {"embedding_dimension": 768, "context_length": 2048},
    "gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048},  # NEW
}
```
**Issue discovered:** The model was not being registered correctly because the dictionary keys didn't match the model IDs returned by Gemini's API.
### Commit 2: Fix model ID matching with `models/` prefix
Updated both dictionary keys to include the `models/` prefix to match Gemini's OpenAI-compatible API response format:
```python
embedding_model_metadata: dict[str, dict[str, int]] = {
    "models/text-embedding-004": {"embedding_dimension": 768, "context_length": 2048},  # UPDATED
    "models/gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048},  # UPDATED
}
```
**Root cause:** Gemini's OpenAI-compatible API returns model IDs with the `models/` prefix (e.g., `models/text-embedding-004`). The `OpenAIMixin.list_models()` method directly matches these IDs against the `embedding_model_metadata` dictionary keys. Without the prefix, the models were being registered as LLMs instead of embedding models.
### Commit 3: Fix embedding generation for providers without usage stats
Fixed a bug in `OpenAIMixin.openai_embeddings()` that prevented embedding generation for providers (like Gemini) that don't return usage statistics:
```python
# Before (Lines 351-354):
usage = OpenAIEmbeddingUsage(
    prompt_tokens=response.usage.prompt_tokens,  # ← Crashed with AttributeError
    total_tokens=response.usage.total_tokens,
)

# After (Lines 351-362):
if response.usage:
    usage = OpenAIEmbeddingUsage(
        prompt_tokens=response.usage.prompt_tokens,
        total_tokens=response.usage.total_tokens,
    )
else:
    usage = OpenAIEmbeddingUsage(
        prompt_tokens=0,  # Default when not provided
        total_tokens=0,  # Default when not provided
    )
```
**Impact:** This fix enables embedding generation for **all** Gemini embedding models, not just the newly added one.
## Changes
### Modified Files
**`llama_stack/providers/remote/inference/gemini/gemini.py`**
- Line 17: Updated `text-embedding-004` key to `models/text-embedding-004`
- Line 18: Added `models/gemini-embedding-001` with correct metadata

**`llama_stack/providers/utils/inference/openai_mixin.py`**
- Lines 351-362: Added null check for `response.usage` to handle providers without usage statistics
## Key Technical Details
### Model ID Matching Flow
1. `list_provider_model_ids()` calls Gemini's `/v1/models` endpoint
2. API returns model IDs like: `models/text-embedding-004`, `models/gemini-embedding-001`
3. `OpenAIMixin.list_models()` (line 410) checks: `if metadata := self.embedding_model_metadata.get(provider_model_id)`
4. If matched, registers as `model_type: "embedding"` with metadata; otherwise registers as `model_type: "llm"`
### Why Both Keys Needed the Prefix
The `text-embedding-004` model was already working because there was likely separate configuration or manual registration handling it. For auto-discovery to work correctly for **both** models, both keys must match the API's model ID format exactly.
## How to test this PR
Verified the changes by:
1. **Model Auto-Discovery**: Started llama-stack server and confirmed models are auto-discovered from Gemini API
2. **Model Registration**: Confirmed both embedding models are correctly registered and visible
```bash
curl http://localhost:8325/v1/models | jq '.data[] | select(.provider_id == "gemini" and .model_type == "embedding")'
```
**Results:**
- ✅ `gemini/models/text-embedding-004` - 768 dimensions - `model_type: "embedding"`
- ✅ `gemini/models/gemini-embedding-001` - 3072 dimensions - `model_type: "embedding"`
3. **Before Fix (Commit 1)**: Models appeared as `model_type: "llm"` without embedding metadata
4. **After Fix (Commit 2)**: Models correctly identified as `model_type: "embedding"` with proper metadata
5. **Generate Embeddings**: Verified embedding generation works
```bash
curl -X POST http://localhost:8325/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "gemini/models/gemini-embedding-001", "input": "test"}' | \
  jq '.data[0].embedding | length'
``` | ||
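The matching flow described in add8cd801b above comes down to an exact dictionary lookup, which is why the missing `models/` prefix silently turned embedding models into LLMs. The sketch below reproduces only that lookup as a standalone function; `classify_model` and the returned dict shape are hypothetical, and the metadata values are copied from the commit message.

```python
# Hypothetical standalone version of the lookup described in the commit message.
EMBEDDING_MODEL_METADATA: dict[str, dict[str, int]] = {
    "models/text-embedding-004": {"embedding_dimension": 768, "context_length": 2048},
    "models/gemini-embedding-001": {"embedding_dimension": 3072, "context_length": 2048},
}


def classify_model(provider_model_id: str) -> dict:
    """Treat a model as an embedding model only if its ID matches a metadata key exactly."""
    if metadata := EMBEDDING_MODEL_METADATA.get(provider_model_id):
        return {"model_type": "embedding", "metadata": metadata}
    # No exact match: fall through to the default classification.
    return {"model_type": "llm", "metadata": {}}


# The ID format returned by the API matches the key.
with_prefix = classify_model("models/gemini-embedding-001")
# The unprefixed ID misses the key and is classified as an LLM.
without_prefix = classify_model("gemini-embedding-001")
```

The same mismatch arises in the other direction: with unprefixed keys and prefixed IDs from the API, the lookup misses in exactly the same way.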
|  | ef4bc70bbe | feat: Enable setting a default embedding model in the stack (#3803) # What does this PR do?
Enables automatic embedding model detection for vector stores by using a `default_configured` boolean that can be defined in the `run.yaml`.
## Test Plan
- Unit tests
- Integration tests
- Simple example below:

Spin up the stack:
```bash
uv run llama stack build --distro starter --image-type venv --run
```
Then test with OpenAI's client:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/", api_key="none")
vs = client.vector_stores.create()
```
Previously you needed:
```python
vs = client.vector_stores.create(
    extra_body={
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
        "embedding_dimension": 384,
    }
)
```
The `extra_body` is now unnecessary.

---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com> | ||
|  | d875e427bf | refactor: use `extra_body` to pass in `input_type` params for asymmetric embedding models for NVIDIA Inference Provider (#3804) # What does this PR do?
Previously, the NVIDIA inference provider implemented a custom `openai_embeddings` method with a hardcoded `input_type="query"` parameter, which is required by NVIDIA asymmetric embedding models ([https://github.com/llamastack/llama-stack/pull/3205](https://github.com/llamastack/llama-stack/pull/3205)). An `extra_body` parameter was recently added to the embeddings API ([https://github.com/llamastack/llama-stack/pull/3794](https://github.com/llamastack/llama-stack/pull/3794)).
This PR therefore updates the NVIDIA inference provider to use the base `OpenAIMixin.openai_embeddings` method instead and to pass `input_type` through the `extra_body` parameter for asymmetric embedding models.
## Test Plan
Run the following command for each `embedding_model`: `nvidia/llama-3.2-nv-embedqa-1b-v2`, `nvidia/nv-embedqa-e5-v5`, `nvidia/nv-embedqa-mistral-7b-v2`, and `snowflake/arctic-embed-l`.
```
pytest -s -v tests/integration/inference/test_openai_embeddings.py --stack-config="inference=nvidia" --embedding-model={embedding_model} --env NVIDIA_API_KEY={nvidia_api_key} --env NVIDIA_BASE_URL="https://integrate.api.nvidia.com" --inference-mode=record
``` | ||
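From the caller's side, the change in d875e427bf above means `input_type` is no longer fixed by the provider and travels in `extra_body` instead. The following is a rough sketch of how a caller might assemble such a request; `embedding_request_kwargs` is a hypothetical helper, and the `"passage"` value is an assumption about what NVIDIA's asymmetric models accept for document-side text (the commit message itself only mentions `"query"`).

```python
from typing import Optional


def embedding_request_kwargs(model: str, text: str, input_type: Optional[str] = None) -> dict:
    """Build keyword arguments for an OpenAI-style embeddings call (hypothetical helper)."""
    kwargs: dict = {"model": model, "input": text}
    if input_type is not None:
        # Provider-specific fields go in extra_body rather than top-level arguments.
        kwargs["extra_body"] = {"input_type": input_type}
    return kwargs


# Query-side text, as in the previously hardcoded behaviour.
query_kwargs = embedding_request_kwargs(
    "nvidia/nv-embedqa-e5-v5", "What is Llama Stack?", input_type="query"
)
# Document-side text; "passage" is an assumed value, not taken from the commit.
passage_kwargs = embedding_request_kwargs(
    "nvidia/nv-embedqa-e5-v5", "Llama Stack is a framework.", input_type="passage"
)

# With the OpenAI Python client, these would be passed as:
#   client.embeddings.create(**query_kwargs)
```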
|  | 0dbf79c328 | fix: Fixed WatsonX remote inference provider (#3801) # What does this PR do?
This PR fixes issues with the WatsonX provider so it works correctly with LiteLLM.
The main problem was that WatsonX requests failed because the provider data validator didn’t properly handle the API key and project ID. This was fixed by updating the `WatsonXProviderDataValidator` and ensuring the provider data is loaded correctly.
The `openai_chat_completion` method was also updated to match the behavior of other providers while adding WatsonX-specific fields like `project_id`. It still calls `await super().openai_chat_completion.__func__(self, params)` to keep the existing setup and tracing logic.
After these changes, WatsonX requests now run correctly.
## Test Plan
The changes were tested by running chat completion requests and confirming that credentials and project parameters are passed correctly. I have tested with my WatsonX credentials, by using the cli with `uv run llama-stack-client inference chat-completion --session`

---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com> | ||
|  | ecc8a554d2 | feat(api)!: support extra_body to embeddings and vector_stores APIs (#3794) Applies the same pattern from https://github.com/llamastack/llama-stack/pull/3777 to embeddings and vector_stores.create() endpoints.
This should _not_ be a breaking change since (a) our tests were already using the `extra_body` parameter when passing in to the backend (b) but the backend probably wasn't extracting the parameters correctly. This PR will fix that.
Updated APIs: `openai_embeddings(), openai_create_vector_store(), openai_create_vector_store_file_batch()` | ||
|  | 3bb6ef351b | chore!: Safety api refactoring to use OpenAIMessageParam (#3796) # What does this PR do?
Remove usage of deprecated `Message` from Safety apis
## Test Plan
CI | ||
|  | 06e4cd8e02 | feat(api)!: BREAKING CHANGE: support passing `extra_body` through to providers (#3777) # What does this PR do?
Allows passing through `extra_body` parameters to inference providers.
With this, we moved the two vLLM-specific parameters from the completions API into `extra_body`.
Before/After
<img width="1883" height="324" alt="image" src="https://github.com/user-attachments/assets/acb27c08-c748-46c9-b1da-0de64e9908a1" />
closes #2720
## Test Plan
CI and added new test
```
❯ uv run pytest -s -v tests/integration/ --stack-config=server:starter --inference-mode=record -k 'not( builtin_tool or safety_with_image or code_interpreter or test_rag ) and test_openai_completion_guided_choice' --setup=vllm --suite=base --color=yes
Uninstalled 3 packages in 125ms
Installed 3 packages in 19ms
INFO 2025-10-10 14:29:54,317 tests.integration.conftest:118 tests: Applying setup 'vllm' for suite base
INFO 2025-10-10 14:29:54,331 tests.integration.conftest:47 tests: Test stack config type: server (stack_config=server:starter)
============================== test session starts ==============================
platform darwin -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0 -- /Users/erichuang/projects/llama-stack-1/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.12.11', 'Platform': 'macOS-15.6.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/erichuang/projects/llama-stack-1
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 285 items / 284 deselected / 1 selected

tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=vllm/Qwen/Qwen3-0.6B] instantiating llama_stack_client
Starting llama stack server with config 'starter' on port 8321...
Waiting for server at http://localhost:8321... (0.0s elapsed)
Waiting for server at http://localhost:8321... (0.5s elapsed)
Waiting for server at http://localhost:8321... (5.1s elapsed)
Waiting for server at http://localhost:8321... (5.6s elapsed)
Waiting for server at http://localhost:8321... (10.1s elapsed)
Waiting for server at http://localhost:8321... (10.6s elapsed)
Server is ready at http://localhost:8321
llama_stack_client instantiated in 11.773s
PASSEDTerminating llama stack server process...
Terminating process 98444 and its group...
Server process and children terminated gracefully

============================== slowest 10 durations ==============================
11.88s setup tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=vllm/Qwen/Qwen3-0.6B]
3.02s call tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=vllm/Qwen/Qwen3-0.6B]
0.01s teardown tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=vllm/Qwen/Qwen3-0.6B]
============================== 1 passed, 284 deselected, 3 warnings in 16.21s ==============================
``` | ||
|  | 80d58ab519 | chore: refactor (chat)completions endpoints to use shared params struct (#3761) # What does this PR do?
Converts openai(_chat)_completions params to pydantic BaseModel to reduce code duplication across all providers.
## Test Plan
CI

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/llamastack/llama-stack/pull/3761).
* #3777
* __->__ #3761 | ||
|  | 548ccff368 | fix(mypy): fix wrong attribute access (#3770) | ||
|  | 0066d986c5 | feat: use SecretStr for inference provider auth credentials (#3724) # What does this PR do?
use SecretStr for OpenAIMixin providers
- RemoteInferenceProviderConfig now has auth_credential: SecretStr
- the default alias is api_key (most common name)
- some providers override to use api_token (RunPod, vLLM, Databricks)
- some providers exclude it (Ollama, TGI, Vertex AI)

addresses #3517
## Test Plan
ci w/ new tests | ||
|  | a548169b99 | fix: allow skipping model availability check for vLLM (#3739) # What does this PR do?
Allows model check to fail gracefully instead of crashing on startup.
## Test Plan
Set `VLLM_URL` to your vLLM server:
```
(base) akram@Mac llama-stack % LLAMA_STACK_LOGGING="all=debug" VLLM_ENABLE_MODEL_DISCOVERY=false  MILVUS_DB_PATH=./milvus.db INFERENCE_MODEL=vllm uv run --with llama-stack llama stack build --distro starter  --image-type venv --run
```
```
INFO     2025-10-08 20:11:24,637 llama_stack.providers.utils.inference.inference_store:74 inference: Write queue disabled for SQLite to avoid concurrency issues
INFO     2025-10-08 20:11:24,866 llama_stack.providers.utils.responses.responses_store:96 openai_responses: Write queue disabled for SQLite to avoid concurrency issues
ERROR    2025-10-08 20:11:26,160 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: <a
         href="https://oauth.akram.a1ey.p3.openshiftapps.com:443/oauth/authorize?approval_prompt=force&client_id=system%3Aserviceaccount%3Arhoai-30-genai%3Adefault&redirect_uri=ht
         tps%3A%2F%2Fvllm-rhoai-30-genai.apps.rosa.akram.a1ey.p3.openshiftapps.com%2Foauth%2Fcallback&response_type=code&scope=user%3Ainfo+user%3Acheck-access&state=9fba207425
         5851c718aca717a5887d76%3A%2Fmodels">Found</a>.
         
[...]
INFO     2025-10-08 20:11:26,295 uvicorn.error:84 uncategorized: Started server process [83144]
INFO     2025-10-08 20:11:26,296 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO     2025-10-08 20:11:26,297 llama_stack.core.server.server:170 core::server: Starting up
INFO     2025-10-08 20:11:26,297 llama_stack.core.stack:399 core: starting registry refresh task
INFO     2025-10-08 20:11:26,311 uvicorn.error:62 uncategorized: Application startup complete.
INFO     2025-10-08 20:11:26,312 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
ERROR    2025-10-08 20:11:26,791 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: VLLMInferenceAdapter.list_provider_model_ids() failed with: <a
         href="https://oauth.akram.a1ey.p3.openshiftapps.com:443/oauth/authorize?approval_prompt=force&client_id=system%3Aserviceaccount%3Arhoai-30-genai%3Adefault&redirect_uri=ht
         tps%3A%2F%2Fvllm-rhoai-30-genai.apps.rosa.akram.a1ey.p3.openshiftapps.com%2Foauth%2Fcallback&response_type=code&scope=user%3Ainfo+user%3Acheck-access&state=8ef0cba3e1
         71a4f8b04cb445cfb91a4c%3A%2Fmodels">Found</a>.
``` | ||
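The log in a548169b99 above shows the model listing failing (the vLLM endpoint answers with an OAuth redirect) while the server still starts. A generic way to get that behaviour is to catch the listing failure, log it, and carry on. The sketch below illustrates the pattern only; `check_model_availability`, its `allow_listing_failure` flag, and `failing_listing` are hypothetical and do not reflect the adapter's actual signature.

```python
import asyncio
import logging

logger = logging.getLogger("model_check")


async def check_model_availability(list_model_ids, model_id: str, allow_listing_failure: bool) -> bool:
    """Return True if the model may be used, tolerating a failed listing when allowed."""
    try:
        available = await list_model_ids()
    except Exception as exc:
        if not allow_listing_failure:
            raise
        # Log and continue instead of aborting startup.
        logger.error("listing models failed with: %s", exc)
        return True  # assume the model exists; the first real request will tell
    return model_id in available


async def failing_listing():
    """Stand-in for an endpoint that redirects to a login page instead of answering."""
    raise RuntimeError("302 Found: redirected to OAuth login")


result = asyncio.run(
    check_model_availability(failing_listing, "vllm", allow_listing_failure=True)
)
```

The trade-off is that a misconfigured model name is no longer caught at startup; it only surfaces on the first inference request.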
|  | f50ce11a3b | feat(tests): make inference_recorder into api_recorder (include tool_invoke) (#3403) Renames `inference_recorder.py` to `api_recorder.py` and extends it to support recording/replaying tool invocations in addition to inference calls.
This allows us to record web-search, etc. tool calls and thereafter apply recordings for `tests/integration/responses`
## Test Plan
```
export OPENAI_API_KEY=...
export TAVILY_SEARCH_API_KEY=...
./scripts/integration-tests.sh --stack-config ci-tests \
  --suite responses --inference-mode record-if-missing
``` | ||
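The `record-if-missing` mode used in the test plan of f50ce11a3b above follows a common record/replay pattern: key each request by a hash of its contents, replay a stored response when one exists, and otherwise make the live call and store the result. The sketch below shows that pattern in general terms; it is not the `api_recorder.py` implementation, and `request_key`, `call_with_recording`, and `fake_web_search` are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def request_key(endpoint: str, payload: dict) -> str:
    """Hash the endpoint and payload into a stable file name."""
    canonical = json.dumps({"endpoint": endpoint, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def call_with_recording(endpoint: str, payload: dict, live_call, recordings_dir: Path) -> dict:
    """Replay a stored response if one exists; otherwise call live and store the result."""
    path = recordings_dir / f"{request_key(endpoint, payload)}.json"
    if path.exists():
        return json.loads(path.read_text())
    response = live_call(endpoint, payload)
    path.write_text(json.dumps(response))
    return response


live_calls = []


def fake_web_search(endpoint: str, payload: dict) -> dict:
    """Stand-in for a real tool invocation such as a web search."""
    live_calls.append(endpoint)
    return {"results": [f"hit for {payload['query']}"]}


with tempfile.TemporaryDirectory() as tmp:
    recordings = Path(tmp)
    # First call goes to the live function and is recorded.
    first = call_with_recording("tool_invoke/web_search", {"query": "llama"}, fake_web_search, recordings)
    # Second identical call is served from the recording.
    second = call_with_recording("tool_invoke/web_search", {"query": "llama"}, fake_web_search, recordings)
```

Keying on a canonical JSON dump with `sort_keys=True` means that two payloads differing only in key order map to the same recording.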
|  | 5d711d4bcb | fix: Update watsonx.ai provider to use LiteLLM mixin and list all models (#3674) # What does this PR do?
- The watsonx.ai provider now uses the LiteLLM mixin instead of using IBM's library, which does not seem to be working (see #3165 for context).
- The watsonx.ai provider now lists all the models available by calling the watsonx.ai server instead of having a hard-coded list of known models. (That list gets out of date quickly.)
- An edge case in [llama_stack/core/routers/inference.py](https://github.com/llamastack/llama-stack/pull/3674/files#diff-a34bc966ed9befd9f13d4883c23705dff49be0ad6211c850438cdda6113f3455) is addressed that was causing my manual tests to fail.
- Fixes `b64_encode_openai_embeddings_response`, which was trying to enumerate over a dictionary and then reference elements of the dictionary using `.field` instead of `["field"]`. That method is called by the LiteLLM mixin for embedding models, so it is needed to get the watsonx.ai embedding models to work.
- A unit test along the lines of the one in #3348 is added. A more comprehensive plan for automatically testing the end-to-end functionality for inference providers would be a good idea, but is out of scope for this PR.
- Updates to the watsonx distribution. Some were in response to the switch to LiteLLM (e.g., updating the Python packages needed). Others seem to be things that were already broken that I found along the way (e.g., a reference to a watsonx specific doc template that doesn't seem to exist).

Closes #3165
Also it is related to a line-item in #3387 but doesn't really address that goal (because it uses the LiteLLM mixin, not the OpenAI one). I tried the OpenAI one and it doesn't work with watsonx.ai, presumably because the watsonx.ai service is not OpenAI compatible. It works with LiteLLM because LiteLLM has a provider implementation for watsonx.ai.
## Test Plan
The test script below goes back and forth between the OpenAI and watsonx providers. The idea is that the OpenAI provider shows how it should work and then the watsonx provider output shows that it is also working with watsonx. Note that the result from the MCP test is not as good (the Llama 3.3 70b model does not choose tools as wisely as gpt-4o), but it is still working and providing a valid response. For more details on setup and the MCP server being used for testing, see [the AI Alliance sample notebook](https://github.com/The-AI-Alliance/llama-stack-examples/blob/main/notebooks/01-responses/) that these examples are drawn from.
```python
#!/usr/bin/env python3
import json
from llama_stack_client import LlamaStackClient
from litellm import completion
import http.client


def print_response(response):
    """Print response in a nicely formatted way"""
    print(f"ID: {response.id}")
    print(f"Status: {response.status}")
    print(f"Model: {response.model}")
    print(f"Created at: {response.created_at}")
    print(f"Output items: {len(response.output)}")

    for i, output_item in enumerate(response.output):
        if len(response.output) > 1:
            print(f"\n--- Output Item {i+1} ---")
        print(f"Output type: {output_item.type}")

        if output_item.type in ("text", "message"):
            print(f"Response content: {output_item.content[0].text}")
        elif output_item.type == "file_search_call":
            print(f" Tool Call ID: {output_item.id}")
            print(f" Tool Status: {output_item.status}")
            # 'queries' is a list, so we join it for clean printing
            print(f" Queries: {', '.join(output_item.queries)}")
            # Display results if they exist, otherwise note they are empty
            print(f" Results: {output_item.results if output_item.results else 'None'}")
        elif output_item.type == "mcp_list_tools":
            print_mcp_list_tools(output_item)
        elif output_item.type == "mcp_call":
            print_mcp_call(output_item)
        else:
            print(f"Response content: {output_item.content}")


def print_mcp_call(mcp_call):
    """Print MCP call in a nicely formatted way"""
    print(f"\n🛠️ MCP Tool Call: {mcp_call.name}")
    print(f" Server: {mcp_call.server_label}")
    print(f" ID: {mcp_call.id}")
    print(f" Arguments: {mcp_call.arguments}")

    if mcp_call.error:
        print(f"Error: {mcp_call.error}")
    elif mcp_call.output:
        print("Output:")
        # Try to format JSON output nicely
        try:
            parsed_output = json.loads(mcp_call.output)
            print(json.dumps(parsed_output, indent=4))
        except (json.JSONDecodeError, TypeError):
            # If not valid JSON, print as-is
            print(f" {mcp_call.output}")
    else:
        print(" ⏳ No output yet")


def print_mcp_list_tools(mcp_list_tools):
    """Print MCP list tools in a nicely formatted way"""
    print(f"\n🔧 MCP Server: {mcp_list_tools.server_label}")
    print(f" ID: {mcp_list_tools.id}")
    print(f" Available Tools: {len(mcp_list_tools.tools)}")
    print("=" * 80)

    for i, tool in enumerate(mcp_list_tools.tools, 1):
        print(f"\n{i}. {tool.name}")
        print(f" Description: {tool.description}")

        # Parse and display input schema
        schema = tool.input_schema
        if schema and 'properties' in schema:
            properties = schema['properties']
            required = schema.get('required', [])

            print(" Parameters:")
            for param_name, param_info in properties.items():
                param_type = param_info.get('type', 'unknown')
                param_desc = param_info.get('description', 'No description')
                required_marker = " (required)" if param_name in required else " (optional)"
                print(f" • {param_name} ({param_type}){required_marker}")
                if param_desc:
                    print(f" {param_desc}")

        if i < len(mcp_list_tools.tools):
            print("-" * 40)


def main():
    """Main function to run all the tests"""
    # Configuration
    LLAMA_STACK_URL = "http://localhost:8321/"
    LLAMA_STACK_MODEL_IDS = [
        "openai/gpt-3.5-turbo",
        "openai/gpt-4o",
        "llama-openai-compat/Llama-3.3-70B-Instruct",
        "watsonx/meta-llama/llama-3-3-70b-instruct"
    ]

    # Using gpt-4o for this demo, but feel free to try one of the others or add more to run.yaml.
    OPENAI_MODEL_ID = LLAMA_STACK_MODEL_IDS[1]
    WATSONX_MODEL_ID = LLAMA_STACK_MODEL_IDS[-1]
    NPS_MCP_URL = "http://localhost:3005/sse/"

    print("=== Llama Stack Testing Script ===")
    print(f"Using OpenAI model: {OPENAI_MODEL_ID}")
    print(f"Using WatsonX model: {WATSONX_MODEL_ID}")
    print(f"MCP URL: {NPS_MCP_URL}")
    print()

    # Initialize client
    print("Initializing LlamaStackClient...")
    client = LlamaStackClient(base_url="http://localhost:8321")

    # Test 1: List models
    print("\n=== Test 1: List Models ===")
    try:
        models = client.models.list()
        print(f"Found {len(models)} models")
    except Exception as e:
        print(f"Error listing models: {e}")
        raise

    # Test 2: Basic chat completion with OpenAI
    print("\n=== Test 2: Basic Chat Completion (OpenAI) ===")
    try:
        chat_completion_response = client.chat.completions.create(
            model=OPENAI_MODEL_ID,
            messages=[{"role": "user", "content": "What is the capital of France?"}]
        )
        print("OpenAI Response:")
        for chunk in chat_completion_response.choices[0].message.content:
            print(chunk, end="", flush=True)
        print()
    except Exception as e:
        print(f"Error with OpenAI chat completion: {e}")
        raise

    # Test 3: Basic chat completion with WatsonX
    print("\n=== Test 3: Basic Chat Completion (WatsonX) ===")
    try:
        chat_completion_response_wxai = client.chat.completions.create(
            model=WATSONX_MODEL_ID,
            messages=[{"role": "user", "content": "What is the capital of France?"}],
        )
        print("WatsonX Response:")
        for chunk in chat_completion_response_wxai.choices[0].message.content:
            print(chunk, end="", flush=True)
        print()
    except Exception as e:
        print(f"Error with WatsonX chat completion: {e}")
        raise

    # Test 4: Tool calling with OpenAI
    print("\n=== Test 4: Tool Calling (OpenAI) ===")
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "description": "Get the current weather for a specific location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g., San Francisco, CA",
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"]
                        },
                    },
                    "required": ["location"],
                },
            },
        }
    ]
    messages = [
        {"role": "user", "content": "What's the weather like in Boston, MA?"}
    ]

    try:
        print("--- Initial API Call ---")
        response = client.chat.completions.create(
            model=OPENAI_MODEL_ID,
            messages=messages,
            tools=tools,
            tool_choice="auto",  # "auto" is the default
        )
        print("OpenAI tool calling response received")
    except Exception as e:
        print(f"Error with OpenAI tool calling: {e}")
        raise

    # Test 5: Tool calling with WatsonX
    print("\n=== Test 5: Tool Calling (WatsonX) ===")
    try:
        wxai_response = client.chat.completions.create(
            model=WATSONX_MODEL_ID,
            messages=messages,
            tools=tools,
            tool_choice="auto",  # "auto" is the default
        )
        print("WatsonX tool calling response received")
    except Exception as e:
        print(f"Error with WatsonX tool calling: {e}")
        raise

    # Test 6: Streaming with WatsonX
    print("\n=== Test 6: Streaming Response (WatsonX) ===")
    try:
        chat_completion_response_wxai_stream = client.chat.completions.create(
            model=WATSONX_MODEL_ID,
            messages=[{"role": "user", "content": "What is the capital of France?"}],
            stream=True
        )
        print("Model response: ", end="")
        for chunk in chat_completion_response_wxai_stream:
            # Each 'chunk' is a ChatCompletionChunk object.
            # We want the content from the 'delta' attribute.
            if hasattr(chunk, 'choices') and chunk.choices is not None:
                content = chunk.choices[0].delta.content
                # The first few chunks might have None content, so we check for it.
                if content is not None:
                    print(content, end="", flush=True)
        print()
    except Exception as e:
        print(f"Error with streaming: {e}")
        raise

    # Test 7: MCP with OpenAI
    print("\n=== Test 7: MCP Integration (OpenAI) ===")
    try:
        mcp_llama_stack_client_response = client.responses.create(
            model=OPENAI_MODEL_ID,
            input="Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.",
            tools=[
                {
                    "type": "mcp",
                    "server_url": NPS_MCP_URL,
                    "server_label": "National Parks Service tools",
                    "allowed_tools": ["search_parks", "get_park_events"],
                }
            ]
        )
        print_response(mcp_llama_stack_client_response)
    except Exception as e:
        print(f"Error with MCP (OpenAI): {e}")
        raise

    # Test 8: MCP with WatsonX
    print("\n=== Test 8: MCP Integration (WatsonX) ===")
    try:
        mcp_llama_stack_client_response = client.responses.create(
            model=WATSONX_MODEL_ID,
            input="What is the capital of France?"
        )
        print_response(mcp_llama_stack_client_response)
    except Exception as e:
        print(f"Error with MCP (WatsonX): {e}")
        raise

    # Test 9: MCP with Llama 3.3
    print("\n=== Test 9: MCP Integration (Llama 3.3) ===")
    try:
        mcp_llama_stack_client_response = client.responses.create(
            model=WATSONX_MODEL_ID,
            input="Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them.",
            tools=[
                {
                    "type": "mcp",
                    "server_url": NPS_MCP_URL,
                    "server_label": "National Parks Service tools",
                    "allowed_tools": ["search_parks", "get_park_events"],
                }
            ]
        )
        print_response(mcp_llama_stack_client_response)
    except Exception as e:
        print(f"Error with MCP (Llama 3.3): {e}")
        raise

    # Test 10: Embeddings
    print("\n=== Test 10: Embeddings ===")
    try:
        conn = http.client.HTTPConnection("localhost:8321")
        payload = json.dumps({
            "model": "watsonx/ibm/granite-embedding-278m-multilingual",
            "input": "Hello, world!",
        })
        headers = {
            'Content-Type': 'application/json',
            'Accept': 'application/json'
        }
        conn.request("POST", "/v1/openai/v1/embeddings", payload, headers)
        res = conn.getresponse()
        data = res.read()
        print(data.decode("utf-8"))
    except Exception as e:
        print(f"Error with Embeddings: {e}")
        raise

    print("\n=== Testing Complete ===")


if __name__ == "__main__":
    main()
```

---------

Signed-off-by: Bill Murdock <bmurdock@redhat.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> | ||
|  | 1ac320b7e6 | chore: remove dead code (#3729)
# What does this PR do?
Removes dead code found by vulture; Claude was used to check that nothing references or imports it.
## Test Plan
CI | ||
|  | b6e9f41041 | chore: Revert "fix: fix nvidia provider (#3716)" (#3730) This reverts commit  | ||
|  | c940fe7938 | fix: fix nvidia provider (#3716)
# What does this PR do?
(Used claude to solve #3715, coded with claude but tested by me)
## From claude summary:
**Problem**: The `NVIDIAInferenceAdapter` class was missing the `alias_to_provider_id_map` attribute, which caused the error: `ERROR 'NVIDIAInferenceAdapter' object has no attribute 'alias_to_provider_id_map'`

**Root Cause**: The `NVIDIAInferenceAdapter` only inherited from `OpenAIMixin`, but some parts of the system expected it to have the `alias_to_provider_id_map` attribute, which is provided by the `ModelRegistryHelper` class.

**Solution**:
1. **Added ModelRegistryHelper import**: Imported the `ModelRegistryHelper` class from `llama_stack.providers.utils.inference.model_registry`
2. **Updated inheritance**: Changed the class declaration to inherit from both `OpenAIMixin` and `ModelRegistryHelper`
3. **Added proper initialization**: Added an `__init__` method that properly initializes the `ModelRegistryHelper` with empty model entries (since NVIDIA uses dynamic model discovery) and the allowed models from the configuration

**Key Changes**:
* Added `from llama_stack.providers.utils.inference.model_registry import ModelRegistryHelper`
* Changed class declaration from `class NVIDIAInferenceAdapter(OpenAIMixin):` to `class NVIDIAInferenceAdapter(OpenAIMixin, ModelRegistryHelper):`
* Added `__init__` method that calls `ModelRegistryHelper.__init__(self, model_entries=[], allowed_models=config.allowed_models)`

The inheritance order is important - `OpenAIMixin` comes first to ensure its `check_model_availability()` method takes precedence over the `ModelRegistryHelper` version, as mentioned in the class documentation.
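The precedence claim can be checked with a small sketch of Python's method resolution order. The classes below are hypothetical stand-ins (plain Python classes, not the real `OpenAIMixin` or `ModelRegistryHelper`, whose bodies are not shown in this PR); only the base-class order and the explicit `__init__` call with `model_entries=[]` mirror the description above.

```python
class OpenAIMixinSketch:
    """Hypothetical stand-in for OpenAIMixin."""

    def check_model_availability(self, model: str) -> str:
        return f"mixin:{model}"


class ModelRegistryHelperSketch:
    """Hypothetical stand-in for ModelRegistryHelper."""

    def __init__(self, model_entries=None, allowed_models=None):
        # The attribute whose absence caused the reported error.
        self.alias_to_provider_id_map = {}
        self.allowed_models = allowed_models

    def check_model_availability(self, model: str) -> str:
        return f"registry:{model}"


class AdapterSketch(OpenAIMixinSketch, ModelRegistryHelperSketch):
    """Lists the mixin first, so its methods win during attribute lookup."""

    def __init__(self, allowed_models=None):
        # Initialize the registry helper explicitly, with no static model entries.
        ModelRegistryHelperSketch.__init__(
            self, model_entries=[], allowed_models=allowed_models
        )


adapter = AdapterSketch(allowed_models=["meta/llama-3.3-70b-instruct"])
mro_names = [cls.__name__ for cls in AdapterSketch.__mro__]
print(mro_names)
print(adapter.check_model_availability("some-model"))
print(hasattr(adapter, "alias_to_provider_id_map"))
```

Swapping the two base classes in the `class AdapterSketch(...)` line would make the registry version of `check_model_availability` win instead, which is why the order matters.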
This fix ensures that the `NVIDIAInferenceAdapter` has the required `alias_to_provider_id_map` attribute while maintaining all existing functionality.
## Test Plan
Launching llama-stack server successfully, see logs:
```
NVIDIA_API_KEY=dummy NVIDIA_BASE_URL=http://localhost:8912 llama stack run /home/nvidia/.llama/distributions/starter/starter-run.yaml --image-type venv &
[2] 3753042
(venv) nvidia@nv-meta-H100-testing-gpu01:~/kai/llama-stack$
WARNING 2025-10-07 00:29:09,848 root:266 uncategorized: Unknown logging category: openai::conversations. Falling back to default 'root' level: 20
WARNING 2025-10-07 00:29:09,932 root:266 uncategorized: Unknown logging category: cli. Falling back to default 'root' level: 20
INFO 2025-10-07 00:29:09,937 llama_stack.core.utils.config_resolution:45 core: Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml
INFO 2025-10-07 00:29:09,937 llama_stack.cli.stack.run:136 cli: Using run configuration: /home/nvidia/.llama/distributions/starter/starter-run.yaml
Using virtual environment: /home/nvidia/kai/venv
Virtual environment already activated
+ '[' -n /home/nvidia/.llama/distributions/starter/starter-run.yaml ']'
+ yaml_config_arg=/home/nvidia/.llama/distributions/starter/starter-run.yaml
+ llama stack run /home/nvidia/.llama/distributions/starter/starter-run.yaml --port 8321
WARNING 2025-10-07 00:29:11,432 root:266 uncategorized: Unknown logging category: openai::conversations. Falling back to default 'root' level: 20
WARNING 2025-10-07 00:29:11,593 root:266 uncategorized: Unknown logging category: cli.
Falling back to default 'root' level: 20 INFO 2025-10-07 00:29:11,603 llama_stack.core.utils.config_resolution:45 core: Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml INFO 2025-10-07 00:29:11,604 llama_stack.cli.stack.run:136 cli: Using run configuration: /home/nvidia/.llama/distributions/starter/starter-run.yaml INFO 2025-10-07 00:29:11,624 llama_stack.cli.stack.run:155 cli: No image type or image name provided. Assuming environment packages. INFO 2025-10-07 00:29:11,625 llama_stack.core.utils.config_resolution:45 core: Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml INFO 2025-10-07 00:29:11,644 llama_stack.cli.stack.run:230 cli: HTTPS enabled with certificates: Key: None Cert: None INFO 2025-10-07 00:29:11,645 llama_stack.cli.stack.run:232 cli: Listening on ['::', '0.0.0.0']:8321 INFO 2025-10-07 00:29:11,816 llama_stack.core.utils.config_resolution:45 core: Using file path: /home/nvidia/.llama/distributions/starter/starter-run.yaml INFO 2025-10-07 00:29:11,836 llama_stack.core.server.server:480 core::server: Run configuration: INFO 2025-10-07 00:29:11,845 llama_stack.core.server.server:483 core::server: apis: - agents - batches - datasetio - eval - files - inference - post_training - safety - scoring - telemetry - tool_runtime - vector_io benchmarks: [] datasets: [] image_name: starter inference_store: db_path: /home/nvidia/.llama/distributions/starter/inference_store.db type: sqlite metadata_store: db_path: /home/nvidia/.llama/distributions/starter/registry.db type: sqlite models: [] providers: agents: - config: persistence_store: db_path: /home/nvidia/.llama/distributions/starter/agents_store.db type: sqlite responses_store: db_path: /home/nvidia/.llama/distributions/starter/responses_store.db type: sqlite provider_id: meta-reference provider_type: inline::meta-reference batches: - config: kvstore: db_path: /home/nvidia/.llama/distributions/starter/batches.db type: sqlite provider_id: reference 
provider_type: inline::reference datasetio: - config: kvstore: db_path: /home/nvidia/.llama/distributions/starter/huggingface_datasetio.db type: sqlite provider_id: huggingface provider_type: remote::huggingface - config: kvstore: db_path: /home/nvidia/.llama/distributions/starter/localfs_datasetio.db type: sqlite provider_id: localfs provider_type: inline::localfs eval: - config: kvstore: db_path: /home/nvidia/.llama/distributions/starter/meta_reference_eval.db type: sqlite provider_id: meta-reference provider_type: inline::meta-reference files: - config: metadata_store: db_path: /home/nvidia/.llama/distributions/starter/files_metadata.db type: sqlite storage_dir: /home/nvidia/.llama/distributions/starter/files provider_id: meta-reference-files provider_type: inline::localfs inference: - config: api_key: '********' url: https://api.fireworks.ai/inference/v1 provider_id: fireworks provider_type: remote::fireworks - config: api_key: '********' url: https://api.together.xyz/v1 provider_id: together provider_type: remote::together - config: {} provider_id: bedrock provider_type: remote::bedrock - config: api_key: '********' append_api_version: true url: http://localhost:8912 provider_id: nvidia provider_type: remote::nvidia - config: api_key: '********' base_url: https://api.openai.com/v1 provider_id: openai provider_type: remote::openai - config: api_key: '********' provider_id: anthropic provider_type: remote::anthropic - config: api_key: '********' provider_id: gemini provider_type: remote::gemini - config: api_key: '********' url: https://api.groq.com provider_id: groq provider_type: remote::groq - config: api_key: '********' url: https://api.sambanova.ai/v1 provider_id: sambanova provider_type: remote::sambanova - config: {} provider_id: sentence-transformers provider_type: inline::sentence-transformers post_training: - config: checkpoint_format: meta provider_id: torchtune-cpu provider_type: inline::torchtune-cpu safety: - config: excluded_categories: [] 
provider_id: llama-guard provider_type: inline::llama-guard - config: {} provider_id: code-scanner provider_type: inline::code-scanner scoring: - config: {} provider_id: basic provider_type: inline::basic - config: {} provider_id: llm-as-judge provider_type: inline::llm-as-judge - config: openai_api_key: '********' provider_id: braintrust provider_type: inline::braintrust telemetry: - config: service_name: "\u200B" sinks: sqlite sqlite_db_path: /home/nvidia/.llama/distributions/starter/trace_store.db provider_id: meta-reference provider_type: inline::meta-reference tool_runtime: - config: api_key: '********' max_results: 3 provider_id: brave-search provider_type: remote::brave-search - config: api_key: '********' max_results: 3 provider_id: tavily-search provider_type: remote::tavily-search - config: {} provider_id: rag-runtime provider_type: inline::rag-runtime - config: {} provider_id: model-context-protocol provider_type: remote::model-context-protocol vector_io: - config: kvstore: db_path: /home/nvidia/.llama/distributions/starter/faiss_store.db type: sqlite provider_id: faiss provider_type: inline::faiss - config: db_path: /home/nvidia/.llama/distributions/starter/sqlite_vec.db kvstore: db_path: /home/nvidia/.llama/distributions/starter/sqlite_vec_registry.db type: sqlite provider_id: sqlite-vec provider_type: inline::sqlite-vec scoring_fns: [] server: port: 8321 shields: [] tool_groups: - provider_id: tavily-search toolgroup_id: builtin::websearch - provider_id: rag-runtime toolgroup_id: builtin::rag vector_dbs: [] version: 2 INFO 2025-10-07 00:29:12,138 llama_stack.providers.remote.inference.nvidia.nvidia:49 inference::nvidia: Initializing NVIDIAInferenceAdapter(http://localhost:8912)... 
INFO 2025-10-07 00:29:12,921 llama_stack.providers.utils.inference.inference_store:74 inference: Write queue disabled for SQLite to avoid concurrency issues INFO 2025-10-07 00:29:13,524 llama_stack.providers.utils.responses.responses_store:96 openai_responses: Write queue disabled for SQLite to avoid concurrency issues ERROR 2025-10-07 00:29:13,679 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: FireworksInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:13,681 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider fireworks: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"}, or in the provider config. ERROR 2025-10-07 00:29:13,682 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: TogetherInferenceAdapter.list_provider_model_ids() failed with: Pass Together API Key in the header X-LlamaStack-Provider-Data as { "together_api_key": <your api key>} WARNING 2025-10-07 00:29:13,684 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider together: Pass Together API Key in the header X-LlamaStack-Provider-Data as { "together_api_key": <your api key>} Handling connection for 8912 INFO 2025-10-07 00:29:14,047 llama_stack.providers.utils.inference.openai_mixin:448 providers::utils: NVIDIAInferenceAdapter.list_provider_model_ids() returned 3 models ERROR 2025-10-07 00:29:14,062 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: OpenAIInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. 
x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,063 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider openai: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the provider config. ERROR 2025-10-07 00:29:14,099 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: AnthropicInferenceAdapter.list_provider_model_ids() failed with: "Could not resolve authentication method. Expected either api_key or auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly omitted" WARNING 2025-10-07 00:29:14,100 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider anthropic: "Could not resolve authentication method. Expected either api_key or auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly omitted" ERROR 2025-10-07 00:29:14,102 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: GeminiInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,103 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider gemini: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the provider config. ERROR 2025-10-07 00:29:14,105 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: GroqInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. 
x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,106 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider groq: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider config. ERROR 2025-10-07 00:29:14,107 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: SambaNovaInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,109 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider sambanova: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the provider config. INFO 2025-10-07 00:29:14,454 uvicorn.error:84 uncategorized: Started server process [3753046] INFO 2025-10-07 00:29:14,455 uvicorn.error:48 uncategorized: Waiting for application startup. INFO 2025-10-07 00:29:14,457 llama_stack.core.server.server:170 core::server: Starting up INFO 2025-10-07 00:29:14,458 llama_stack.core.stack:415 core: starting registry refresh task ERROR 2025-10-07 00:29:14,459 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: FireworksInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,461 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider fireworks: API key is not set. Please provide a valid API key in the provider data header, e.g. 
x-llamastack-provider-data: {"fireworks_api_key": "<API_KEY>"}, or in the provider config. ERROR 2025-10-07 00:29:14,462 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: TogetherInferenceAdapter.list_provider_model_ids() failed with: Pass Together API Key in the header X-LlamaStack-Provider-Data as { "together_api_key": <your api key>} WARNING 2025-10-07 00:29:14,463 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider together: Pass Together API Key in the header X-LlamaStack-Provider-Data as { "together_api_key": <your api key>} ERROR 2025-10-07 00:29:14,465 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: OpenAIInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,466 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider openai: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"openai_api_key": "<API_KEY>"}, or in the provider config. INFO 2025-10-07 00:29:14,500 uvicorn.error:62 uncategorized: Application startup complete. ERROR 2025-10-07 00:29:14,502 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: AnthropicInferenceAdapter.list_provider_model_ids() failed with: "Could not resolve authentication method. Expected either api_key or auth_token to be set. Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly omitted" WARNING 2025-10-07 00:29:14,503 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider anthropic: "Could not resolve authentication method. Expected either api_key or auth_token to be set. 
Or for one of the `X-Api-Key` or `Authorization` headers to be explicitly omitted" ERROR 2025-10-07 00:29:14,504 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: GeminiInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,506 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider gemini: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in the provider config. ERROR 2025-10-07 00:29:14,507 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: GroqInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,508 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider groq: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in the provider config. ERROR 2025-10-07 00:29:14,510 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils: SambaNovaInferenceAdapter.list_provider_model_ids() failed with: API key is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the provider config. WARNING 2025-10-07 00:29:14,511 llama_stack.core.routing_tables.models:36 core::routing_tables: Model refresh failed for provider sambanova: API key is not set. Please provide a valid API key in the provider data header, e.g. 
x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"}, or in the provider config. INFO 2025-10-07 00:29:14,513 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit) ``` tested with curl model, it also works: ``` curl http://localhost:8321/v1/models {"data":[{"identifier":"bedrock/meta.llama3-1-8b-instruct-v1:0","provider_resource_id":"meta.llama3-1-8b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"bedrock/meta.llama3-1-70b-instruct-v1:0","provider_resource_id":"meta.llama3-1-70b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"bedrock/meta.llama3-1-405b-instruct-v1:0","provider_resource_id":"meta.llama3-1-405b-instruct-v1:0","provider_id":"bedrock","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/bigcode/starcoder2-7b","provider_resource_id":"bigcode/starcoder2-7b","provider_id":"nvidia","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/meta/llama-3.3-70b-instruct","provider_resource_id":"meta/llama-3.3-70b-instruct","provider_id":"nvidia","type":"model","metadata":{},"model_type":"llm"},{"identifier":"nvidia/nvidia/llama-3.2-nv-embedqa-1b-v2","provider_resource_id":"nvidia/llama-3.2-nv-embedqa-1b-v2","provider_id":"nvidia","type":"model","metadata":{"embedding_dimension":2048,"context_length":8192},"model_type":"embedding"},{"identifier":"sentence-transformers/all-MiniLM-L6-v2","provider_resource_id":"all-MiniLM-L6-v2","provider_id":"sentence-transformers","type":"model","metadata":{"embedding_dimension":384},"model_type":"embedding"}]}% ``` --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> | ||
|  | c2d97a9db9 | chore: fix flaky unit test and add proper shutdown for file batches (#3725) # What does this PR do?
Have been running into flaky unit test failures:
 | ||
|  | e892a3f7f4 | feat: add refresh_models support to inference adapters (default: false) (#3719)
# What does this PR do?
Inference adapters can now configure `refresh_models: bool` to control periodic model listing from their providers.

BREAKING CHANGE: the together inference adapter default changed. Previously it always refreshed; now it follows the config.

Addresses "models: refresh" on #3517
## Test Plan
CI with new tests | ||
|  | 509ac4a659 | feat: enable Runpod inference adapter (#3707) 
		
# What does this PR do?
Sorry to @mattf, I thought I could close the other PR and reopen it, but I no longer had the option to reopen it. I just didn't want it to keep notifying maintainers while I made other commits for testing.

Continuation of: https://github.com/llamastack/llama-stack/pull/3641

This PR fixes the Runpod adapter: https://github.com/llamastack/llama-stack/issues/3517
## What I fixed from before:
1. Made it all OpenAI
2. Fixed the class up, since the OpenAIMixin had a couple of changes with the pydantic base model stuff.
3. Tested that we can dynamically find models and use the resulting identifier to make requests

```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```
## Test Plan
# RunPod Provider Quick Start
## Prerequisites
- Python 3.10+
- Git
- RunPod API token

## Setup for Development
```bash
# 1. Clone and enter the repository
cd (into the repo)

# 2. Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Remove any existing llama-stack installation
pip uninstall llama-stack llama-stack-client -y

# 4. Install llama-stack in development mode
pip install -e .

# 5. Build using local development code (Found this through the Discord)
LLAMA_STACK_DIR=. llama stack build

# When prompted during build:
# - Name: runpod-dev
# - Image type: venv
# - Inference provider: remote::runpod
# - Safety provider: "llama-guard"
# - Other providers: first defaults
```
## Configure the Stack
The RunPod adapter automatically discovers models from your endpoint via the `/v1/models` API. No manual model configuration is required - just set your environment variables.
## Run the Server
### Important: Use the Build-Created Virtual Environment
```bash
# Exit the development venv if you're in it
deactivate

# Activate the build-created venv (NOT .venv)
cd (llama-stack folder github repo)
source llamastack-runpod-dev/bin/activate
```
### For Qwen3-32B-AWQ Public Endpoint (Recommended)
```bash
# Set environment variables
export RUNPOD_URL="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1"
export RUNPOD_API_TOKEN="your_runpod_api_key"

# Start server
llama stack run ~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml
```
## Quick Test
### 1. List Available Models (Dynamic Discovery)
First, check which models are available on your RunPod endpoint:
```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```
**Example Response:**
```json
{
  "data": [
    {
      "identifier": "qwen3-32b-awq",
      "provider_resource_id": "Qwen/Qwen3-32B-AWQ",
      "provider_id": "runpod",
      "type": "model",
      "metadata": {},
      "model_type": "llm"
    }
  ]
}
```
**Note:** Use the `identifier` value from the response above in your requests below.
### 2. Chat Completion (Non-streaming)
Replace `qwen3-32b-awq` with your model identifier from step 1:
```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Hello, count to 3"}],
    "stream": false
  }'
```
### 3. Chat Completion (Streaming)
```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```
**Clean streaming output:**
```bash
curl -N -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-32b-awq", "messages": [{"role": "user", "content": "Count to 5"}], "stream": true}' \
  2>/dev/null | while read -r line; do
    echo "$line" | grep "^data: " | sed 's/^data: //' | jq -r '.choices[0].delta.content // empty' 2>/dev/null
done
```
**Expected Output:**
```
1 2 3 4 5
```
| ||
|  | bba9957edd | feat(api): Add vector store file batches api (#3642) # What does this PR do? Add Open AI Compatible vector store file batches api. This functionality is needed to attach many files to a vector store as a batch. https://github.com/llamastack/llama-stack/issues/3533 API Stubs have been merged https://github.com/llamastack/llama-stack/pull/3615 Adds persistence for file batches as discussed in diff https://github.com/llamastack/llama-stack/pull/3544 (Used claude code for generation and reviewed by me) ## Test Plan 1. Unit tests pass 2. Also verified the cc-vec integration with LLamaStackClient works with the file batches api. https://github.com/raghotham/cc-vec 2. Integration tests pass | ||
|  | 892ea759fa | chore: remove together inference adapter's custom check_model_availability (#3702) # What does this PR do? remove Together inference adapter's check_model_availability impl, rely on standard impl instead ## Test Plan ci | ||
|  | de9940c697 | chore: disable openai_embeddings on inference=remote::llama-openai-compat (#3704) # What does this PR do? api.llama.com does not provide embedding models, this makes that clear ## Test Plan ci | ||
|  | ae74b31ae3 | chore: remove vLLM inference adapter's custom list_models (#3703) # What does this PR do? remove vLLM inference adapter's custom list_models impl, rely on standard impl instead ## Test Plan ci | ||
|  | d23ed26238 | chore: turn OpenAIMixin into a pydantic.BaseModel (#3671) # What does this PR do? - implement get_api_key instead of relying on LiteLLMOpenAIMixin.get_api_key - remove use of LiteLLMOpenAIMixin - add default initialize/shutdown methods to OpenAIMixin - remove __init__s to allow proper pydantic construction - remove dead code from vllm adapter and associated / duplicate unit tests - update vllm adapter to use openaimixin for model registration - remove ModelRegistryHelper from fireworks & together adapters - remove Inference from nvidia adapter - complete type hints on embedding_model_metadata - allow extra fields on OpenAIMixin, for model_store, __provider_id__, etc - new recordings for ollama - enhance the list models error handling - update cerebras (remove cerebras-cloud-sdk) and anthropic (custom model listing) inference adapters - parametrized test_inference_client_caching - remove cerebras, databricks, fireworks, together from blanket mypy exclude - removed unnecessary litellm deps ## Test Plan ci | ||
|  | 724dac498c | chore: give OpenAIMixin subclasses a chance to list models without leaking _model_cache details (#3682) # What does this PR do? close the _model_cache abstraction leak ## Test Plan ci w/ new tests | ||
|  | 351c4b98e4 | chore: inference=remote::llama-openai-compat does not support /v1/completion (#3683) ## What does this PR do? skip completion tests for inference=remote::llama-openai-compat ## Test Plan ci | ||
|  | ce77c27ff8 | chore: use remoteinferenceproviderconfig for remote inference providers (#3668) # What does this PR do? on the path to maintainable impls of inference providers. make all configs instances of RemoteInferenceProviderConfig. ## Test Plan ci | ||
|  | d266c59c2a | chore: remove deprecated inference.chat_completion implementations (#3654) # What does this PR do? remove unused chat_completion implementations vllm features ported - - requires max_tokens be set, use config value - set tool_choice to none if no tools provided ## Test Plan ci | ||
|  | bcdbb53be3 | feat: implement keyword and hybrid search for Weaviate provider (#3264) # What does this PR do? - This PR implements keyword and hybrid search for Weaviate DB based on its inbuilt functions. - Added fixtures to conftest.py for Weaviate. - Enabled integration tests for remote Weaviate on all 3 search modes. Closes #3010 ## Test Plan Unit tests and integration tests should pass on this PR. | ||
|  | 0a41c4ead0 | chore: OpenAIMixin implements ModelsProtocolPrivate (#3662) # What does this PR do? add ModelsProtocolPrivate methods to OpenAIMixin this will allow providers using OpenAIMixin to use a common interface ## Test Plan ci w/ new tests | ||
|  | ef0736527d | feat(tools)!: substantial clean up of "Tool" related datatypes (#3627) This is a sweeping change to clean up some gunk around our "Tool" definitions. First, we had two types `Tool` and `ToolDef`. The first of these was a "Resource" type for the registry but we had stopped registering tools inside the Registry long back (and only registered ToolGroups.) The latter was for specifying tools for the Agents API. This PR removes the former and adds an optional `toolgroup_id` field to the latter. Secondly, as pointed out by @bbrowning in https://github.com/llamastack/llama-stack/pull/3003#issuecomment-3245270132, we were doing a lossy conversion from a full JSON schema from the MCP tool specification into our ToolDefinition to send it to the model. There is no necessity to do this -- we ourselves aren't doing any execution at all but merely passing it to the chat completions API which supports this. By doing this (and by doing it poorly), we encountered limitations like not supporting array items, or not resolving $refs, etc. To fix this, we replaced the `parameters` field by `{ input_schema, output_schema }` which can be full blown JSON schemas. Finally, there were some types in our llama-related chat format conversion which needed some cleanup. We are taking this opportunity to clean those up. This PR is a substantial breaking change to the API. However, given our window for introducing breaking changes, this suits us just fine. I will be landing a concurrent `llama-stack-client` change as well since API shapes are changing. | ||
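The schema change described in this commit can be illustrated with a minimal sketch. `ToolDefSketch` and `to_chat_completions_tool` are hypothetical names used for illustration only, not the project's actual classes; only the field names `input_schema`, `output_schema`, and `toolgroup_id` come from the commit message. The example schema uses array `items` and a `$ref`, the two features the message says the old lossy conversion could not carry.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolDefSketch:
    """Hypothetical stand-in for a tool definition that carries full JSON schemas."""

    name: str
    description: str = ""
    input_schema: Optional[dict[str, Any]] = None
    output_schema: Optional[dict[str, Any]] = None
    toolgroup_id: Optional[str] = None


def to_chat_completions_tool(tool: ToolDefSketch) -> dict[str, Any]:
    """Build an OpenAI-style chat completions tool entry.

    The input schema is passed through untouched rather than being
    flattened into per-parameter entries.
    """
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": tool.input_schema or {"type": "object", "properties": {}},
        },
    }


# A schema with array items and a $ref, as an MCP tool might declare it.
schema = {
    "type": "object",
    "properties": {
        "points": {"type": "array", "items": {"$ref": "#/$defs/Point"}},
    },
    "required": ["points"],
    "$defs": {
        "Point": {
            "type": "object",
            "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
        }
    },
}

tool = ToolDefSketch(name="plot_points", description="Plot 2D points", input_schema=schema)
payload = to_chat_completions_tool(tool)
```

Because the schema is forwarded as-is, nested constructs survive the trip to the model; nothing in this sketch resolves or validates the `$ref`.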
|  | f7c5ef4ec0 | chore: remove /v1/inference/completion and implementations (#3622) # What does this PR do? the /inference/completion route is gone. this removes the implementations. ## Test Plan ci | ||
|  | 606f4cf281 | fix(expires_after): make sure multipart/form-data is properly parsed (#3612) https://github.com/llamastack/llama-stack/pull/3604 broke multipart form data field parsing for the Files API since it changed its shape -- so as to match the API exactly to the OpenAI spec even in the generated client code. The underlying reason is that multipart/form-data cannot transport structured nested fields. Each field must be str-serialized. The client (specifically the OpenAI client whose behavior we must match), transports sub-fields as `expires_after[anchor]` and `expires_after[seconds]`, etc. We must be able to handle these fields somehow on the server without compromising the shape of the YAML spec. This PR "fixes" this by adding a dependency to convert the data. The main trade-off here is that we must add this `Depends()` annotation on every provider implementation for Files. This is a headache, but a much more reasonable one (in my opinion) given the alternatives. ## Test Plan Tests as shown in https://github.com/llamastack/llama-stack/pull/3604#issuecomment-3351090653 pass. | ||
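The bracketed field names described in this commit message can be folded back into a nested structure on the server. The following is a minimal sketch of that idea under stated assumptions, not the dependency the PR adds: `parse_bracketed_fields` is a hypothetical helper, and the flat dict stands in for whatever the web framework returns after parsing a multipart body.

```python
import re

# Matches keys such as "expires_after[anchor]": a parent name followed by
# exactly one bracketed sub-field.
_BRACKETED = re.compile(r"^(?P<parent>[^\[\]]+)\[(?P<child>[^\[\]]+)\]$")


def parse_bracketed_fields(form: dict[str, str]) -> dict[str, object]:
    """Fold flat multipart fields like 'a[b]' into {'a': {'b': ...}}.

    Hypothetical helper for illustration; handles one level of nesting only.
    """
    result: dict[str, object] = {}
    for key, value in form.items():
        match = _BRACKETED.match(key)
        if match is None:
            # Plain field: keep it as a top-level string value.
            result[key] = value
            continue
        nested = result.setdefault(match["parent"], {})
        if not isinstance(nested, dict):
            raise ValueError(f"field {match['parent']!r} is both scalar and nested")
        nested[match["child"]] = value
    return result


# Flat fields as a multipart client would send them (all values are strings).
form = {
    "purpose": "assistants",
    "expires_after[anchor]": "created_at",
    "expires_after[seconds]": "3600",
}
parsed = parse_bracketed_fields(form)
print(parsed)
```

Values stay strings here (`"3600"`, not `3600`) because multipart fields are string-serialized; converting them to typed fields would be a separate validation step.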
|  | cb33f45c11 | chore: unpublish /inference/chat-completion (#3609) # What does this PR do?
BREAKING CHANGE: removes /inference/chat-completion route and updates
relevant documentation
## Test Plan
🤷 | ||
|  | 3a09f00cdb | feat(files): fix expires_after API shape (#3604) This was just quite incorrect. See source here: https://platform.openai.com/docs/api-reference/files/create | ||
|  | e9eb004bf8 | fix: remove inference.completion from docs (#3589) # What does this PR do? now that /v1/inference/completion has been removed, no docs should refer to it this cleans up remaining references ## Test Plan ci Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com> | ||
|  | 975ead1d6a | chore(api): remove deprecated embeddings impls (#3301) # What does this PR do? remove deprecated embeddings implementations | ||
|  | 0d94f3e2c0 | chore: recordings for fireworks (inference + openai) (#3573) # What does this PR do? recorded for: ./scripts/integration-tests.sh --stack-config server:ci-tests --suite base --setup fireworks --subdirs inference --pattern openai ## Test Plan ./scripts/integration-tests.sh --stack-config server:ci-tests --suite base --setup fireworks --subdirs inference --pattern openai | ||
|  | 60484c5c4e | chore(api): remove batch inference (#3261) # What does this PR do? APIs removed: - POST /v1/batch-inference/completion - POST /v1/batch-inference/chat-completion - POST /v1/inference/batch-completion - POST /v1/inference/batch-chat-completion note - - batch-completion & batch-chat-completion were only implemented for inference=inline::meta-reference - batch-inference were not implemented | ||
|  | b48d5cfed7 | feat(internal): add image_url download feature to OpenAIMixin (#3516) # What does this PR do? simplify Ollama inference adapter by - - moving image_url download code to OpenAIMixin - being a ModelRegistryHelper instead of having one (mypy blocks check_model_availability method assignment) ## Test Plan - add unit tests for new download feature - add integration tests for openai_chat_completion w/ image_url (close test gap) | ||
|  | da5ea107fc | fix: ensure ModelRegistryHelper init for together and fireworks (#3572) # What does this PR do?
address -
```
ERROR    2025-09-26 10:44:29,450 main:527 core::server: Error creating app: 'FireworksInferenceAdapter' object has no attribute
         'alias_to_provider_id_map'
```
## Test Plan
manual startup w/ valid together & fireworks api keys | ||
|  | 926c3ada41 | chore: prune mypy exclude list (#3561) # What does this PR do? prune the mypy exclude list, build a stronger foundation for quality code ## Test Plan ci | ||
|  | 65e01b5684 | feat: together now supports base64 embedding encoding (#3559) # What does this PR do? use together's new base64 support ## Test Plan recordings for: ./scripts/integration-tests.sh --stack-config server:ci-tests --suite base --setup together --subdirs inference --pattern openai | ||
|  | b67aef2fc4 | feat: add static embedding metadata to dynamic model listings for providers using OpenAIMixin (#3547) # What does this PR do? - remove auto-download of ollama embedding models - add embedding model metadata to dynamic listing w/ unit test - add support and tests for allowed_models - removed inference provider models.py files where dynamic listing is enabled - store embedding metadata in embedding_model_metadata field on inference providers - make model_entries optional on ModelRegistryHelper and LiteLLMOpenAIMixin - make OpenAIMixin a ModelRegistryHelper - skip base64 embedding test for remote::ollama, always returns floats - only use OpenAI client for ollama model listing - remove unused build_model_entry function - remove unused get_huggingface_repo function ## Test Plan ci w/ new tests | ||
|  | ce7a3b4dff | feat: update Cerebras inference provider to support dynamic model listing (#3481) # What does this PR do? - update Cerebras to use OpenAIMixin - enable openai completions tests - enable openai chat completions tests - disable with n > 1 tests - add recording for --setup cerebras --subdirs inference --pattern openai ## Test Plan `./scripts/integration-tests.sh --stack-config server:ci-tests --setup cerebras --subdirs inference --pattern openai` ``` tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[txt=cerebras/llama-3.3-70b-inference:completion:sanity] instantiating llama_stack_client Port 8321 is already in use, assuming server is already running... llama_stack_client instantiated in 0.053s PASSED [ 2%] tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming_suffix[txt=cerebras/llama-3.3-70b-inference:completion:suffix] SKIPPED (Suffix is not supported for the model: cerebras/llama-3.3-70b.) [ 4%] tests/integration/inference/test_openai_completion.py::test_openai_completion_streaming[txt=cerebras/llama-3.3-70b-inference:completion:sanity] PASSED [ 6%] tests/integration/inference/test_openai_completion.py::test_openai_completion_prompt_logprobs[txt=cerebras/llama-3.3-70b-1] SKIPPED (Model cerebras/llama-3.3-70b hosted by remote::cerebras doesn't support vllm extra_body parameters.) [ 8%] tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=cerebras/llama-3.3-70b] SKIPPED (Model cerebras/llama-3.3-70b hosted by remote::cerebras doesn't support vllm extra_body parameters.) 
[ 10%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=cerebras/llama-3.3-70b-inference:chat_completion:non_streaming_01] PASSED [ 12%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_01] PASSED [ 14%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_01] SKIPPED (Model cerebras/llama-3.3-70b hosted by remote::cere...) [ 17%] tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=cerebras/llama-3.3-70b-True] PASSED [ 19%] tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=cerebras/llama-3.3-70b-True] PASSED [ 21%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming_with_file[txt=cerebras/llama-3.3-70b] SKIPPED (Model cerebras/llama-3.3-70b hosted by remote::cerebras doesn't support chat completion calls wit...) 
[ 23%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 25%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_multiple_strings[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 27%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_float[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 29%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_dimensions[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 31%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_user_parameter[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 34%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_empty_list_error[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 36%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_invalid_model_error[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 38%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_different_inputs_different_outputs[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 40%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_base64[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 42%] 
tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_base64_batch_processing[openai_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 44%] tests/integration/inference/test_openai_completion.py::test_openai_completion_prompt_logprobs[txt=cerebras/llama-3.3-70b-0] SKIPPED (Model cerebras/llama-3.3-70b hosted by remote::cerebras doesn't support vllm extra_body parameters.) [ 46%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=cerebras/llama-3.3-70b-inference:chat_completion:non_streaming_02] PASSED [ 48%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_02] PASSED [ 51%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_02] SKIPPED (Model cerebras/llama-3.3-70b hosted by remote::cere...) 
[ 53%] tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=cerebras/llama-3.3-70b-False] PASSED [ 55%] tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=cerebras/llama-3.3-70b-False] PASSED [ 57%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 59%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_multiple_strings[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 61%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_float[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 63%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_dimensions[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 65%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_user_parameter[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 68%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_empty_list_error[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 70%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_invalid_model_error[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 72%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_different_inputs_different_outputs[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED 
(embedding_model_id empty - skipping test) [ 74%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_base64[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 76%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_base64_batch_processing[llama_stack_client-cerebras/llama-3.3-70b-None-None-None-384] SKIPPED (embedding_model_id empty - skipping test) [ 78%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=cerebras/llama-3.3-70b-inference:chat_completion:non_streaming_01] PASSED [ 80%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_01] PASSED [ 82%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_01] SKIPPED (Model cerebras/llama-3.3-70b hosted by remote:...) 
[ 85%] tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=cerebras/llama-3.3-70b-True] PASSED [ 87%] tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=cerebras/llama-3.3-70b-True] PASSED [ 89%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=cerebras/llama-3.3-70b-inference:chat_completion:non_streaming_02] PASSED [ 91%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_02] PASSED [ 93%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_02] SKIPPED (Model cerebras/llama-3.3-70b hosted by remote:...) [ 95%] tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=cerebras/llama-3.3-70b-False] PASSED [ 97%] tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=cerebras/llama-3.3-70b-False] PASSED [100%] =================================================================================================================== slowest 10 durations ==================================================================================================================== 0.37s call tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=cerebras/llama-3.3-70b-inference:chat_completion:non_streaming_01] 0.34s call tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=cerebras/llama-3.3-70b-False] 0.18s call tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=cerebras/llama-3.3-70b-True] 0.17s setup 
tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[txt=cerebras/llama-3.3-70b-inference:completion:sanity] 0.15s call tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=cerebras/llama-3.3-70b-True] 0.13s call tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=cerebras/llama-3.3-70b-True] 0.12s call tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=cerebras/llama-3.3-70b-False] 0.12s call tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=cerebras/llama-3.3-70b-True] 0.12s call tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=cerebras/llama-3.3-70b-False] 0.08s call tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=cerebras/llama-3.3-70b-inference:chat_completion:streaming_02] ================================================================================================================== short test summary info ================================================================================================================== SKIPPED [1] tests/integration/inference/test_openai_completion.py:75: Suffix is not supported for the model: cerebras/llama-3.3-70b. SKIPPED [3] tests/integration/inference/test_openai_completion.py:123: Model cerebras/llama-3.3-70b hosted by remote::cerebras doesn't support vllm extra_body parameters. SKIPPED [4] tests/integration/inference/test_openai_completion.py:103: Model cerebras/llama-3.3-70b hosted by remote::cerebras doesn't support n param. SKIPPED [1] tests/integration/inference/test_openai_completion.py:129: Model cerebras/llama-3.3-70b hosted by remote::cerebras doesn't support chat completion calls with base64 encoded files. 
SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:90: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:112: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:136: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:154: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:175: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:195: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:206: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:217: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:244: embedding_model_id empty - skipping test SKIPPED [2] tests/integration/inference/test_openai_embeddings.py:278: embedding_model_id empty - skipping test ================================================================================================= 18 passed, 29 skipped, 50 deselected, 4 warnings in 3.02s ================================================================================================= ``` | ||
|  | d07ebce4d9 | feat: (re-)enable Databricks inference adapter (#3500) # What does this PR do? add/enable the Databricks inference adapter Databricks inference adapter was broken, closes #3486 - remove deprecated completion / chat_completion endpoints - enable dynamic model listing w/o refresh, listing is not async - use SecretStr instead of str for token - backward incompatible change: for consistency with databricks docs, env DATABRICKS_URL -> DATABRICKS_HOST and DATABRICKS_API_TOKEN -> DATABRICKS_TOKEN - databricks urls are custom per user/org, add special recorder handling for databricks urls - add integration test --setup databricks - enable chat completions tests - enable embeddings tests - disable n > 1 tests - disable embeddings base64 tests - disable embeddings dimensions tests note: reasoning models, e.g. gpt oss, fail because databricks has a custom, incompatible response format ## Test Plan ci and ``` ./scripts/integration-tests.sh --stack-config server:ci-tests --setup databricks --subdirs inference --pattern openai ``` note: databricks needs to be manually added to the ci-tests distro for replay testing | ||
|  | 2be869b3ef | fix(dev): fix vllm inference recording (await models.list) (#3524) # What does this PR do? fix inference recording for vLLM closes #3523 ## Test Plan ``` $ ./scripts/integration-tests.sh --stack-config server:ci-tests --setup vllm --subdirs inference --inference-mode record --pattern test_text_chat_completion_non_streaming === Llama Stack Integration Test Runner === Stack Config: server:ci-tests Setup: vllm Inference Mode: record Test Suite: base Test Subdirs: inference Test Pattern: test_text_chat_completion_non_streaming ... === Applying Setup Environment Variables === Setting up environment variables: export VLLM_URL='http://localhost:8000/v1' === Starting Llama Stack Server === Waiting for Llama Stack Server to start... ✅ Llama Stack Server started successfully === Running Integration Tests === Test subdirs to run: inference Added test files from inference: 6 files === Running all collected tests in a single pytest command === Total test files: 6 + pytest -s -v tests/integration/inference/test_openai_completion.py tests/integration/inference/test_batch_inference.py tests/integration/inference/test_openai_embeddings.py tests/integration/inference/test_text_inference.py tests/integration/inference/test_vision_inference.py tests/integration/inference/test_embedding.py --stack-config=server:ci-tests --inference-mode=record -k 'not( builtin_tool or safety_with_image or code_interpreter or test_rag or test_inference_store_tool_calls ) and test_text_chat_completion_non_streaming' --setup=vllm --color=yes --capture=tee-sys INFO 2025-09-23 10:35:36,662 tests.integration.conftest:86 tests: Applying setup 'vllm' ======================================================= test session starts ======================================================= platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0 -- .../.venv/bin/python3 cachedir: .pytest_cache metadata: {'Python': '3.12.11', 'Platform': 'Linux-6.16.7-200.fc42.x86_64-x86_64-with-glibc2.41', 
'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'html': '4.1.1', 'anyio': '4.9.0', 'timeout': '2.4.0', 'cov': '6.2.1', 'asyncio': '1.1.0', 'nbval': '0.11.0', 'socket': '0.7.0', 'json-report': '1.5.0', 'metadata': '3.1.1'}} rootdir: ... configfile: pyproject.toml plugins: html-4.1.1, anyio-4.9.0, timeout-2.4.0, cov-6.2.1, asyncio-1.1.0, nbval-0.11.0, socket-0.7.0, json-report-1.5.0, metadata-3.1.1 asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function collected 97 items / 95 deselected / 2 selected tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01] instantiating llama_stack_client Port 8321 is already in use, assuming server is already running... llama_stack_client instantiated in 0.044s PASSED [ 50%] tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_02] PASSED [100%] ====================================================== slowest 10 durations ======================================================= 1.62s call tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_02] 0.93s call tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01] 0.62s setup tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01] (3 durations < 0.005s hidden. Use -vv to show these durations.) ========================================== 2 passed, 95 deselected, 6 warnings in 3.26s =========================================== + exit_code=0 + set +x ✅ All tests completed successfully ``` ``` $ git status ... 
Untracked files: (use "git add <file>..." to include in what will be committed) tests/integration/recordings/responses/032f8c5a1289.json tests/integration/recordings/responses/c42baf6a3700.json tests/integration/recordings/responses/models-bd032f995f2a-fb68f5a6.json ... ``` | ||
|  | 8d8261961e | chore: Refactor fireworks to use OpenAIMixin (#3480) 
		
# What does this PR do? Refactor Fireworks to use OpenAIMixin Closes https://github.com/llamastack/llama-stack/issues/3391 Related to https://github.com/llamastack/llama-stack/issues/3387 ## Test Plan ``` (llama-stack) (base) swapna942@swapna942-mac llama-stack % FIREWORKS_API_KEY=**** ./scripts/integration-tests.sh --stack-config server:ci-tests --setup fireworks --subdirs inference --pattern openai tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] instantiating llama_stack_client Port 8321 is already in use, assuming server is already running... 
llama_stack_client instantiated in 0.031s PASSED [ 2%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_multiple_strings[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 4%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_float[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 6%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_dimensions[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 8%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_user_parameter[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] SKIPPED [ 10%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_empty_list_error[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 12%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_invalid_model_error[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 14%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_different_inputs_different_outputs[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 17%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_base64[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] SKIPPED [ 19%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_base64_batch_processing[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] SKIPPED [ 21%] tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming[txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:completion:sanity] PASSED [ 23%] tests/integration/inference/test_openai_completion.py::test_openai_completion_non_streaming_suffix[txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:completion:suffix] SKIPPED [ 25%] 
tests/integration/inference/test_openai_completion.py::test_openai_completion_streaming[txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:completion:sanity] PASSED [ 27%] tests/integration/inference/test_openai_completion.py::test_openai_completion_prompt_logprobs[txt=accounts/fireworks/models/llama-v3p1-8b-instruct-1] SKIPPED [ 29%] tests/integration/inference/test_openai_completion.py::test_openai_completion_guided_choice[txt=accounts/fireworks/models/llama-v3p1-8b-instruct] SKIPPED [ 31%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:non_streaming_01] PASSED [ 34%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:streaming_01] PASSED [ 36%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:streaming_01] PASSED [ 38%] tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-True] PASSED [ 40%] tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-True] PASSED [ 42%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming_with_file[txt=accounts/fireworks/models/llama-v3p1-8b-instruct] SKIPPED [ 44%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 46%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_multiple_strings[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 
48%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_float[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 51%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_dimensions[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 53%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_user_parameter[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] SKIPPED [ 55%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_empty_list_error[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 57%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_invalid_model_error[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 59%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_different_inputs_different_outputs[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] PASSED [ 61%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_encoding_format_base64[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] SKIPPED [ 63%] tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_base64_batch_processing[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] SKIPPED [ 65%] tests/integration/inference/test_openai_completion.py::test_openai_completion_prompt_logprobs[txt=accounts/fireworks/models/llama-v3p1-8b-instruct-0] SKIPPED [ 68%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:non_streaming_02] PASSED [ 70%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:streaming_02] PASSED [ 72%] 
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:streaming_02] PASSED [ 74%] tests/integration/inference/test_openai_completion.py::test_inference_store[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-False] PASSED [ 76%] tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-False] PASSED [ 78%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:non_streaming_01] PASSED [ 80%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:streaming_01] PASSED [ 82%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:streaming_01] PASSED [ 85%] tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-True] PASSED [ 87%] tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-True] PASSED [ 89%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:non_streaming_02] PASSED [ 91%] 
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:streaming_02] PASSED [ 93%] tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_streaming_with_n[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:streaming_02] PASSED [ 95%] tests/integration/inference/test_openai_completion.py::test_inference_store[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-False] PASSED [ 97%] tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-False] PASSED [100%] ========================================== slowest 10 durations ========================================== 30.01s teardown tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_multiple_strings[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] 30.01s teardown tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[client_with_models-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-False] 30.01s teardown tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_different_inputs_different_outputs[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] 30.01s teardown tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_with_user_parameter[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] 30.01s teardown tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-True] 30.01s teardown tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_different_inputs_different_outputs[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] 30.01s teardown 
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=accounts/fireworks/models/llama-v3p1-8b-instruct-inference:chat_completion:non_streaming_02] 30.01s teardown tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_single_string[llama_stack_client-emb=nomic-ai/nomic-embed-text-v1.5] 30.01s teardown tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_base64_batch_processing[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] 30.01s teardown tests/integration/inference/test_openai_embeddings.py::test_openai_embeddings_invalid_model_error[openai_client-emb=nomic-ai/nomic-embed-text-v1.5] ================= 36 passed, 11 skipped, 50 deselected, 4 warnings in 1429.05s (0:23:49) ================= + exit_code=0 + set +x ✅ All tests completed successfully ``` |