Replace deprecated `post /v1/models` with `get /v1/models` in the headline
example to fix Stainless Endpoint/NotFound error.
Signed-off-by: Sébastien Han <seb@redhat.com>
Filter out deprecated endpoints from the combined OpenAPI spec and remove
their references from the Stainless config to fix Endpoint/NotFound
errors.
Signed-off-by: Sébastien Han <seb@redhat.com>
The error is that Query default values can't be set in Annotated; they
must be set with = in the function signature.
See the error:
```
The error is that Query default values can't be set in Annotated; they
must be set with = in the function signature. Searching for where
include_embeddings is defined:
```
Signed-off-by: Sébastien Han <seb@redhat.com>
Added _add_titles_to_unions() to:
Recursively scan all schemas for anyOf/oneOf unions
Generate descriptive titles from the union members
Add those titles to help code generators infer names
Signed-off-by: Sébastien Han <seb@redhat.com>
The _filter_combined_schema function was excluding deprecated
operations. I updated it to include all operations (deprecated and
non-deprecated) for the combined/stainless spec, so these deprecated
endpoints are now included.
Signed-off-by: Sébastien Han <seb@redhat.com>
_filter_combined_schema was using path-level filtering with
_is_path_deprecated, which excluded entire paths if any operation was
deprecated. Since /v1/toolgroups has both GET (not deprecated) and POST
(deprecated), the entire path was excluded, removing the GET operation
and its response schema. Updated _filter_combined_schema to use
operation-level filtering, matching _filter_schema_by_version
Signed-off-by: Sébastien Han <seb@redhat.com>
Removes the need for the strong_typing and pyopenapi packages and purely
use Pydantic for schema generation.
Our generator now purely relies on Pydantic and FastAPI, it is available
at `scripts/fastapi_generator.py`, you can run it like so:
```
uv run ./scripts/run_openapi_generator.sh
```
The generator will:
* Generate the deprecated, experimental, stable and combined specs
* Validate all the spec it generates against OpenAPI standards
A few changes in the schema required for oasdiff some updates so I've
made the following ignore rules. The new Pydantic-based generator is
likely more correct and follows OpenAPI standards better than the old
pyopenapi generator. Instead of trying to make the new generator match
the old one's quirks, we should focus on what's actually correct
according to OpenAPI standards.
These are non-critical changes:
* response-property-became-nullable: Backward compatible:
existing non-null values still work, now also accepts null
* response-required-property-removed: oasdiff reports a false
positive because it doesn't resolve $refs inside anyOf; we could use
tool like 'redocly' to flatten the schema to a single file.
* response-property-type-changed: properties are still object
types, but oasdiff doesn't resolve $refs, so it flags the missing
inline type: object even though the referenced schemas define type:
object
* request-property-one-of-removed: These are false positives
caused by schema restructuring (wrapping in anyOf for nullability,
using -Input variants, or simplifying nested oneOf structures)
that don't change the actual API contract - the same data types are
still accepted, just represented differently in the schema.
* request-parameter-enum-value-removed: These are false
positives caused by oasdiff not resolving $refs - the enum values
(asc, desc, assistants, batch) are still present in the referenced
schemas (Order and OpenAIFilePurpose), just represented via schema
references instead of inline enums.
* request-property-enum-value-removed: this is a false positive caused
by oasdiff not resolving $refs - the enum values (llm, embedding,
rerank) are still present in the referenced ModelType schema,
just represented via schema reference instead of inline enums.
* request-property-type-changed: These are schema quality issues
where type information is missing (due to Any fallback in dynamic
model creation), but the API contract remains unchanged -
properties still exist with correct names and defaults, so the same
requests will work.
* response-body-type-changed: These are false positives caused
by schema representation changes (from inferred/empty types to
explicit $ref schemas, or vice versa) - the actual response types
an API contract remain unchanged, just how they're represented in the
OpenAPI spec.
* response-media-type-removed: This is a false positive caused
by FastAPI's OpenAPI generator not documenting union return types with
AsyncIterator - the streaming functionality with text/event-stream
media type still works when stream=True is passed, it's just not
reflected in the generated OpenAPI spec.
* request-body-type-changed: This is a schema correction - the
old spec incorrectly represented the request body as an object, but
the function signature shows chunks: list[Chunk], so the new spec
correctly shows it as an array, matching the actual API
implementation.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
the directory structure was src/llama-stack-api/llama_stack_api
instead it should just be src/llama_stack_api to match the other
packages.
update the structure and pyproject/linting config
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Without this we get below in server logs
```
RuntimeError: OpenAI response failed: InferenceRouter._construct_metrics() got an unexpected keyword argument
'model_id'
```
Seems the method signature got update but this callsite was not updated
## Test Plan
CI and test with Sabre (Agent framework integration)
# What does this PR do?
Error out when creating vector store with unknown embedding model
Closes https://github.com/llamastack/llama-stack/issues/4047
## Test Plan
Added tests
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Extract API definitions and provider specifications into a standalone
llama-stack-api package that can be published to PyPI independently of
the main llama-stack server.
see: https://github.com/llamastack/llama-stack/pull/2978 and
https://github.com/llamastack/llama-stack/pull/2978#issuecomment-3145115942
Motivation
External providers currently import from llama-stack, which overrides
the installed version and causes dependency conflicts. This separation
allows external providers to:
- Install only the type definitions they need without server
dependencies
- Avoid version conflicts with the installed llama-stack package
- Be versioned and released independently
This enables us to re-enable external provider module tests that were
previously blocked by these import conflicts.
Changes
- Created llama-stack-api package with minimal dependencies (pydantic,
jsonschema)
- Moved APIs, providers datatypes, strong_typing, and schema_utils
- Updated all imports from llama_stack.* to llama_stack_api.*
- Configured local editable install for development workflow
- Updated linting and type-checking configuration for both packages
Next Steps
- Publish llama-stack-api to PyPI
- Update external provider dependencies
- Re-enable external provider module tests
Pre-cursor PRs to this one:
- #4093
- #3954
- #4064
These PRs moved key pieces _out_ of the Api pkg, limiting the scope of
change here.
relates to #3237
## Test Plan
Package builds successfully and can be imported independently. All
pre-commit hooks pass with expected exclusions maintained.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
- force a min precommit version
- pin to >= 4.3.0 when installing
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Building/Deploying docs is failing here:
5530320962 (step):8:49
Needs the playground file. Updated it to reflect current admin status.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Fixed bug where models with No provider_model_id were incorrectly
filtered from the startup config display. The function was checking
multiple fields when it should only filter items with explicitly
disabled provider_id.
Changes:
o Modified remove_disabled_providers to only check provider_id field o
Changed condition from checking multiple fields with None to only
checking provider_id for "__disabled__", None or empty string
o Added comprehensive unit tests
Closes: #4131
Signed-off-by: Derek Higgins <derekh@redhat.com>
We would like to run all OpenAI compatibility tests using only the
openai-client library. This is most friendly for contributors since they
can run tests without needing to update the client-sdks (which is
getting easier but still a long pole.)
This is the first step in enabling that -- no using "library client" for
any of the Responses tests. This seems like a reasonable trade-off since
the usage of an embeddeble library client for Responses (or any
OpenAI-compatible) behavior seems to be not very common. To do this, we
needed to enable MCP tests (which only worked in library client mode)
for server mode.
docs: Add comprehensive Files API and Vector Store integration
documentation
- Add Files API documentation with OpenAI-compatible endpoints
- Create comprehensive guide for OpenAI-compatible file operations
- Reorganize documentation structure: move file operations to files/
directory
- Add vector store provider documentation for Milvus, SQLite-vec, FAISS
- Clean up redundant files and improve navigation
- Update cross-references and eliminate documentation duplication
- Support for release 0.2.14 FileResponse and Vector Store API features
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
A few changes to the storage layer to ensure we reduce unnecessary
contention arising out of our design choices (and letting the database
layer do its correct thing):
- SQL stores now share a single `SqlAlchemySqlStoreImpl` per backend,
and `kvstore_impl` caches instances per `(backend, namespace)`. This
avoids spawning multiple SQLite connections for the same file, reducing
lock contention and aligning the cache story for all backends.
- Added an async upsert API (with SQLite/Postgres dialect inserts) and
routed it through `AuthorizedSqlStore`, then switched conversations and
responses to call it. Using native `ON CONFLICT DO UPDATE` eliminates
the insert-then-update retry window that previously caused long WAL lock
retries.
### Test Plan
Existing tests, added a unit test for `upsert()`
Fixes issues in the storage system by guaranteeing immediate durability
for responses and ensuring background writers stay alive. Three related
fixes:
* Responses to the OpenAI-compatible API now write directly to
Postgres/SQLite inside the request instead of detouring through an async
queue that might never drain; this restores the expected
read-after-write behavior and removes the "response not found" races
reported by users.
* The access-control shim was stamping owner_principal/access_attributes
as SQL NULL, which Postgres interprets as non-public rows; fixing it to
use the empty-string/JSON-null pattern means conversations and responses
stored without an authenticated user stay queryable (matching SQLite).
* The inference-store queue remains for batching, but its worker tasks
now start lazily on the live event loop so server startup doesn't cancel
them—writes keep flowing even when the stack is launched via llama stack
run.
Closes#4115
### Test Plan
Added a matrix entry to test our "base" suite against Postgres as the
store.
Updated documentation to accurately reflect current behavior where
models are identified as provider_id/provider_model_id in the system.
Changes:
o Clarify that model_id is for configuration purposes only o Explain
models are accessed as provider_id/provider_model_id o Remove outdated
aliasing example that suggested model_id could be used
as a custom identifier
This corrects the documentation which previously suggested model_id
could be used to create friendly aliases, which is not how the code
actually works.
Signed-off-by: Derek Higgins <derekh@redhat.com>
Help users find the comprehensive integration testing docs by linking to
the record-replay documentation. This clarifies that the technical
README complements the main docs.
# What does this PR do?
- Updates `/vector_stores/{vector_store_id}/files/{file_id}/content` to
allow returning `embeddings` and `metadata` using the `extra_query`
- Updates the UI accordingly to display them.
- Update UI to support CRUD operations in the Vector Stores section and
adds a new modal exposing the functionality.
- Updates Vector Store update to fail if a user tries to update Provider
ID (which doesn't make sense to allow)
```python
In [1]: client.vector_stores.files.content(
vector_store_id=vector_store.id,
file_id=file.id,
extra_query={"include_embeddings": True, "include_metadata": True}
)
Out [1]: FileContentResponse(attributes={}, content=[Content(text='This is a test document to check if embeddings are generated properly.\n', type='text', embedding=[0.33760684728622437, ...,], chunk_metadata={'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'source': None, 'created_timestamp': 1762053437, 'updated_timestamp': 1762053437, 'chunk_window': '0-13', 'chunk_tokenizer': 'DEFAULT_TIKTOKEN_TOKENIZER', 'chunk_embedding_model': 'sentence-transformers/nomic
-ai/nomic-embed-text-v1.5', 'chunk_embedding_dimension': 768, 'content_token_count': 13, 'metadata_token_count': 9}, metadata={'filename': 'test-embedding.txt', 'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'token_count': 13, 'metadata_token_count': 9})], file_id='file-27291dbc679642ac94ffac6d2810c339', filename='test-embedding.txt')
```
Screenshots of UI are displayed below:
### List Vector Store with Added "Create New Vector Store"
<img width="1912" height="491" alt="Screenshot 2025-11-06 at 10 47
25 PM"
src="https://github.com/user-attachments/assets/a3a3ddd9-758d-4005-ac9c-5047f03916f3"
/>
### Create New Vector Store
<img width="1918" height="1048" alt="Screenshot 2025-11-06 at 10 47
49 PM"
src="https://github.com/user-attachments/assets/b4dc0d31-696f-4e68-b109-27915090f158"
/>
### Edit Vector Store
<img width="1916" height="1355" alt="Screenshot 2025-11-06 at 10 48
32 PM"
src="https://github.com/user-attachments/assets/ec879c63-4cf7-489f-bb1e-57ccc7931414"
/>
### Vector Store Files Contents page (with Embeddings)
<img width="1914" height="849" alt="Screenshot 2025-11-06 at 11 54
32 PM"
src="https://github.com/user-attachments/assets/3095520d-0e90-41f7-83bd-652f6c3fbf27"
/>
### Vector Store Files Contents Details page (with Embeddings)
<img width="1916" height="1221" alt="Screenshot 2025-11-06 at 11 55
00 PM"
src="https://github.com/user-attachments/assets/e71dbdc5-5b49-472b-a43a-5785f58d196c"
/>
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Tests added for Middleware extension and Provider failures.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Add explicit connection cleanup and shorter timeouts to OpenAI client
fixtures. Fixes CI deadlock after 25+ tests due to connection pool
exhaustion. Also adds 60s timeout to test_conversation_context_loading
as safety net.
## Test Plan
tests pass
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
This PR adds Stainless config to specify the Meta copyright file header
for generated files.
Doing it via config instead of custom code will reduce the probability
of git conflict.
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
- review preview builds
Update pypdf dependency to address vulnerabilities causing potential
denial of service through infinite loops or excessive memory usage when
handling malicious PDFs. The update remains fully backward compatible,
with no changes to the PdfReader API.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Fixes#4120
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
In the **Detailed Tutorial**, at **Step 3**, the **Install with venv**
option creates a new virtual environment `client`, activates it then
attempts to install the llama-stack-client using pip.
```
uv venv client --python 3.12
source client/bin/activate
pip install llama-stack-client <- this is the problematic line
```
However, the pip command will likely fail because the `uv venv` command
doesn't, by default, include adding the pip command to the virtual
environment that is created. The pip command will error either because
pip doesn't exist at all, or, if the pip command does exist outside of
the virtual environment, return a different error message. The latter
may be unclear to the user why it is failing.
This PR changes 'pip' to 'uv pip', allowing the install action to
function in the virtual environment as intended, and without the need
for pip to be installed.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
1. Use linux or WSL (virtual environments on Windows use `Scripts`
folder instead of `bin` [virtualenv
#993ba13](993ba1316a)
which doesn't align with the tutorial)
2. Clone the `llama-stack` repo
3. Run the following and verify success:
```
uv venv client --python 3.12
source client/bin/activate
```
5. Run the updated command:
```
uv pip install llama-stack-client
```
6. Observe the console output confirms that the virtual environment
`client` was used:
> Using Python 3.12.3 environment at: **client**
# What does this PR do?
the inspect API lacked any mechanism to get all
non-deprecated APIs (v1, v1alpha, v1beta)
change default to this behavior
'v1' filter can be used for user' wanting a list
of stable APIs
## Test Plan
1. pull the PR
2. launch a LLS server
3. run `curl http://beanlab3.bss.redhat.com:8321/v1/inspect/routes`
4. note there are APIs for `v1`, `v1alpha`, and `v1beta` but no
deprecated APIs
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Delete ~2,000 lines of dead code from the old bespoke inference API that
was replaced by OpenAI-only API. This includes removing unused type
conversion functions, dead provider methods, and event_logger.py.
Clean up imports across the codebase to remove references to deleted
types. This eliminates unnecessary
code and dependencies, helping isolate the API package as a
self-contained module.
This is the last interdependency between the .api package and "exterior"
packages, meaning that now every other package in llama stack imports
the API, not the other way around.
## Test Plan
this is a structural change, no tests needed.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# Problem
Responses API uses max_tool_calls parameter to limit the number of tool
calls that can be generated in a response. Currently, LLS implementation
of the Responses API does not support this parameter.
# What does this PR do?
This pull request adds the max_tool_calls field to the response object
definition and updates the inline provider. it also ensures that:
- the total number of calls to built-in and mcp tools do not exceed
max_tool_calls
- an error is thrown if max_tool_calls < 1 (behavior seen with the
OpenAI Responses API, but we can change this if needed)
Closes #[3563](https://github.com/llamastack/llama-stack/issues/3563)
## Test Plan
- Tested manually for change in model response w.r.t supplied
max_tool_calls field.
- Added integration tests to test invalid max_tool_calls parameter.
- Added integration tests to check max_tool_calls parameter with
built-in and function tools.
- Added integration tests to check max_tool_calls parameter in the
returned response object.
- Recorded OpenAI Responses API behavior using a sample script:
https://github.com/s-akhtar-baig/llama-stack-examples/blob/main/responses/src/max_tool_calls.py
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds OCI GenAI PaaS models for openai chat completion endpoints.
## Test Plan
In an OCI tenancy with access to GenAI PaaS, perform the following
steps:
1. Ensure you have IAM policies in place to use service (check docs
included in this PR)
2. For local development, [setup OCI
cli](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm)
and configure the CLI with your region, tenancy, and auth
[here](https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliconfigure.htm)
3. Once configured, go through llama-stack setup and run llama-stack
(uses config based auth) like:
```bash
OCI_AUTH_TYPE=config_file \
OCI_CLI_PROFILE=CHICAGO \
OCI_REGION=us-chicago-1 \
OCI_COMPARTMENT_OCID=ocid1.compartment.oc1..aaaaaaaa5...5a \
llama stack run oci
```
4. Hit the `models` endpoint to list models after server is running:
```bash
curl http://localhost:8321/v1/models | jq
...
{
"identifier": "meta.llama-4-scout-17b-16e-instruct",
"provider_resource_id": "ocid1.generativeaimodel.oc1.us-chicago-1.am...q",
"provider_id": "oci",
"type": "model",
"metadata": {
"display_name": "meta.llama-4-scout-17b-16e-instruct",
"capabilities": [
"CHAT"
],
"oci_model_id": "ocid1.generativeaimodel.oc1.us-chicago-1.a...q"
},
"model_type": "llm"
},
...
```
5. Use the "display_name" field to use the model in a
`/chat/completions` request:
```bash
# Streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": true,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
# Non-streaming result
curl -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "meta.llama-4-scout-17b-16e-instruct",
"stream": false,
"temperature": 0.9,
"messages": [
{
"role": "system",
"content": "You are a funny comedian. You can be crass."
},
{
"role": "user",
"content": "Tell me a funny joke about programming."
}
]
}'
```
6. Try out other models from the `/models` endpoint.
Mark all register_* / unregister_* APIs as deprecated across models,
shields, tool groups, datasets, benchmarks, and scoring functions. This
is the first step toward moving resource mutations to an `/admin`
namespace as outlined in
https://github.com/llamastack/llama-stack/issues/3809#issuecomment-3492931585.
The deprecation flag will be reflected in the OpenAPI schema to warn API
users that these endpoints are being phased out. Next step will be
implementing the `/admin` route namespace for these resource management
operations.
- `register_model` / `unregister_model`
- `register_shield` / `unregister_shield`
- `register_tool_group` / `unregister_toolgroup`
- `register_dataset` / `unregister_dataset`
- `register_benchmark` / `unregister_benchmark`
- `register_scoring_function` / `unregister_scoring_function`