# What does this PR do?
Enhances the Vector Stores config with a full set of appropriate
configurations.
- Add FileIngestionParams, ChunkRetrievalParams, and FileBatchParams
subconfigs
- Update RAG memory, OpenAI vector store mixin, and vector store utils
to use configuration
- Fix import organization across vector store components
- Add comprehensive vector stores configuration documentation
- Update docs navigation to include vector store configuration guide
- Delete `memory/constants.py` and move constant values directly into
Pydantic models
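A rough sketch of how nested sub-configs with inline defaults might look. Standard-library dataclasses stand in for the Pydantic models here, and every field name and default value below is a hypothetical placeholder rather than the PR's actual schema.

```python
from dataclasses import dataclass, field


# Hypothetical sub-configs; field names and defaults are illustrative only.
@dataclass
class FileIngestionParams:
    default_chunk_size_tokens: int = 512
    default_chunk_overlap_tokens: int = 128


@dataclass
class ChunkRetrievalParams:
    max_chunks: int = 10


@dataclass
class FileBatchParams:
    max_concurrent_files: int = 3


@dataclass
class VectorStoresConfig:
    # default_factory gives each config instance its own sub-config objects.
    file_ingestion_params: FileIngestionParams = field(default_factory=FileIngestionParams)
    chunk_retrieval_params: ChunkRetrievalParams = field(default_factory=ChunkRetrievalParams)
    file_batch_params: FileBatchParams = field(default_factory=FileBatchParams)


# Override one sub-config; the others keep their inline defaults.
config = VectorStoresConfig(chunk_retrieval_params=ChunkRetrievalParams(max_chunks=5))
print(config.chunk_retrieval_params.max_chunks)
print(config.file_ingestion_params.default_chunk_size_tokens)
```

Keeping the defaults on the models themselves is what makes a separate constants module unnecessary in this sketch.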
## Test Plan
Tests updated + CI
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Migrate the Inspect API to the FastAPI router pattern.
Changes:
- Add inspect API to FastAPI router registry
- Add PUBLIC_ROUTE_KEY support for routes that don't require auth
- Update WebMethod creation to respect route's openapi_extra for
authentication requirements
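As a minimal sketch of the idea, assuming the public marker is a boolean stored under a key in the route's `openapi_extra`. The key's value (`"x-public"`), the `Route` stand-in class, and `requires_auth` are hypothetical and only illustrate the lookup.

```python
from dataclasses import dataclass, field

# Assumed marker name; the real constant's value is not shown in this PR text.
PUBLIC_ROUTE_KEY = "x-public"


@dataclass
class Route:
    """Stand-in for a FastAPI route; only the fields used here are modeled."""

    path: str
    openapi_extra: dict = field(default_factory=dict)


def requires_auth(route: Route) -> bool:
    # A route is public only when it explicitly opts out via openapi_extra.
    return not route.openapi_extra.get(PUBLIC_ROUTE_KEY, False)


health = Route("/v1/health", openapi_extra={PUBLIC_ROUTE_KEY: True})
providers = Route("/v1/providers")
print(requires_auth(health), requires_auth(providers))
```

Defaulting to "auth required" means a route that forgets the marker fails closed rather than open.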
Fixes: https://github.com/llamastack/llama-stack/issues/4346
## Test Plan
CI and various curls on /v1/inspect/routes, /v1/health, /v1/version
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Convert Providers API from @webmethod decorators to FastAPI router
pattern.
Fixes: https://github.com/llamastack/llama-stack/issues/4350
## Test Plan
CI
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Consolidates provider data context handling into middleware, eliminating
duplication between FastAPI router routes and legacy @webmethod routes.
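The consolidation can be pictured with a small sketch: one wrapper sets request-scoped context from headers, and both route styles read it. The header name, the context variable, and the handler functions below are hypothetical, and a plain function wrapper stands in for real ASGI middleware.

```python
import contextvars
from typing import Callable

# Hypothetical header name and context variable, for illustration only.
TEST_ID_HEADER = "x-test-id"
current_test_id = contextvars.ContextVar("current_test_id", default=None)


def with_request_context(handler: Callable[[dict], str]) -> Callable[[dict], str]:
    """Wrap any handler so the context is set once, regardless of route style."""

    def wrapped(headers: dict) -> str:
        token = current_test_id.set(headers.get(TEST_ID_HEADER))
        try:
            return handler(headers)
        finally:
            # Always restore the previous value so requests do not leak state.
            current_test_id.reset(token)

    return wrapped


@with_request_context
def router_route(headers: dict) -> str:
    return f"router saw {current_test_id.get()}"


@with_request_context
def legacy_route(headers: dict) -> str:
    return f"legacy saw {current_test_id.get()}"


print(router_route({"x-test-id": "abc"}))
print(legacy_route({}))
```

Because the wrapper owns both the set and the reset, neither route style needs its own copy of the context handling.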
Closes #4366
## Test Plan
Added unit test suite `test_test_context_middleware`, specifically
`test_middleware_extracts_test_id_from_header` to validate the expected
behavior.
```
❯ ./scripts/unit-tests.sh tests/unit/
```
Integration of the middleware test context with the `files` FastAPI
router migration from
[pull/4339](https://github.com/llamastack/llama-stack/pull/4339).
```
❯ git switch migrate-files-api
Switched to branch 'migrate-files-api'
❯ git rebase fix-test-ctx-middleware
Successfully rebased and updated refs/heads/migrate-files-api.
❯ ./scripts/integration-tests.sh --inference-mode replay --suite base --setup ollama --stack-config server:starter --subdirs files
```
Signed-off-by: Matthew F Leader <mleader@redhat.com>
Vector store operations were bypassing ABAC checks by calling providers
directly instead of going through the routing table. This allowed
unauthorized access to vector store data and operations.
Changes:
o Route all VectorIORouter methods through routing table instead of
directly to providers
o Update routing table to enforce ABAC checks on all vector store
operations (read, update, delete)
o Add test suite verifying ABAC enforcement for all vector store
operations
o Ensure providers are never called when authorization fails
Fixes security issue where users could access vector stores they don't
have permission for.
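A toy sketch of the intended call order, assuming a simple owner-based policy in place of the real ABAC rules. `RoutingTable`, `FakeProvider`, and `AccessDeniedError` are hypothetical names used only to show that the check runs before the provider is touched.

```python
class AccessDeniedError(Exception):
    pass


class FakeProvider:
    """Records calls so we can confirm it is never reached on denial."""

    def __init__(self):
        self.calls = []

    def query(self, store_id: str, text: str) -> str:
        self.calls.append((store_id, text))
        return f"results for {text!r}"


class RoutingTable:
    def __init__(self, provider, owners: dict):
        self.provider = provider
        self.owners = owners  # store_id -> owning user (toy policy)

    def query(self, user: str, store_id: str, text: str) -> str:
        # The access check runs before any provider call.
        if self.owners.get(store_id) != user:
            raise AccessDeniedError(f"{user} may not read {store_id}")
        return self.provider.query(store_id, text)


provider = FakeProvider()
table = RoutingTable(provider, owners={"vs_1": "alice"})
print(table.query("alice", "vs_1", "apple"))
try:
    table.query("bob", "vs_1", "apple")
except AccessDeniedError as exc:
    print("denied:", exc)
```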
Fixes: #4393
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Since run.yaml is gone, update logs to say "stack config" or "stack
configuration" rather than "run".
## Test Plan
check logs
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Convert the Datasets API from webmethod decorators to FastAPI router
pattern.
Fixes: https://github.com/llamastack/llama-stack/issues/4344
## Test Plan
CI
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
- Enables users to configure prompts used throughout the File Search /
Vector Retrieval
- Configuration is defined in the Vector Stores Config so they can be
modified at runtime
- Backwards compatible, which means the fields are optional and default
to the previously used values
This is a summary of the new options in the `run.yaml`:
```yaml
vector_stores:
  file_search_params:
    header_template: 'knowledge_search tool found {num_chunks} chunks:\nBEGIN of knowledge_search tool results.\n'
    footer_template: 'END of knowledge_search tool results.\n'
  context_prompt_params:
    chunk_annotation_template: 'Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n'
    context_template: 'The above results were retrieved to help answer the user''s query: "{query}". Use them as supporting information only in answering this query.{annotation_instruction}\n'
  annotation_prompt_params:
    enable_annotations: true
    annotation_instruction_template: 'Cite sources immediately at the end of sentences before punctuation, using `<|file-id|>` format like ''This is a fact <|file-Cn3MSNn72ENTiiq11Qda4A|>.''. Do not add extra punctuation. Use only the file IDs provided, do not invent new ones.'
    chunk_annotation_template: '[{index}] {metadata_text} cite as <|{file_id}|>\n{chunk_text}\n'
```
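To make the placeholders concrete, here is a small sketch that renders the header, per-chunk, and footer templates, assuming they are filled with Python's `str.format`. The chunk objects, the metadata dictionaries, and the 1-based index are assumptions made for illustration.

```python
from types import SimpleNamespace

# Templates taken from the config above, written as normal Python strings.
header_template = "knowledge_search tool found {num_chunks} chunks:\nBEGIN of knowledge_search tool results.\n"
chunk_template = "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n"
footer_template = "END of knowledge_search tool results.\n"

# Hypothetical chunk objects; only a .content attribute is assumed.
chunks = [
    SimpleNamespace(content="banana banana apple"),
    SimpleNamespace(content="orange orange kiwi"),
]

parts = [header_template.format(num_chunks=len(chunks))]
for i, chunk in enumerate(chunks, start=1):
    # str.format supports attribute lookup, so {chunk.content} works directly.
    parts.append(chunk_template.format(index=i, chunk=chunk, metadata={"file": f"f{i}.txt"}))
parts.append(footer_template)

rendered = "".join(parts)
print(rendered)
```

A custom template only needs to keep the placeholder names it uses; unused keyword arguments passed to `str.format` are ignored.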
## Test Plan
Added tests.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# Problem
As an Application Developer, I want to use the include parameter with
the value message.output_text.logprobs, so that I can receive log
probabilities for output tokens to assess the model's confidence in its
response.
# What does this PR do?
- Updates the include parameter in various resource definitions
- Updates the inline provider to return logprobs when
"message.output_text.logprobs" is passed in the include parameter
- Converts the logprobs returned by the inference provider from chat
completion format to responses format
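The conversion step could look roughly like the sketch below. The dictionary shapes are assumptions modeled on OpenAI-style chat completion and Responses payloads, and `chat_logprobs_to_output_text_logprobs` and `maybe_logprobs` are hypothetical helpers, not the provider's code.

```python
LOGPROBS_INCLUDE = "message.output_text.logprobs"


def chat_logprobs_to_output_text_logprobs(chat_logprobs):
    """Flatten chat-completion style logprobs into a per-token list."""
    if not chat_logprobs:
        return []
    converted = []
    for entry in chat_logprobs.get("content") or []:
        converted.append(
            {
                "token": entry["token"],
                "logprob": entry["logprob"],
                "bytes": entry.get("bytes"),
                "top_logprobs": [
                    {"token": top["token"], "logprob": top["logprob"], "bytes": top.get("bytes")}
                    for top in entry.get("top_logprobs", [])
                ],
            }
        )
    return converted


def maybe_logprobs(include, chat_logprobs):
    # Only attach logprobs when the caller asked for them via `include`.
    if LOGPROBS_INCLUDE not in (include or []):
        return None
    return chat_logprobs_to_output_text_logprobs(chat_logprobs)


chat = {"content": [{"token": "Hi", "logprob": -0.1, "bytes": [72, 105], "top_logprobs": []}]}
print(maybe_logprobs(["message.output_text.logprobs"], chat))
```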
Closes #[4260](https://github.com/llamastack/llama-stack/issues/4260)
## Test Plan
- Created a script to explore OpenAI behavior:
https://github.com/s-akhtar-baig/llama-stack-examples/blob/main/responses/src/include.py
- Added integration tests and new recordings
---------
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds a new API for connectors and MCP registry support along with
required types.
Does not include any implementation for it
Closes #4235 and #4061 (partially)
## Test Plan
no tests included
---------
Signed-off-by: Jaideep Rao <jrao@redhat.com>
Co-authored-by: Francisco Javier Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Actualize query rewrite in search API, add
`default_query_expansion_model` and `query_expansion_prompt` in
`VectorStoresConfig`.
Makes `rewrite_query` parameter functional in vector store search.
- `rewrite_query=false` (default): Use original query
- `rewrite_query=true`: Expand query via LLM, or fail gracefully if no
LLM available
Adds 4 parameters to `VectorStoresConfig`:
- `default_query_expansion_model`: LLM model for query expansion
(optional)
- `query_expansion_prompt`: Custom prompt template (optional, uses
built-in default)
- `query_expansion_max_tokens`: Configurable token limit (default: 100)
- `query_expansion_temperature`: Configurable temperature (default: 0.3)
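The two `rewrite_query` branches can be sketched as below. `resolve_query`, `fake_llm`, and the placeholder prompt are hypothetical, and "fail gracefully" is modeled as a clear error, which may differ from the actual behavior.

```python
from typing import Callable, Optional

# Placeholder prompt text; the built-in default is not reproduced here.
DEFAULT_PROMPT = "Rewrite this search query to improve retrieval results: {query}"


def resolve_query(
    query: str,
    rewrite_query: bool = False,
    llm: Optional[Callable[[str], str]] = None,
    prompt: str = DEFAULT_PROMPT,
) -> str:
    if not rewrite_query:
        return query  # default: search with the original query
    if llm is None:
        # Modeled as an explicit error; the real handling may differ.
        raise ValueError("rewrite_query=True requires a configured query expansion model")
    return llm(prompt.format(query=query)).strip()


def fake_llm(prompt_text: str) -> str:
    # Stand-in for a model call; echoes the query with an extra term.
    return prompt_text.rsplit(": ", 1)[-1] + " fruit"


print(resolve_query("kiwi"))
print(resolve_query("kiwi", rewrite_query=True, llm=fake_llm))
```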
Enabled `run.yaml`:
```yaml
vector_stores:
  rewrite_query_params:
    model:
      provider_id: "ollama"
      model_id: "llama3.2:3b-instruct-fp16"
    # prompt defaults to built-in
    # max_tokens defaults to 100
    # temperature defaults to 0.3
```
Fully customized `run.yaml`:
```yaml
vector_stores:
  default_provider_id: faiss
  default_embedding_model:
    provider_id: sentence-transformers
    model_id: nomic-ai/nomic-embed-text-v1.5
  rewrite_query_params:
    model:
      provider_id: ollama
      model_id: llama3.2:3b-instruct-fp16
    prompt: "Rewrite this search query to improve retrieval results by expanding it with relevant synonyms and related terms: {query}"
    max_tokens: 100
    temperature: 0.3
```
## Test Plan
Added test and recording
Example script as well:
```python
import asyncio
from llama_stack_client import LlamaStackClient
from io import BytesIO


def gen_file(client, text: str=""):
    file_buffer = BytesIO(text.encode('utf-8'))
    file_buffer.name = "my_file.txt"
    uploaded_file = client.files.create(
        file=file_buffer,
        purpose="assistants"
    )
    return uploaded_file


async def test_query_rewriting():
    client = LlamaStackClient(base_url="http://0.0.0.0:8321/")
    uploaded_file = gen_file(client, "banana banana apple")
    uploaded_file2 = gen_file(client, "orange orange kiwi")
    vs = client.vector_stores.create()
    xf_vs = client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file.id)
    xf_vs1 = client.vector_stores.files.create(vector_store_id=vs.id, file_id=uploaded_file2.id)
    response1 = client.vector_stores.search(
        vector_store_id=vs.id,
        query="apple",
        max_num_results=3,
        rewrite_query=False
    )
    response2 = client.vector_stores.search(
        vector_store_id=vs.id,
        query="kiwi",
        max_num_results=3,
        rewrite_query=True,
    )
    print(f"\n🔵 Response 1 (rewrite_query=False):\n\033[94m{response1}\033[0m")
    print(f"\n🟢 Response 2 (rewrite_query=True):\n\033[92m{response2}\033[0m")
    for f in [uploaded_file.id, uploaded_file2.id]:
        client.files.delete(file_id=f)
    client.vector_stores.delete(vector_store_id=vs.id)


if __name__ == "__main__":
    asyncio.run(test_query_rewriting())
```
And see the screen shot of the server logs showing it worked.
<img width="1111" height="826" alt="Screenshot 2025-11-19 at 1 16 03 PM"
src="https://github.com/user-attachments/assets/2d188b44-1fef-4df5-b465-2d6728ca49ce"
/>
Notice the log:
```bash
Query rewritten:
'kiwi' → 'kiwi, a small brown or green fruit native to New Zealand, or a person having a fuzzy brown outer skin similar in appearance.'
```
So `kiwi` was expanded.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
Convert the Benchmarks API from @webmethod decorators to FastAPI router
pattern, matching the Batches API structure.
One notable change is the update of stack.py to handle request models in
register_resources().
Closes: #4308
## Test Plan
CI and `curl http://localhost:8321/v1/inspect/routes | jq '.data[] | select(.route | contains("benchmark"))'`
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The build.yaml is only used in the following ways:
1. list-deps
2. distribution code-gen
Since `llama stack build` no longer exists, I found myself asking "why
do we need two different files for list-deps and run"?
Removing the BuildConfig and altering the usage of the
DistributionTemplate in llama stack list-deps is the first step in
removing the build yaml entirely.
Removing the BuildConfig and build.yaml cuts the files users need to
maintain in half, and allows us to focus on the stability of _just_ the
run.yaml.
This PR removes the build.yaml, BuildConfig datatype, and its usage
throughout the codebase. Users are now expected to point to run.yaml
files when running list-deps, and our codebase automatically uses these
types now for things like `get_provider_registry`.
**Additionally, two renames: `StackRunConfig` -> `StackConfig` and
`run.yaml` -> `config.yaml`.**
The build.yaml made sense for when we were managing the build process
for the user and actually _producing_ a run.yaml _from_ the build.yaml,
but now that we are simply just getting the provider registry and
listing the deps, switching to config.yaml simplifies the scope here
greatly.
## Test Plan
existing list-deps usage should work in the tests.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Add "token" to sensitive field patterns in redact_sensitive_fields() to
prevent JWT tokens from being logged in plaintext. Previously only
api_key, api_token, password, and secret were filtered.
This prevents tokens like server.auth.provider_config.jwks.token from
being exposed in server logs.
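A simplified sketch of pattern-based redaction over a nested config. The pattern list comes from the description above; matching by case-insensitive substring on string keys, the mask value, and the `redact` helper itself are assumptions, not the project's implementation.

```python
# Patterns named in the description; matching by substring is an assumption.
SENSITIVE_PATTERNS = ("api_key", "api_token", "password", "secret", "token")


def redact(data):
    """Return a copy of nested dicts/lists with sensitive values masked."""
    if isinstance(data, dict):
        return {
            key: "********" if any(p in key.lower() for p in SENSITIVE_PATTERNS) else redact(value)
            for key, value in data.items()
        }
    if isinstance(data, list):
        return [redact(item) for item in data]
    return data


config = {
    "server": {
        "auth": {
            "provider_config": {
                "jwks": {"uri": "https://example.com/jwks", "token": "example-jwt-value"}
            }
        }
    }
}
print(redact(config))
```

Note that a bare `token` pattern also covers `api_token` under substring matching, so the longer pattern is kept here only to mirror the list in the description.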
Closes: #4324
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
DISTRO_DIR and DISTRIBS_BASE_DIR need to exist for them to be iterated.
Our current logic calls iterdir on them without checking whether they exist.
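A minimal sketch of the guard, assuming a missing directory should simply be treated as empty. `list_distributions` is a hypothetical helper, not the function changed in this PR.

```python
from pathlib import Path
import tempfile


def list_distributions(base_dir: Path) -> list:
    # Iterating a missing directory fails with FileNotFoundError,
    # so check first and treat "missing" as "empty".
    if not base_dir.is_dir():
        return []
    return sorted(p.name for p in base_dir.iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    print(list_distributions(root / "distributions"))  # directory does not exist yet
    (root / "distributions" / "starter").mkdir(parents=True)
    print(list_distributions(root / "distributions"))
```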
## Test Plan
rm ~/.llama/distributions
```
llama stack list-deps starter --format uv | sh
Using Python 3.12.11 environment at: venv
Audited 51 packages in 12ms
Using Python 3.12.11 environment at: venv
Audited 3 packages in 2ms
Using Python 3.12.11 environment at: venv
Audited 1 package in 3ms
Using Python 3.12.11 environment at: venv
Audited 3 packages in 5ms
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Changes SqlRecord creation in AuthorizedSqlStore.fetch_all to use
owner=None when owner_principal is empty/missing, matching the
ResourceWithOwner pattern used in routing tables. This fixes an
inconsistency where SQL store was creating User(principal="") while
routing tables use owner=None for public resources.
Changes:
o Update ProtectedResource Protocol to allow owner: User | None
o Update SqlRecord.__init__ to accept owner: User | None
o Update fetch_all to create owner=None for records without
owner_principal
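The owner mapping can be sketched as follows. The `User` dataclass and `owner_from_row` helper are hypothetical stand-ins for the real types; only the `owner_principal` column name is taken from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    principal: str


def owner_from_row(row: dict) -> Optional[User]:
    # Empty or missing owner_principal means a public resource: owner=None,
    # not a User with an empty principal.
    principal = row.get("owner_principal")
    return User(principal=principal) if principal else None


print(owner_from_row({"owner_principal": "alice"}))
print(owner_from_row({"owner_principal": ""}))
print(owner_from_row({}))
```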
Signed-off-by: Derek Higgins <derekh@redhat.com>
Closes security gaps where RBAC checks could be bypassed:
o Inference router: Added RBAC enforcement in the fallback
path to ensure access control is applied consistently.
o Model listing: Dynamic models fetched via provider_data were returned
without RBAC checks. Added filtering to ensure users only see models
they have permission to access.
Both fixes create temporary ModelWithOwner objects for RBAC validation,
maintaining security through consistent access control enforcement.
Closes: #4269
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This commit introduces a new FastAPI router-based system for defining
API endpoints, enabling a migration path away from the legacy @webmethod
decorator system. The implementation includes router infrastructure,
migration of the Batches API as the first example, and updates to
server, OpenAPI generation, and inspection systems to support both
routing approaches.
The router infrastructure consists of a router registry system that
allows APIs to register FastAPI router factories, which are then
automatically discovered and included in the server application.
Standard error responses are centralized in router_utils to ensure
consistent OpenAPI specification generation with proper $ref references
to component responses.
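The registry idea can be sketched in a few lines. All names here (`register_router`, `build_router`, `create_batches_router`) are hypothetical, and a plain dict stands in for the `fastapi.APIRouter` a real factory would return, so the sketch stays standard-library only.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: API name -> factory that builds a router for an implementation.
_ROUTER_FACTORIES: Dict[str, Callable[[object], object]] = {}


def register_router(api: str):
    def decorator(factory):
        _ROUTER_FACTORIES[api] = factory
        return factory

    return decorator


def build_router(api: str, impl: object) -> Optional[object]:
    """Return a router if one is registered, else None so callers can fall back."""
    factory = _ROUTER_FACTORIES.get(api)
    return factory(impl) if factory else None


@register_router("batches")
def create_batches_router(impl):
    # A real factory would return a fastapi.APIRouter bound to impl.
    return {"prefix": "/v1/batches", "impl": impl}


for api in ("batches", "models"):
    router = build_router(api, impl=object())
    print(api, "-> router" if router else "-> legacy @webmethod discovery")
```

Returning `None` for unregistered APIs is what lets a server keep the legacy discovery path alive while APIs migrate one at a time.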
The Batches API has been migrated to demonstrate the new pattern. The
protocol definition and models remain in llama_stack_api/batches,
maintaining clear separation between API contracts and server
implementation. The FastAPI router implementation lives in
llama_stack/core/server/routers/batches, following the established
pattern where API contracts are defined in llama_stack_api and server
routing logic lives in
llama_stack/core/server.
The server now checks for registered routers before falling back to the
legacy webmethod-based route discovery, ensuring backward compatibility
during the migration period. The OpenAPI generator has been updated to
handle both router-based and webmethod-based routes, correctly
extracting metadata from FastAPI route decorators and Pydantic Field
descriptions. The inspect endpoint now includes routes from both
systems, with proper filtering for deprecated routes and API levels.
Response descriptions are now explicitly defined in router decorators,
ensuring the generated OpenAPI specification matches the previous
format. Error responses use $ref references to component responses
(BadRequest400, TooManyRequests429, etc.) as required by the
specification. This is neat and will allow us to remove a lot of
boilerplate code from our generator once the migration is done.
This implementation provides a foundation for incrementally migrating
other APIs to the router system while maintaining full backward
compatibility with existing webmethod-based APIs.
Closes: https://github.com/llamastack/llama-stack/issues/4188
## Test Plan
CI, the server should start, same routes should be visible.
```
curl http://localhost:8321/v1/inspect/routes | jq '.data[] | select(.route | contains("batches"))'
```
Also:
```
uv run pytest tests/integration/batches/ -vv --stack-config=http://localhost:8321
================================================== test session starts ==================================================
platform darwin -- Python 3.12.8, pytest-8.4.2, pluggy-1.6.0 -- /Users/leseb/Documents/AI/llama-stack/.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'macOS-26.0.1-arm64-arm-64bit', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'anyio': '4.9.0', 'html': '4.1.1', 'socket': '0.7.0', 'asyncio': '1.1.0', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'cov': '6.2.1', 'nbval': '0.11.0'}}
rootdir: /Users/leseb/Documents/AI/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, html-4.1.1, socket-0.7.0, asyncio-1.1.0, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, cov-6.2.1, nbval-0.11.0
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 24 items
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_creation_and_retrieval[None] SKIPPED [ 4%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_listing[None] SKIPPED [ 8%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_immediate_cancellation[None] SKIPPED [ 12%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_chat_completions[None] SKIPPED [ 16%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_completions[None] SKIPPED [ 20%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_endpoint[None] SKIPPED [ 25%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_completed[None] SKIPPED [ 29%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_missing_required_fields[None] SKIPPED [ 33%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_completion_window[None] SKIPPED [ 37%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_streaming_not_supported[None] SKIPPED [ 41%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_mixed_streaming_requests[None] SKIPPED [ 45%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_endpoint_mismatch[None] SKIPPED [ 50%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_missing_required_body_fields[None] SKIPPED [ 54%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_invalid_metadata_types[None] SKIPPED [ 58%]
tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_e2e_embeddings[None] SKIPPED [ 62%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_nonexistent_file_id PASSED [ 66%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_malformed_jsonl PASSED [ 70%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[empty] XFAIL [ 75%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[malformed] XFAIL [ 79%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_retrieve_nonexistent PASSED [ 83%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_nonexistent PASSED [ 87%]
tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_error_handling_invalid_model PASSED [ 91%]
tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotent_batch_creation_successful PASSED [ 95%]
tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotency_conflict_with_different_params PASSED [100%]
================================================= slowest 10 durations ==================================================
1.01s call tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotent_batch_creation_successful
0.21s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_nonexistent_file_id
0.17s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_malformed_jsonl
0.12s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_error_handling_invalid_model
0.05s setup tests/integration/batches/test_batches.py::TestBatchesIntegration::test_batch_creation_and_retrieval[None]
0.02s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[empty]
0.01s call tests/integration/batches/test_batches_idempotency.py::TestBatchesIdempotencyIntegration::test_idempotency_conflict_with_different_params
0.01s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_file_malformed_batch_file[malformed]
0.01s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_retrieve_nonexistent
0.00s call tests/integration/batches/test_batches_errors.py::TestBatchesErrorHandling::test_batch_cancel_nonexistent
======================================= 7 passed, 15 skipped, 2 xfailed in 1.78s ========================================
```
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Category-specific log levels from LLAMA_STACK_LOGGING were not applied to
loggers created before setup_logging() was called. This fix moves the
setup_logging() call earlier in the initialization sequence to ensure all
loggers respect their configured levels regardless of initialization
timing.
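A simplified model of this failure mode, not the project's logging code: if a logger's level is looked up once at creation time, any logger created before the level table is populated keeps the fallback level. The `category=level;...` format and both helper functions are assumptions made for the sketch.

```python
import logging

_category_levels: dict = {}  # filled in by setup_logging()


def setup_logging(spec: str) -> None:
    """Parse 'category=level;...' (format assumed) into the level table."""
    for pair in filter(None, spec.split(";")):
        category, _, level = pair.partition("=")
        _category_levels[category.strip()] = getattr(logging, level.strip().upper())


def get_logger(name: str, category: str) -> logging.Logger:
    logger = logging.getLogger(name)
    # The level is read once, at creation time, so ordering matters.
    logger.setLevel(_category_levels.get(category, logging.INFO))
    return logger


early = get_logger("early", "server")  # created before setup: stuck at INFO
setup_logging("server=debug")
late = get_logger("late", "server")  # created after setup: DEBUG
print(logging.getLevelName(early.level), logging.getLevelName(late.level))
```

Calling the setup function before any logger is created removes the ordering dependency, which is the approach the fix describes.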
Closes: #4252
Signed-off-by: Derek Higgins <derekh@redhat.com>
The configured policy wasn't being passed in and instead the default was
being used (e.g. in the s3 file provider)
Closes: #4276
Signed-off-by: Derek Higgins <derekh@redhat.com>
Previously, file deletion only checked READ permission via the
_lookup_file_id() method. This meant any user with READ access to a file
could also delete it, making it impossible to configure read-only file
access.
This change adds an 'action' parameter to fetch_all() and fetch_one() in
AuthorizedSqlStore, defaulting to Action.READ for backward
compatibility. The openai_delete_file() method now passes Action.DELETE,
ensuring proper RBAC enforcement.
With this fix, access policies can now distinguish between users who can
read/list files and users who can also delete them.
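The shape of the change can be sketched as below, with a toy in-memory policy in place of the real access rules. `POLICY`, `FILES`, and `delete_file` are hypothetical; only the `action` parameter defaulting to a READ action mirrors the description.

```python
from enum import Enum


class Action(Enum):
    READ = "read"
    DELETE = "delete"


# Toy policy: which actions each user may perform (illustrative only).
POLICY = {"reader": {Action.READ}, "admin": {Action.READ, Action.DELETE}}
FILES = [{"id": "file-1"}, {"id": "file-2"}]


def fetch_all(user: str, action: Action = Action.READ) -> list:
    # Defaulting to READ keeps existing callers working unchanged.
    if action not in POLICY.get(user, set()):
        return []
    return list(FILES)


def delete_file(user: str, file_id: str) -> bool:
    # Deletion looks the file up under DELETE, not READ.
    visible = fetch_all(user, action=Action.DELETE)
    return any(f["id"] == file_id for f in visible)


print(len(fetch_all("reader")), delete_file("reader", "file-1"), delete_file("admin", "file-1"))
```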
Closes: #4274
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
Fixes: https://github.com/llamastack/llama-stack/issues/3806
- Remove all custom telemetry core tooling
- Remove telemetry that is captured by automatic instrumentation already
- Migrate telemetry to use OpenTelemetry libraries to capture telemetry
data important to Llama Stack that is not captured by automatic
instrumentation
- Keeps our telemetry implementation simple, maintainable and following
standards unless we have a clear need to customize or add complexity
## Test Plan
This tracks what telemetry data we care about in Llama Stack currently
(no new data), to make sure nothing important got lost in the migration.
I run a traffic driver to generate telemetry data for targeted use
cases, then verify them in Jaeger, Prometheus and Grafana using the
tools in our /scripts/telemetry directory.
### Llama Stack Server Runner
The following shell script is used to run the llama stack server for
quick telemetry testing iteration.
```sh
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME="llama-stack-server"
export OTEL_SPAN_PROCESSOR="simple"
export OTEL_EXPORTER_OTLP_TIMEOUT=1
export OTEL_BSP_EXPORT_TIMEOUT=1000
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
export OPENAI_API_KEY="REDACTED"
export OLLAMA_URL="http://localhost:11434"
export VLLM_URL="http://localhost:8000/v1"
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack run starter
```
### Test Traffic Driver
This Python script drives traffic to the llama stack server, which sends
telemetry to a locally hosted instance of the OTLP collector, Grafana,
Prometheus, and Jaeger.
```sh
export OTEL_SERVICE_NAME="openai-client"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"
export GITHUB_TOKEN="REDACTED"
export MLFLOW_TRACKING_URI="http://127.0.0.1:5001"
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument python main.py
```
```python
from openai import OpenAI
import os
import requests


def main():
    github_token = os.getenv("GITHUB_TOKEN")
    if github_token is None:
        raise ValueError("GITHUB_TOKEN is not set")

    client = OpenAI(
        api_key="fake",
        base_url="http://localhost:8321/v1/",
    )

    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}]
    )
    print("Sync response: ", response.choices[0].message.content)

    streaming_response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
        stream_options={"include_usage": True}
    )
    print("Streaming response: ", end="", flush=True)
    for chunk in streaming_response:
        if chunk.usage is not None:
            print("Usage: ", chunk.usage)
        if chunk.choices and chunk.choices[0].delta is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

    ollama_response = client.chat.completions.create(
        model="ollama/llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("Ollama response: ", ollama_response.choices[0].message.content)

    vllm_response = client.chat.completions.create(
        model="vllm/Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "How are you doing today?"}]
    )
    print("VLLM response: ", vllm_response.choices[0].message.content)

    responses_list_tools_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "What tools are available?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses list tools response: ", responses_list_tools_response.output_text)

    responses_tool_call_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "How many repositories does the token have access to?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses tool call response: ", responses_tool_call_response.output_text)

    # make shield call using http request until the client version error is resolved
    llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY")
    base_url = "http://localhost:8321/v1/"
    shield_id = "llama-guard-ollama"
    shields_url = f"{base_url}safety/run-shield"
    headers = {
        "Authorization": f"Bearer {llama_stack_api_key}",
        "Content-Type": "application/json"
    }
    payload = {
        "shield_id": shield_id,
        "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}],
        "params": {}
    }
    shields_response = requests.post(shields_url, json=payload, headers=headers)
    shields_response.raise_for_status()
    print("risk assessment response: ", shields_response.json())


if __name__ == "__main__":
    main()
```
### Span Data
#### Inference
| Value | Location | Content | Test Cases | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Input Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Output Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Completion Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt | Client | string | Any Inference Provider, responses | Auto Instrument | Working, no responses | None |
#### Safety
| Value | Location | Content | Testing | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [Shield ID](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Metadata](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Messages](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Response](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Status](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
#### Remote Tool Listing & Execution
| Value | Location | Content | Testing | Handled By | Status | Notes |
| ----- | :---: | :---: | :---: | :---: | :---: | :---: |
| Tool name | server | string | Tool call occurs | Custom Code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server URL | server | string | List tools or execute tool call | Custom Code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server Label | server | string | List tools or execute tool call | Custom code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| mcp\_list\_tools\_id | server | string | List tools | Custom code | working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
### Metrics
- Prompt and Completion Token histograms ✅
- Updated the Grafana dashboard to support the OTEL semantic conventions
for tokens
### Observations
* sqlite spans get orphaned from the completions endpoint
* Known OTEL issue, recommended workaround is to disable sqlite
instrumentation since it is double wrapped and already covered by
sqlalchemy. This is covered in documentation.
```shell
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
```
* Responses API instrumentation is
[missing](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/3436)
in open telemetry for OpenAI clients, even with traceloop or openllmetry
* Upstream issues in opentelemetry-python-contrib
* Span created for each streaming response, so each chunk → very large
spans get created, which is not ideal, but it’s the intended behavior
* MCP telemetry needs to be updated to follow semantic conventions. We
can probably use a library for this and handle it in a separate issue.
### Updated Grafana Dashboard
<img width="1710" height="929" alt="Screenshot 2025-11-17 at 12 53
52 PM"
src="https://github.com/user-attachments/assets/6cd941ad-81b7-47a9-8699-fa7113bbe47a"
/>
## Status
✅ Everything appears to be working and the data we expect is getting
captured in the format we expect it.
## Follow Ups
1. Make tool calling spans follow semconv and capture more data
   1. Consider using an existing tracing library
2. Make shield spans follow semconv
3. Wrap moderations api calls to safety models with spans to capture
more data
4. Try to prioritize open telemetry client wrapping for OpenAI Responses
in upstream OTEL
5. This would break the telemetry tests, and they are currently
disabled. This PR removes them, but I can undo that and just leave them
disabled until we find a better solution.
6. Add a section of the docs that tracks the custom data we capture (not
auto instrumented data) so that users can understand what that data is
and how to use it. Commit those changes to the OTEL-gen_ai SIG if
possible as well. Here is an
[example](https://opentelemetry.io/docs/specs/semconv/gen-ai/aws-bedrock/)
of how bedrock handles it.
# What does this PR do?
Since `StackRunConfig` requires certain parts of `StorageConfig`, it
makes sense to template in some defaults that will "just work" for most
use cases.
Specifically, introduce `ServerStoresConfig` defaults for inference,
metadata, conversations, and prompts. We already funnel in defaults for
these sections ad hoc throughout the codebase.
Additionally, set some `backends` defaults for the `StorageConfig`.
This will alleviate some weirdness for `--providers` in run/list-deps
and also supports some work I have to better align our list-deps/run
datatypes.
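A minimal sketch of the "defaults that just work" idea, using plain dataclasses instead of the real Pydantic models. Only `StorageConfig` and `ServerStoresConfig` are names from this PR; `StoreReference`, the backend names, and every default value below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StoreReference:
    # Hypothetical: points a store at a named backend.
    backend: str
    name: str


def _default_backends() -> dict:
    # Hypothetical backend names and paths, for illustration only.
    return {
        "kv_default": {"type": "kv_sqlite", "db_path": "kvstore.db"},
        "sql_default": {"type": "sql_sqlite", "db_path": "sql_store.db"},
    }


@dataclass
class ServerStoresConfig:
    # Each store gets a default so callers no longer fill these in ad hoc.
    metadata: StoreReference = field(
        default_factory=lambda: StoreReference("kv_default", "registry"))
    inference: StoreReference = field(
        default_factory=lambda: StoreReference("sql_default", "inference"))
    conversations: StoreReference = field(
        default_factory=lambda: StoreReference("sql_default", "conversations"))
    prompts: StoreReference = field(
        default_factory=lambda: StoreReference("kv_default", "prompts"))


@dataclass
class StorageConfig:
    # default_factory gives every config its own mutable dict.
    backends: dict = field(default_factory=_default_backends)
    stores: ServerStoresConfig = field(default_factory=ServerStoresConfig)


config = StorageConfig()  # usable without any explicit storage section
```

Using factories rather than shared literals keeps one config's mutations from leaking into another.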
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
These primitives (used both by the Stack as well as provider
implementations) can be thought of fruitfully as internal-only APIs
which can themselves have multiple implementations. We use the new
`llama_stack_api.internal` namespace for this.
In addition: the change moves kv/sql store impls, configs, and
dependency helpers under `core/storage`
## Testing
`pytest tests/unit/utils/test_authorized_sqlstore.py`, other existing CI
# What does this PR do?
This replaces the legacy "pyopenapi + strong_typing" pipeline with a
FastAPI-backed generator that has an explicit schema registry inside
`llama_stack_api`. The key changes:
1. **New generator architecture.** FastAPI now builds the OpenAPI schema
directly from the real routes, while helper modules
(`schema_collection`, `endpoints`, `schema_transforms`, etc.)
post-process the result. The old pyopenapi stack and its strong_typing
helpers are removed entirely, so we no longer rely on fragile AST
analysis or top-level import side effects.
2. **Schema registry in `llama_stack_api`.** `schema_utils.py` keeps a
`SchemaInfo` record for every `@json_schema_type`, `register_schema`,
and dynamically created request model. The OpenAPI generator and other
tooling query this registry instead of scanning the package tree,
producing deterministic names (e.g., `{MethodName}Request`), capturing
all optional/nullable fields, and making schema discovery testable. A
new unit test covers the registry behavior.
3. **Regenerated specs + CI alignment.** All docs/Stainless specs are
regenerated from the new pipeline, so optional/nullable fields now match
reality (expect the API Conformance workflow to report breaking
changes—this PR establishes the new baseline). The workflow itself is
back to the stock oasdiff invocation so future regressions surface
normally.
*Conformance will be RED on this PR; we choose to accept the
deviations.*
## Test Plan
- `uv run pytest tests/unit/server/test_schema_registry.py`
- `uv run python -m scripts.openapi_generator.main docs/static`
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds a user-facing `authorization` parameter to MCP tool definitions
that allows users to explicitly configure credentials per MCP server,
addressing GitHub Issue #4034 in a secure manner.
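As a rough illustration of per-server credentials, a tool definition could be assembled like this. The helper `build_mcp_tool` is hypothetical, and the `type`, `server_label`, and `server_url` keys are assumed from the OpenAI-style MCP tool shape; only the `authorization` parameter itself comes from this PR.

```python
from typing import Optional


def build_mcp_tool(server_label: str, server_url: str,
                   authorization: Optional[str] = None) -> dict:
    """Hypothetical helper: build one MCP tool definition."""
    tool = {
        "type": "mcp",
        "server_label": server_label,
        "server_url": server_url,
    }
    # Only attach credentials when the caller supplied them explicitly.
    if authorization is not None:
        tool["authorization"] = authorization
    return tool


# Each MCP server carries its own credential (placeholder values).
tools = [
    build_mcp_tool("docs", "https://mcp.example.com/docs", authorization="token-a"),
    build_mcp_tool("public", "https://mcp.example.com/public"),
]
```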
## Test Plan
tests/integration/responses/test_mcp_authentication.py
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
The directory structure was `src/llama-stack-api/llama_stack_api`;
it should just be `src/llama_stack_api` to match the other
packages.
Update the structure and the pyproject/linting config.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Without this, we get the error below in the server logs:
```
RuntimeError: OpenAI response failed: InferenceRouter._construct_metrics() got an unexpected keyword argument
'model_id'
```
It seems the method signature got updated but this call site was not.
## Test Plan
CI and test with Sabre (Agent framework integration)
# What does this PR do?
Error out when creating vector store with unknown embedding model
Closes https://github.com/llamastack/llama-stack/issues/4047
## Test Plan
Added tests
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Extract API definitions and provider specifications into a standalone
llama-stack-api package that can be published to PyPI independently of
the main llama-stack server.
see: https://github.com/llamastack/llama-stack/pull/2978 and
https://github.com/llamastack/llama-stack/pull/2978#issuecomment-3145115942
Motivation
External providers currently import from llama-stack, which overrides
the installed version and causes dependency conflicts. This separation
allows external providers to:
- Install only the type definitions they need without server
dependencies
- Avoid version conflicts with the installed llama-stack package
- Be versioned and released independently
This enables us to re-enable external provider module tests that were
previously blocked by these import conflicts.
Changes
- Created llama-stack-api package with minimal dependencies (pydantic,
jsonschema)
- Moved APIs, providers datatypes, strong_typing, and schema_utils
- Updated all imports from llama_stack.* to llama_stack_api.*
- Configured local editable install for development workflow
- Updated linting and type-checking configuration for both packages
Next Steps
- Publish llama-stack-api to PyPI
- Update external provider dependencies
- Re-enable external provider module tests
Pre-cursor PRs to this one:
- #4093
- #3954
- #4064
These PRs moved key pieces _out_ of the Api pkg, limiting the scope of
change here.
relates to #3237
## Test Plan
Package builds successfully and can be imported independently. All
pre-commit hooks pass with expected exclusions maintained.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Fixed a bug where models with no provider_model_id were incorrectly
filtered from the startup config display. The function was checking
multiple fields when it should only filter items with an explicitly
disabled provider_id.
Changes:
- Modified `remove_disabled_providers` to only check the `provider_id` field
- Changed the condition from checking multiple fields for None to only
checking `provider_id` for `"__disabled__"`, None, or an empty string
- Added comprehensive unit tests
Closes: #4131
Signed-off-by: Derek Higgins <derekh@redhat.com>
A few changes to the storage layer to ensure we reduce unnecessary
contention arising out of our design choices (and letting the database
layer do its correct thing):
- SQL stores now share a single `SqlAlchemySqlStoreImpl` per backend,
and `kvstore_impl` caches instances per `(backend, namespace)`. This
avoids spawning multiple SQLite connections for the same file, reducing
lock contention and aligning the cache story for all backends.
- Added an async upsert API (with SQLite/Postgres dialect inserts) and
routed it through `AuthorizedSqlStore`, then switched conversations and
responses to call it. Using native `ON CONFLICT DO UPDATE` eliminates
the insert-then-update retry window that previously caused long WAL lock
retries.
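To show why a native upsert removes the insert-then-update window, here is a small synchronous sketch using the standard-library `sqlite3` module. The PR's API is async and goes through SQLAlchemy dialect inserts; the table and the `upsert` helper below are hypothetical stand-ins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE conversations (id TEXT PRIMARY KEY, title TEXT NOT NULL)"
)


def upsert(conn, conv_id, title):
    # One statement: insert, or update the existing row on a key conflict.
    # There is no separate "insert failed, now update" round trip to retry.
    conn.execute(
        "INSERT INTO conversations (id, title) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET title = excluded.title",
        (conv_id, title),
    )
    conn.commit()


upsert(conn, "conv-1", "first title")
upsert(conn, "conv-1", "updated title")  # same key: updates in place
rows = conn.execute("SELECT id, title FROM conversations").fetchall()
```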
### Test Plan
Existing tests, added a unit test for `upsert()`
# What does this PR do?
- Updates `/vector_stores/{vector_store_id}/files/{file_id}/content` to
allow returning `embeddings` and `metadata` using the `extra_query`
- Updates the UI accordingly to display them.
- Update UI to support CRUD operations in the Vector Stores section and
adds a new modal exposing the functionality.
- Updates Vector Store update to fail if a user tries to update Provider
ID (which doesn't make sense to allow)
```python
In [1]: client.vector_stores.files.content(
vector_store_id=vector_store.id,
file_id=file.id,
extra_query={"include_embeddings": True, "include_metadata": True}
)
Out [1]: FileContentResponse(attributes={}, content=[Content(text='This is a test document to check if embeddings are generated properly.\n', type='text', embedding=[0.33760684728622437, ...,], chunk_metadata={'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'source': None, 'created_timestamp': 1762053437, 'updated_timestamp': 1762053437, 'chunk_window': '0-13', 'chunk_tokenizer': 'DEFAULT_TIKTOKEN_TOKENIZER', 'chunk_embedding_model': 'sentence-transformers/nomic-ai/nomic-embed-text-v1.5', 'chunk_embedding_dimension': 768, 'content_token_count': 13, 'metadata_token_count': 9}, metadata={'filename': 'test-embedding.txt', 'chunk_id': '62a63ae0-c202-f060-1b86-0a688995b8d3', 'document_id': 'file-27291dbc679642ac94ffac6d2810c339', 'token_count': 13, 'metadata_token_count': 9})], file_id='file-27291dbc679642ac94ffac6d2810c339', filename='test-embedding.txt')
```
Screenshots of UI are displayed below:
### List Vector Store with Added "Create New Vector Store"
<img width="1912" height="491" alt="Screenshot 2025-11-06 at 10 47
25 PM"
src="https://github.com/user-attachments/assets/a3a3ddd9-758d-4005-ac9c-5047f03916f3"
/>
### Create New Vector Store
<img width="1918" height="1048" alt="Screenshot 2025-11-06 at 10 47
49 PM"
src="https://github.com/user-attachments/assets/b4dc0d31-696f-4e68-b109-27915090f158"
/>
### Edit Vector Store
<img width="1916" height="1355" alt="Screenshot 2025-11-06 at 10 48
32 PM"
src="https://github.com/user-attachments/assets/ec879c63-4cf7-489f-bb1e-57ccc7931414"
/>
### Vector Store Files Contents page (with Embeddings)
<img width="1914" height="849" alt="Screenshot 2025-11-06 at 11 54
32 PM"
src="https://github.com/user-attachments/assets/3095520d-0e90-41f7-83bd-652f6c3fbf27"
/>
### Vector Store Files Contents Details page (with Embeddings)
<img width="1916" height="1221" alt="Screenshot 2025-11-06 at 11 55
00 PM"
src="https://github.com/user-attachments/assets/e71dbdc5-5b49-472b-a43a-5785f58d196c"
/>
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Tests added for Middleware extension and Provider failures.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
The inspect API lacked any mechanism to get all
non-deprecated APIs (v1, v1alpha, v1beta).
Change the default to this behavior.
The 'v1' filter can be used by users wanting a list
of stable APIs.
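The new default can be pictured with the sketch below. The `Route` record and the `list_routes` function are hypothetical; the real endpoint's parameter name and route representation may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    # Hypothetical route record for illustration.
    path: str
    level: str  # "v1", "v1alpha", or "v1beta"
    deprecated: bool = False


def list_routes(routes: List[Route], api_filter: Optional[str] = None) -> List[Route]:
    active = [r for r in routes if not r.deprecated]
    if api_filter is None:
        # New default: every non-deprecated route, regardless of level.
        return active
    # An explicit filter such as "v1" narrows to one level.
    return [r for r in active if r.level == api_filter]


routes = [
    Route("/v1/models", "v1"),
    Route("/v1alpha/example", "v1alpha"),
    Route("/v1beta/example", "v1beta"),
    Route("/v1/old-example", "v1", deprecated=True),
]
default_paths = [r.path for r in list_routes(routes)]
stable_paths = [r.path for r in list_routes(routes, "v1")]
```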
## Test Plan
1. pull the PR
2. launch a LLS server
3. run `curl http://beanlab3.bss.redhat.com:8321/v1/inspect/routes`
4. note there are APIs for `v1`, `v1alpha`, and `v1beta` but no
deprecated APIs
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
Delete ~2,000 lines of dead code from the old bespoke inference API that
was replaced by OpenAI-only API. This includes removing unused type
conversion functions, dead provider methods, and event_logger.py.
Clean up imports across the codebase to remove references to deleted
types. This eliminates unnecessary
code and dependencies, helping isolate the API package as a
self-contained module.
This is the last interdependency between the .api package and "exterior"
packages, meaning that now every other package in llama stack imports
the API, not the other way around.
## Test Plan
this is a structural change, no tests needed.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
This dependency has been bothering folks for a long time (cc @leseb). We
really needed it due to "library client" which is primarily used for our
tests and is not a part of the Stack server. Anyone who needs to use the
library client can certainly install `llama-stack-client` in their
environment to make that work.
Updated the notebook references to install `llama-stack-client`
additionally when setting things up.
# What does this PR do?
Remove circular dependency by moving tracing from API protocol
definitions
to router implementation layer.
This gets us closer to having a self contained API package with no other
cross-cutting dependencies to other parts of the llama stack codebase.
To the best of our ability, the llama_stack.api should only be type and
protocol definitions.
Changes:
- Create apis/common/tracing.py with marker decorator (zero core
dependencies)
- Add the _new_ `@telemetry_traceable` marker decorator to 11 protocol
classes
- Apply actual tracing in core/resolver.py in `instantiate_provider`
based on protocol marker
- Move MetricResponseMixin from core to apis (it's an API response type)
- APIs package is now self-contained with zero core dependencies
The tracing functionality remains identical - actual trace_protocol from
core
is applied to router implementations at runtime when both telemetry is
enabled
and the protocol has the `__marked_for_tracing__` marker.
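The marker pattern described above can be sketched as follows. The names `telemetry_traceable`, `trace_protocol`, `instantiate_provider`, and `__marked_for_tracing__` come from this PR, but the bodies and signatures here are simplified assumptions: the stand-in `trace_protocol` only records method names instead of creating spans.

```python
import functools

CALLS = []  # stand-in for emitted spans


def telemetry_traceable(cls):
    """Marker only: no tracing imports, so the API package stays dependency-free."""
    cls.__marked_for_tracing__ = True
    return cls


def _traced(name, fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        CALLS.append(name)
        return fn(*args, **kwargs)
    return wrapper


def trace_protocol(cls):
    """Simplified stand-in for the core tracer: wrap public methods in a subclass."""
    wrapped = {
        name: _traced(name, attr)
        for name, attr in vars(cls).items()
        if callable(attr) and not name.startswith("_")
    }
    return type("Traced" + cls.__name__, (cls,), wrapped)


@telemetry_traceable
class Inference:  # protocol definition lives in the API package
    def chat(self, prompt):
        raise NotImplementedError


class InferenceRouter(Inference):  # implementation lives in core
    def chat(self, prompt):
        return "echo: " + prompt


def instantiate_provider(impl_cls, protocol, telemetry_enabled):
    # Tracing is applied at runtime, and only when both conditions hold.
    if telemetry_enabled and getattr(protocol, "__marked_for_tracing__", False):
        impl_cls = trace_protocol(impl_cls)
    return impl_cls()


router = instantiate_provider(InferenceRouter, Inference, telemetry_enabled=True)
reply = router.chat("hi")
```

The protocol class only carries a boolean attribute, so importing it never pulls in the tracing machinery.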
## Test Plan
Manual integration test confirms identical behavior to main branch:
```bash
llama stack list-deps --format uv starter | sh
export OLLAMA_URL=http://localhost:11434
llama stack run starter
curl -X POST http://localhost:8321/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "ollama/gpt-oss:20b",
"messages": [{"role": "user", "content": "Say hello"}],
"max_tokens": 10}'
```
Verified identical between main and this branch:
- trace_id present in response
- metrics array with prompt_tokens, completion_tokens, total_tokens
- Server logs show trace_protocol applied to all routers
Existing telemetry integration tests (tests/integration/telemetry/) validate
trace context propagation and span attributes.
relates to #3895
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
- When create vector store is called without a chunk strategy, we now
record the strategy actually used so that the value is persisted
instead of strategy='None'
## Test Plan
updated tests
## What does this PR do?
The starter distribution now comes with all the required packages to
support persistent stores—like the agent store, metadata, and
inference—using PostgreSQL. Users can enable PostgreSQL support by
setting the `ENABLE_POSTGRES_STORE=1` environment variable.
This PR consolidates the functionality from the removed `postgres-demo`
distribution into the starter distribution, reducing maintenance
overhead.
**Closes: #2619**
**Supersedes: #2851** (rebased and updated)
## Changes Made
1. **Added PostgreSQL support to starter distribution**
- New `run-with-postgres-store.yaml` configuration
- Automatic config switching via `ENABLE_POSTGRES_STORE` environment
variable
- Removed separate `postgres-demo` distribution
2. **Updated to new build system**
- Integrated postgres switching logic into Containerfile entrypoint
- Uses new `storage_backends` and `storage_stores` API
- Properly configured both PostgreSQL KV store and SQL store
3. **Updated dependencies**
- Added `psycopg2-binary` and `asyncpg` to starter distribution
- All postgres-related dependencies automatically included
## How to Use
### With Docker (PostgreSQL):
```bash
docker run \
-e ENABLE_POSTGRES_STORE=1 \
-e POSTGRES_HOST=your_postgres_host \
-e POSTGRES_PORT=5432 \
-e POSTGRES_DB=llamastack \
-e POSTGRES_USER=llamastack \
-e POSTGRES_PASSWORD=llamastack \
-e OPENAI_API_KEY=your_key \
llamastack/distribution-starter
```
### PostgreSQL environment variables:
- `POSTGRES_HOST`: Postgres host (default: `localhost`)
- `POSTGRES_PORT`: Postgres port (default: `5432`)
- `POSTGRES_DB`: Postgres database name (default: `llamastack`)
- `POSTGRES_USER`: Postgres username (default: `llamastack`)
- `POSTGRES_PASSWORD`: Postgres password (default: `llamastack`)
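For reference, the documented variables and defaults can be resolved in Python as in the sketch below. The helper names are hypothetical, and treating only the value `1` as "enabled" is an assumption; the actual switching happens in the container entrypoint.

```python
import os

# Defaults as documented in the list above.
POSTGRES_DEFAULTS = {
    "POSTGRES_HOST": "localhost",
    "POSTGRES_PORT": "5432",
    "POSTGRES_DB": "llamastack",
    "POSTGRES_USER": "llamastack",
    "POSTGRES_PASSWORD": "llamastack",
}


def postgres_settings(env=None):
    """Hypothetical helper: resolve each variable, falling back to its default."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in POSTGRES_DEFAULTS.items()}


def postgres_enabled(env=None):
    """Hypothetical helper: assumes only the value "1" enables the store."""
    env = os.environ if env is None else env
    return env.get("ENABLE_POSTGRES_STORE") == "1"


settings = postgres_settings({"POSTGRES_HOST": "db.internal"})
```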
## Test Plan
All pre-commit hooks pass (mypy, ruff, distro-codegen)
`llama stack list-deps starter` confirms psycopg2-binary is included
Storage configuration correctly uses PostgreSQL backends
Container builds successfully with postgres support
## Credits
Original work by @leseb in #2851. Rebased and updated by @r-bit-rry to
work with latest main.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Sébastien Han @leseb
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
1. Make telemetry tests as easy as possible for users by expanding the
`SpanStub` data class and creating the `MetricStub` dataclass as a way
to consistently marshal telemetry data in test fixtures and unmarshal
and handle it in tests.
2. Structure server and client tests to always follow the same standards
for consistent testing experience by using the `SpanStub` and
`MetricStub` data class objects.
3. Enable Metrics Testing for completions endpoint
4. Correct token metrics to use histograms instead of counts to capture
tokens per request rather than a cumulative count of tokens over the
lifecycle of the server.
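The difference behind item 4 can be illustrated with two toy instruments. These classes are simplified stand-ins written for this example, not the OpenTelemetry SDK types, and the bucket boundaries are arbitrary.

```python
from bisect import bisect_left


class Counter:
    """Keeps only a running total over the server's lifetime."""

    def __init__(self):
        self.total = 0

    def add(self, amount):
        self.total += amount


class Histogram:
    """Keeps a per-request distribution using upper-inclusive buckets."""

    def __init__(self, boundaries):
        self.boundaries = list(boundaries)
        self.bucket_counts = [0] * (len(self.boundaries) + 1)
        self.count = 0
        self.sum = 0

    def record(self, value):
        # bisect_left places a value equal to a boundary in that boundary's bucket.
        self.bucket_counts[bisect_left(self.boundaries, value)] += 1
        self.count += 1
        self.sum += value


counter = Counter()
histogram = Histogram([10, 100, 1000])
for tokens in (8, 120, 95):  # tokens used by three separate requests
    counter.add(tokens)
    histogram.record(tokens)
```

The counter can only answer "how many tokens in total", while the histogram also preserves how many requests fell into each size range.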
## Test Plan
These are tests
RAG aka file search is implemented via the Responses API by specifying
the file-search tool. The backend implementation remains unchanged. This
PR merely removes the directly exposed API surface which allowed users
to directly perform searches from the client.
This facility is now available via the `client.vector_stores.search()`
OpenAI-compatible API.
The llama-stack-client now uses `/v1/openai/v1/models` which returns
OpenAI-compatible model objects with 'id' and 'custom_metadata' fields
instead of the Resource-style 'identifier' field. Updated api_recorder
to handle the new endpoint and modified tests to access model metadata
appropriately. Deleted stale model recordings for re-recording.
**NOTE: CI will be red on this one since it is dependent on
https://github.com/llamastack/llama-stack-client-python/pull/291/files
landing. I verified locally that it is green.**
# What does this PR do?
This API hasn't received any traction and close to zero interest from
the community. Let's revisit in the future if things change.
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
We need to remove `/v1/openai/v1` paths shortly. There is one trouble --
our current `/v1/openai/v1/models` endpoint provides different data than
`/v1/models`. Unfortunately our tests target the latter (llama-stack
customized) behavior. We need to get to true OpenAI compatibility.
This is step 1: adding `custom_metadata` field to `OpenAIModel` that
includes all the extra stuff we add in the native `/v1/models` response.
This can be extracted on the consumer end by looking at
`__pydantic_extra__` or other similar fields.
This PR:
- Adds a `custom_metadata` field to the `OpenAIModel` class in
`src/llama_stack/apis/models/models.py`
- Modifies `openai_list_models()` in
`src/llama_stack/core/routing_tables/models.py` to populate
`custom_metadata`
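A rough sketch of the mapping follows, using a dataclass in place of the real Pydantic `OpenAIModel`. The `to_openai_model` helper, the default values, and the example metadata keys are assumptions; only the `id`, `custom_metadata`, and `identifier` field names come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OpenAIModel:
    id: str
    object: str = "model"
    created: int = 0  # placeholder default for the sketch
    owned_by: str = "llama_stack"  # placeholder default for the sketch
    custom_metadata: Optional[dict] = None


def to_openai_model(native: dict) -> OpenAIModel:
    """Hypothetical helper: move the native extras under custom_metadata."""
    extras = {key: value for key, value in native.items() if key != "identifier"}
    return OpenAIModel(id=native["identifier"], custom_metadata=extras)


# Example native entry; keys other than "identifier" are illustrative.
native_model = {
    "identifier": "ollama/llama3",
    "model_type": "llm",
    "provider_id": "ollama",
}
model = to_openai_model(native_model)
model_type = (model.custom_metadata or {}).get("model_type")
```

Consumers that previously read top-level native fields would read them from `custom_metadata` instead.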
Next Steps
1. Update stainless client to use `/v1/openai/v1/models` instead of
`/v1/models`
2. Migrate tests to read from `custom_metadata`
3. Remove `/v1/openai/v1/` prefix entirely and consolidate to single
`/v1/models` endpoint