# What does this PR do?
Fixes: https://github.com/llamastack/llama-stack/issues/3806
- Remove all custom telemetry core tooling
- Remove telemetry that is captured by automatic instrumentation already
- Migrate telemetry to use OpenTelemetry libraries to capture telemetry
data important to Llama Stack that is not captured by automatic
instrumentation
- Keep our telemetry implementation simple, maintainable, and
standards-compliant unless we have a clear need to customize or add
complexity
## Test Plan
This tracks what telemetry data we care about in Llama Stack currently
(no new data), to make sure nothing important got lost in the migration.
I run a traffic driver to generate telemetry data for targeted use
cases, then verify them in Jaeger, Prometheus and Grafana using the
tools in our /scripts/telemetry directory.
### Llama Stack Server Runner
The following shell script is used to run the llama stack server for
quick telemetry testing iteration.
```sh
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME="llama-stack-server"
export OTEL_SPAN_PROCESSOR="simple"
export OTEL_EXPORTER_OTLP_TIMEOUT=1
export OTEL_BSP_EXPORT_TIMEOUT=1000
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
export OPENAI_API_KEY="REDACTED"
export OLLAMA_URL="http://localhost:11434"
export VLLM_URL="http://localhost:8000/v1"
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack run starter
```
### Test Traffic Driver
This Python script (`main.py`, run by the shell commands below) drives
traffic to the Llama Stack server, which sends telemetry to locally
hosted instances of the OTLP collector, Grafana, Prometheus, and Jaeger.
```sh
export OTEL_SERVICE_NAME="openai-client"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"
export GITHUB_TOKEN="REDACTED"
export MLFLOW_TRACKING_URI="http://127.0.0.1:5001"
uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument python main.py
```
```python
import os

import requests
from openai import OpenAI


def main():
    github_token = os.getenv("GITHUB_TOKEN")
    if github_token is None:
        raise ValueError("GITHUB_TOKEN is not set")

    # The OpenAI client points at the Llama Stack server
    client = OpenAI(
        api_key="fake",
        base_url="http://localhost:8321/v1/",
    )

    # Non-streaming chat completion (OpenAI provider)
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
    )
    print("Sync response: ", response.choices[0].message.content)

    # Streaming chat completion with usage reporting enabled
    streaming_response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
        stream_options={"include_usage": True},
    )
    print("Streaming response: ", end="", flush=True)
    for chunk in streaming_response:
        if chunk.usage is not None:
            print("Usage: ", chunk.usage)
        # Skip chunks without text so that "None" is never printed
        delta = chunk.choices[0].delta if chunk.choices else None
        if delta is not None and delta.content is not None:
            print(delta.content, end="", flush=True)
    print()

    # Ollama provider
    ollama_response = client.chat.completions.create(
        model="ollama/llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": "How are you doing today?"}],
    )
    print("Ollama response: ", ollama_response.choices[0].message.content)

    # vLLM provider
    vllm_response = client.chat.completions.create(
        model="vllm/Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "How are you doing today?"}],
    )
    print("VLLM response: ", vllm_response.choices[0].message.content)

    # Responses API: list the tools exposed by the MCP server
    responses_list_tools_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "What tools are available?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses list tools response: ", responses_list_tools_response.output_text)

    # Responses API: trigger an MCP tool call
    responses_tool_call_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "How many repositories does the token have access to?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses tool call response: ", responses_tool_call_response.output_text)

    # make shield call using http request until the client version error is resolved
    llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY")
    base_url = "http://localhost:8321/v1/"
    shield_id = "llama-guard-ollama"
    shields_url = f"{base_url}safety/run-shield"
    headers = {
        "Authorization": f"Bearer {llama_stack_api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "shield_id": shield_id,
        "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}],
        "params": {},
    }
    shields_response = requests.post(shields_url, json=payload, headers=headers)
    shields_response.raise_for_status()
    print("risk assessment response: ", shields_response.json())


if __name__ == "__main__":
    main()
```
### Span Data
#### Inference
| Value | Location | Content | Test Cases | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Input Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Output Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Completion Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt | Client | string | Any Inference Provider, responses | Auto Instrument | Working, no responses | None |
#### Safety
| Value | Location | Content | Testing | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [Shield ID](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Metadata](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Messages](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Response](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
| [Status](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not Following Semconv |
#### Remote Tool Listing & Execution
| Value | Location | Content | Testing | Handled By | Status | Notes |
| ----- | :---: | :---: | :---: | :---: | :---: | :---: |
| Tool name | Server | string | Tool call occurs | Custom Code | Working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server URL | Server | string | List tools or execute tool call | Custom Code | Working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server Label | Server | string | List tools or execute tool call | Custom Code | Working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| mcp\_list\_tools\_id | Server | string | List tools | Custom Code | Working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
### Metrics
- Prompt and Completion Token histograms ✅
- Updated the Grafana dashboard to support the OTEL semantic conventions
for tokens
### Observations
* sqlite spans get orphaned from the completions endpoint
  * Known OTEL issue. The recommended workaround is to disable sqlite
    instrumentation, since it is double wrapped and already covered by
    sqlalchemy. This is covered in the documentation.
    ```shell
    export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
    ```
* Responses API instrumentation is
[missing](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/3436)
in OpenTelemetry for OpenAI clients, even with traceloop or openllmetry
* Upstream issues in opentelemetry-python-contrib
* A span is created for each streaming response and covers every chunk,
so very large spans get created. This is not ideal, but it is the
intended behavior
* MCP telemetry needs to be updated to follow semantic conventions. We
can probably use a library for this and handle it in a separate issue.
### Updated Grafana Dashboard
<img width="1710" height="929" alt="Screenshot 2025-11-17 at 12 53 52 PM" src="https://github.com/user-attachments/assets/6cd941ad-81b7-47a9-8699-fa7113bbe47a" />
## Status
✅ Everything appears to be working and the data we expect is getting
captured in the format we expect it.
## Follow Ups
1. Make tool calling spans follow semconv and capture more data
   1. Consider using existing tracing library
2. Make shield spans follow semconv
3. Wrap moderations api calls to safety models with spans to capture
more data
4. Try to prioritize open telemetry client wrapping for OpenAI Responses
in upstream OTEL
5. These changes would break the telemetry tests, which are currently
disabled. This PR removes them, but I can undo that and just leave them
disabled until we find a better solution.
6. Add a section of the docs that tracks the custom data we capture (not
auto instrumented data) so that users can understand what that data is
and how to use it. Commit those changes to the OTEL-gen_ai SIG if
possible as well. Here is an
[example](https://opentelemetry.io/docs/specs/semconv/gen-ai/aws-bedrock/)
of how bedrock handles it.
# Fix for Issue #3797
## Problem
Vector store search failed with Pydantic ValidationError when chunk
metadata contained list-type values.
**Error:**
```
ValidationError: 3 validation errors for VectorStoreSearchResponse
attributes.tags.str: Input should be a valid string
attributes.tags.float: Input should be a valid number
attributes.tags.bool: Input should be a valid boolean
```
**Root Cause:**
- `Chunk.metadata` accepts `dict[str, Any]` (any type allowed)
- `VectorStoreSearchResponse.attributes` requires `dict[str, str | float
| bool]` (primitives only)
- Direct assignment at line 641 caused validation failure for
non-primitive types
## Solution
Added a utility function that sanitizes metadata down to primitive types
before creating the search response.
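A minimal sketch of what such a sanitizer might look like. The function name is borrowed from the unit test listed under Testing; the body, the `", "` separator, and the choice to drop unsupported values (dicts, `None`) are assumptions for illustration, not the merged implementation.

```python
from typing import Any, Dict, Union

Primitive = Union[str, float, bool]


def sanitize_metadata_for_attributes(metadata: Dict[str, Any]) -> Dict[str, Primitive]:
    """Reduce arbitrary chunk metadata to str | float | bool values (sketch)."""
    sanitized: Dict[str, Primitive] = {}
    for key, value in metadata.items():
        if isinstance(value, bool):
            # Check bool before int/float: bool is a subclass of int in Python
            sanitized[key] = value
        elif isinstance(value, (int, float)):
            sanitized[key] = float(value)
        elif isinstance(value, str):
            sanitized[key] = value
        elif isinstance(value, list):
            # Assumption: lists are flattened to a comma-separated string
            sanitized[key] = ", ".join(str(item) for item in value)
        # Assumption: any other type (dict, None, ...) is dropped
    return sanitized
```

With this sketch, `{"tags": ["transformers", "gpu"]}` becomes `{"tags": "transformers, gpu"}`, which satisfies the `dict[str, str | float | bool]` constraint on `attributes`.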
## Impact
**Fixed:**
- Vector search works with list metadata (e.g., `tags: ["transformers",
"gpu"]`)
- Lists become searchable as comma-separated strings
- No ValidationError on search responses
**Preserved:**
- Full metadata still available in `VectorStoreContent.metadata`
- No API schema changes
- Backward compatible with existing primitive metadata
**Affected:**
All vector store providers using `OpenAIVectorStoreMixin`: FAISS,
Chroma, Qdrant, Milvus, Weaviate, PGVector, SQLite-vec
## Testing
`tests/unit/providers/vector_io/test_vector_utils.py::test_sanitize_metadata_for_attributes`
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Initial PR against #4123
Adds `parallel_tool_calls` spec to Responses API and basic initial
implementation where no more than one function call is generated when
set to `False`.
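The behavior described above can be sketched as a small filter over the function calls a model produced in one turn. `FunctionCall` and `limit_function_calls` are hypothetical names used only for illustration; the actual implementation in the Responses provider may look different.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FunctionCall:
    # Hypothetical stand-in for a function call emitted by the model
    name: str
    arguments: str


def limit_function_calls(calls: List[FunctionCall], parallel_tool_calls: bool = True) -> List[FunctionCall]:
    """Return at most one function call when parallel_tool_calls is False."""
    if parallel_tool_calls:
        return list(calls)
    # Keep only the first call; later calls are dropped in this sketch
    return list(calls[:1])
```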
## Test Plan
* Unit tests have been added to verify no more than one function call is
generated.
* A followup PR will verify passing through `parallel_tool_calls` to
providers.
* A followup PR will address verification and/or implementation of
incremental function calling across multiple conversational turns.
---------
Signed-off-by: Anastas Stoyanovsky <astoyano@redhat.com>
FastAPI generator now only unwraps body params explicitly marked with
Body(embed=False) so the /eval run_eval schema once again exposes
RunEvalRequest, matching our integration tests and the server's request
parsing.
Regenerated the OpenAPI specs to capture the restored wrapper.
CI on the Stainless preview builds should be green.
# What does this PR do?
This replaces the legacy "pyopenapi + strong_typing" pipeline with a
FastAPI-backed generator that has an explicit schema registry inside
`llama_stack_api`. The key changes:
1. **New generator architecture.** FastAPI now builds the OpenAPI schema
directly from the real routes, while helper modules
(`schema_collection`, `endpoints`, `schema_transforms`, etc.)
post-process the result. The old pyopenapi stack and its strong_typing
helpers are removed entirely, so we no longer rely on fragile AST
analysis or top-level import side effects.
2. **Schema registry in `llama_stack_api`.** `schema_utils.py` keeps a
`SchemaInfo` record for every `@json_schema_type`, `register_schema`,
and dynamically created request model. The OpenAPI generator and other
tooling query this registry instead of scanning the package tree,
producing deterministic names (e.g., `{MethodName}Request`), capturing
all optional/nullable fields, and making schema discovery testable. A
new unit test covers the registry behavior.
3. **Regenerated specs + CI alignment.** All docs/Stainless specs are
regenerated from the new pipeline, so optional/nullable fields now match
reality (expect the API Conformance workflow to report breaking
changes—this PR establishes the new baseline). The workflow itself is
back to the stock oasdiff invocation so future regressions surface
normally.
*Conformance will be RED on this PR; we choose to accept the
deviations.*
## Test Plan
- `uv run pytest tests/unit/server/test_schema_registry.py`
- `uv run python -m scripts.openapi_generator.main docs/static`
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Mark all register_* / unregister_* APIs as deprecated across models,
shields, tool groups, datasets, benchmarks, and scoring functions. This
is the first step toward moving resource mutations to an `/admin`
namespace as outlined in
https://github.com/llamastack/llama-stack/issues/3809#issuecomment-3492931585.
The deprecation flag will be reflected in the OpenAPI schema to warn API
users that these endpoints are being phased out. Next step will be
implementing the `/admin` route namespace for these resource management
operations.
- `register_model` / `unregister_model`
- `register_shield` / `unregister_shield`
- `register_tool_group` / `unregister_toolgroup`
- `register_dataset` / `unregister_dataset`
- `register_benchmark` / `unregister_benchmark`
- `register_scoring_function` / `unregister_scoring_function`
- Removes the deprecated agents (sessions and turns) API that was marked
alpha in 0.3.0
- Cleans up unused imports and orphaned types after the API removal
- Removes `SessionNotFoundError` and `AgentTurnInputType` which are no
longer needed
The agents API is completely superseded by the Responses + Conversations
APIs, and the client SDK Agent class already uses those implementations.
Corresponding client-side PR:
https://github.com/llamastack/llama-stack-client-python/pull/295
This patch ensures that if max tokens is not defined, it is set to
`None` instead of 0 when calling `openai_chat_completion`. This way,
providers (like Gemini) that cannot handle `max_tokens = 0` will not
fail.
Issue: #3666
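A small sketch of the normalization described above. `build_chat_completion_params` is a hypothetical helper, and treating an explicit 0 the same as "unset" is an assumption made here for illustration.

```python
from typing import Any, Dict, List, Optional


def build_chat_completion_params(
    model: str,
    messages: List[Dict[str, str]],
    max_tokens: Optional[int] = None,
) -> Dict[str, Any]:
    """Build request params, passing max_tokens=None when no limit was set."""
    return {
        "model": model,
        "messages": messages,
        # Assumption: 0 means "not defined", so it is forwarded as None
        "max_tokens": max_tokens if max_tokens else None,
    }
```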
# What does this PR do?
As discussed on Discord, we do not need to reinvent the wheel for
telemetry. Instead we'll lean into the canonical OTEL stack.
Logs/traces/metrics will still be sent via OTEL - they just won't be
stored on or queried through Stack.
This is the first of many PRs to remove telemetry API from Stack.
1) removed webmethod decorators to remove from API spec
2) removed tests as @iamemilio is adding them on otel directly.
## Test Plan
This is a sweeping change to clean up some gunk around our "Tool"
definitions.
First, we had two types, `Tool` and `ToolDef`. The first of these was a
"Resource" type for the registry, but we stopped registering tools
inside the Registry long ago (and only registered ToolGroups). The
latter was for specifying tools for the Agents API. This PR removes the
former and adds an optional `toolgroup_id` field to the latter.
Secondly, as pointed out by @bbrowning in
https://github.com/llamastack/llama-stack/pull/3003#issuecomment-3245270132,
we were doing a lossy conversion from a full JSON schema from the MCP
tool specification into our ToolDefinition to send it to the model.
There is no necessity to do this -- we ourselves aren't doing any
execution at all but merely passing it to the chat completions API which
supports this. By doing this (and by doing it poorly), we encountered
limitations like not supporting array items, or not resolving $refs,
etc.
To fix this, we replaced the `parameters` field with `{ input_schema,
output_schema }`, both of which can be full-blown JSON schemas.
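As a rough illustration of the new shape, the sketch below models a `ToolDef` carrying full JSON schemas and passes the input schema through untouched to a chat-completions style tool entry. Only `input_schema`, `output_schema`, and `toolgroup_id` come from the description above; the other fields and the `to_chat_completions_tool` helper are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ToolDef:
    name: str                                      # assumed field
    description: Optional[str] = None              # assumed field
    input_schema: Optional[Dict[str, Any]] = None  # full JSON schema
    output_schema: Optional[Dict[str, Any]] = None
    toolgroup_id: Optional[str] = None


def to_chat_completions_tool(tool: ToolDef) -> Dict[str, Any]:
    """Forward the JSON schema as-is instead of converting it lossily."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description or "",
            # Array items and $refs survive because nothing is rewritten
            "parameters": tool.input_schema or {"type": "object", "properties": {}},
        },
    }
```

Because the schema is forwarded rather than re-modelled, constructs such as `items` on arrays or `$ref` pointers need no special handling on the Stack side.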
Finally, there were some types in our llama-related chat format
conversion which needed some cleanup. We are taking this opportunity to
clean those up.
This PR is a substantial breaking change to the API. However, given our
window for introducing breaking changes, this suits us just fine. I will
be landing a concurrent `llama-stack-client` change as well since API
shapes are changing.
# What does this PR do?
Adds supplementary static content to root API spec pages. This is useful for giving context behind a specific API group, adding information on supported features or work in progress, etc.
This PR introduces supplementary information for Agents (experimental, deprecated) and Responses (stable) APIs.
## Test Plan
Documentation server renders rich static content for the Agents API group:

# What does this PR do?
First step towards cleaning up the API reference section of the docs.
- Separates API reference into 3 sections: stable (`v1`), experimental (`v1alpha` and `v1beta`), and deprecated (`deprecated=True`)
- Each section is accessible via the dropdown menu and `docs/api-overview`
<img width="1237" height="321" alt="Screenshot 2025-09-30 at 5 47 30 PM" src="https://github.com/user-attachments/assets/fe0e498c-b066-46ed-a48e-4739d3b6724c" />
<img width="860" height="510" alt="Screenshot 2025-09-30 at 5 47 49 PM" src="https://github.com/user-attachments/assets/a92a8d8c-94bf-42d5-9f5b-b47bb2b14f9c" />
- Deprecated APIs: Added styling to the sidebar, and a notice on the endpoint pages
<img width="867" height="428" alt="Screenshot 2025-09-30 at 5 47 43 PM" src="https://github.com/user-attachments/assets/9e6e050d-c782-461b-8084-5ff6496d7bd9" />
Closes #3628
TODO in follow-up PRs:
- Add the ability to annotate API groups with supplementary content (so we can have longer descriptions of complex APIs like Responses)
- Clean up docstrings to show API endpoints (or short semantic titles) in the sidebar
## Test Plan
- Local testing
- Made sure API conformance test still passes