llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-03 01:48:05 +00:00

Author	SHA1	Message	Date
Roy Belio	343cb3248c	Merge `bd4a77ee2d` into `4237eb4aaa`	2025-12-03 01:04:09 +00:00
Emilio Garcia	7da733091a	feat!: Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation (#4127 ) # What does this PR do? Fixes: https://github.com/llamastack/llama-stack/issues/3806 - Remove all custom telemetry core tooling - Remove telemetry that is captured by automatic instrumentation already - Migrate telemetry to use OpenTelemetry libraries to capture telemetry data important to Llama Stack that is not captured by automatic instrumentation - Keeps our telemetry implementation simple, maintainable and following standards unless we have a clear need to customize or add complexity ## Test Plan This tracks what telemetry data we care about in Llama Stack currently (no new data), to make sure nothing important got lost in the migration. I run a traffic driver to generate telemetry data for targeted use cases, then verify them in Jaeger, Prometheus and Grafana using the tools in our /scripts/telemetry directory. ### Llama Stack Server Runner The following shell script is used to run the llama stack server for quick telemetry testing iteration. ```sh export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318" export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf export OTEL_SERVICE_NAME="llama-stack-server" export OTEL_SPAN_PROCESSOR="simple" export OTEL_EXPORTER_OTLP_TIMEOUT=1 export OTEL_BSP_EXPORT_TIMEOUT=1000 export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3" export OPENAI_API_KEY="REDACTED" export OLLAMA_URL="http://localhost:11434" export VLLM_URL="http://localhost:8000/v1" uv pip install opentelemetry-distro opentelemetry-exporter-otlp uv run opentelemetry-bootstrap -a requirements \| uv pip install --requirement - uv run opentelemetry-instrument llama stack run starter ``` ### Test Traffic Driver This python script drives traffic to the llama stack server, which sends telemetry to a locally hosted instance of the OTLP collector, Grafana, Prometheus, and Jaeger. ```sh export OTEL_SERVICE_NAME="openai-client" export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318" export GITHUB_TOKEN="REDACTED" export MLFLOW_TRACKING_URI="http://127.0.0.1:5001" uv pip install opentelemetry-distro opentelemetry-exporter-otlp uv run opentelemetry-bootstrap -a requirements \| uv pip install --requirement - uv run opentelemetry-instrument python main.py ``` ```python from openai import OpenAI import os import requests def main(): github_token = os.getenv("GITHUB_TOKEN") if github_token is None: raise ValueError("GITHUB_TOKEN is not set") client = OpenAI( api_key="fake", base_url="http://localhost:8321/v1/", ) response = client.chat.completions.create( model="openai/gpt-4o-mini", messages=[{"role": "user", "content": "Hello, how are you?"}] ) print("Sync response: ", response.choices[0].message.content) streaming_response = client.chat.completions.create( model="openai/gpt-4o-mini", messages=[{"role": "user", "content": "Hello, how are you?"}], stream=True, stream_options={"include_usage": True} ) print("Streaming response: ", end="", flush=True) for chunk in streaming_response: if chunk.usage is not None: print("Usage: ", chunk.usage) if chunk.choices and chunk.choices[0].delta is not None: print(chunk.choices[0].delta.content, end="", flush=True) print() ollama_response = client.chat.completions.create( model="ollama/llama3.2:3b-instruct-fp16", messages=[{"role": "user", "content": "How are you doing today?"}] ) print("Ollama response: ", ollama_response.choices[0].message.content) vllm_response = client.chat.completions.create( model="vllm/Qwen/Qwen3-0.6B", messages=[{"role": "user", "content": "How are you doing today?"}] ) print("VLLM response: ", vllm_response.choices[0].message.content) responses_list_tools_response = client.responses.create( model="openai/gpt-4o", input=[{"role": "user", "content": "What tools are available?"}], tools=[ { "type": "mcp", "server_label": "github", "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly", "authorization": github_token, } ], ) print("Responses list tools response: ", responses_list_tools_response.output_text) responses_tool_call_response = client.responses.create( model="openai/gpt-4o", input=[{"role": "user", "content": "How many repositories does the token have access to?"}], tools=[ { "type": "mcp", "server_label": "github", "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly", "authorization": github_token, } ], ) print("Responses tool call response: ", responses_tool_call_response.output_text) # make shield call using http request until the client version error is resolved llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY") base_url = "http://localhost:8321/v1/" shield_id = "llama-guard-ollama" shields_url = f"{base_url}safety/run-shield" headers = { "Authorization": f"Bearer {llama_stack_api_key}", "Content-Type": "application/json" } payload = { "shield_id": shield_id, "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}], "params": {} } shields_response = requests.post(shields_url, json=payload, headers=headers) shields_response.raise_for_status() print("risk assessment response: ", shields_response.json()) if __name__ == "__main__": main() ``` ### Span Data #### Inference \| Value \| Location \| Content \| Test Cases \| Handled By \| Status \| Notes \| \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| \| Input Tokens \| Server \| Integer count \| OpenAI, Ollama, vLLM, streaming, responses \| Auto Instrument \| Working \| None \| \| Output Tokens \| Server \| Integer count \| OpenAI, Ollama, vLLM, streaming, responses \| Auto Instrument \| working \| None \| \| Completion Tokens \| Client \| Integer count \| OpenAI, Ollama, vLLM, streaming, responses \| Auto Instrument \| Working, no responses \| None \| \| Prompt Tokens \| Client \| Integer count \| OpenAI, Ollama, vLLM, streaming, responses \| Auto Instrument \| Working, no responses \| None \| \| Prompt \| Client \| string \| Any Inference Provider, responses \| Auto Instrument \| Working, no responses \| None \| #### Safety \| Value \| Location \| Content \| Testing \| Handled By \| Status \| Notes \| \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| \| [Shield ID](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| \| [Metadata](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| JSON string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| \| [Messages](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| JSON string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| \| [Response](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| \| [Status](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| #### Remote Tool Listing & Execution \| Value \| Location \| Content \| Testing \| Handled By \| Status \| Notes \| \| ----- \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| \| Tool name \| server \| string \| Tool call occurs \| Custom Code \| working \| [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) \| \| Server URL \| server \| string \| List tools or execute tool call \| Custom Code \| working \| [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) \| \| Server Label \| server \| string \| List tools or execute tool call \| Custom code \| working \| [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) \| \| mcp\_list\_tools\_id \| server \| string \| List tools \| Custom code \| working \| [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) \| ### Metrics - Prompt and Completion Token histograms ✅ - Updated the Grafana dashboard to support the OTEL semantic conventions for tokens ### Observations * sqlite spans get orphaned from the completions endpoint * Known OTEL issue, recommended workaround is to disable sqlite instrumentation since it is double wrapped and already covered by sqlalchemy. This is covered in documentation. ```shell export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3" ``` * Responses API instrumentation is [missing](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/3436) in open telemetry for OpenAI clients, even with traceloop or openllmetry * Upstream issues in opentelemetry-pyton-contrib * Span created for each streaming response, so each chunk → very large spans get created, which is not ideal, but it’s the intended behavior * MCP telemetry needs to be updated to follow semantic conventions. We can probably use a library for this and handle it in a separate issue. ### Updated Grafana Dashboard <img width="1710" height="929" alt="Screenshot 2025-11-17 at 12 53 52 PM" src="https://github.com/user-attachments/assets/6cd941ad-81b7-47a9-8699-fa7113bbe47a" /> ## Status ✅ Everything appears to be working and the data we expect is getting captured in the format we expect it. ## Follow Ups 1. Make tool calling spans follow semconv and capture more data 1. Consider using existing tracing library 2. Make shield spans follow semconv 3. Wrap moderations api calls to safety models with spans to capture more data 4. Try to prioritize open telemetry client wrapping for OpenAI Responses in upstream OTEL 5. This would break the telemetry tests, and they are currently disabled. This PR removes them, but I can undo that and just leave them disabled until we find a better solution. 6. Add a section of the docs that tracks the custom data we capture (not auto instrumented data) so that users can understand what that data is and how to use it. Commit those changes to the OTEL-gen_ai SIG if possible as well. Here is an [example](https://opentelemetry.io/docs/specs/semconv/gen-ai/aws-bedrock/) of how bedrock handles it.	2025-12-01 10:33:18 -08:00
Roy Belio	47bd994824	Merge branch 'main' into feat/gunicorn-production-server	2025-11-02 16:13:15 +02:00
Ashwin Bharambe	77c8bc6fa7	fix(ci): add back server:ci-tests to replay tests (#3976 ) Some checks failed SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 1s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 3s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 4s Details Pre-commit / pre-commit (push) Failing after 4s Details Python Package Build Test / build (3.13) (push) Failing after 5s Details Test External API and Providers / test-external (venv) (push) Failing after 6s Details Vector IO Integration Tests / test-matrix (push) Failing after 7s Details Unit Tests / unit-tests (3.13) (push) Failing after 8s Details API Conformance Tests / check-schema-compatibility (push) Successful in 15s Details Python Package Build Test / build (3.12) (push) Failing after 39s Details Unit Tests / unit-tests (3.12) (push) Failing after 40s Details UI Tests / ui-tests (22) (push) Successful in 42s Details It is useful for local debugging. If both server and docker are failing, you can just run server locally to debug which is much easier to do.	2025-10-30 11:02:59 -07:00
Roy Belio	a8bc99408c	fix: simplify telemetry test mode detection The integration-tests.sh script already sets LLAMA_STACK_TEST_STACK_CONFIG_TYPE based on the stack config. Our custom detection logic was unnecessary and potentially interfering. Revert to relying on the environment variable set by the test script. The LLAMA_STACK_DISABLE_GUNICORN environment variable is still set correctly when stack_mode == 'server', which happens for both server: and docker: configs.	2025-10-30 19:03:57 +02:00
Roy Belio	c8f82cad6a	fix: detect docker/server mode in telemetry tests to properly disable Gunicorn The telemetry fixture was only checking LLAMA_STACK_TEST_STACK_CONFIG_TYPE environment variable, which defaults to 'library_client'. In CI, tests run with --stack-config=docker:ci-tests, which wasn't being detected as server mode. This commit checks the --stack-config argument and treats both 'server:' and 'docker:' prefixes as server mode, ensuring LLAMA_STACK_DISABLE_GUNICORN is set when needed for telemetry span collection.	2025-10-30 18:52:31 +02:00
ehhuang	5e20938832	fix: remove LLAMA_STACK_TEST_FORCE_SERVER_RESTART setting in fixture (#3982 ) # What does this PR do? this is meant to be a manual flag ## Test Plan CI	2025-10-30 09:13:04 -07:00
Roy Belio	3e1d0060c1	fix: disable Gunicorn in telemetry tests to fix multi-process telemetry collection Telemetry tests use an OTLP collector that expects single-process telemetry spans. Gunicorn's multi-process architecture spawns multiple workers, each with separate telemetry instrumentation, preventing the test collector from capturing all spans. This commit adds LLAMA_STACK_DISABLE_GUNICORN environment variable support and sets it in telemetry test configuration to ensure single-process Uvicorn is used during tests while maintaining production multi-process behavior. Fixes failing tests: - test_streaming_chunk_count - test_telemetry_format_completeness	2025-10-30 18:01:47 +02:00
ehhuang	1aa8979050	test: enable telemetry tests in server mode (#3927 ) # What does this PR do? - added a server-based test OLTP collector ## Test Plan CI	2025-10-28 16:33:48 -07:00
ehhuang	8265d4efc8	chore(telemetry): code cleanup (#3897 ) Some checks failed SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 3s Details Python Package Build Test / build (3.12) (push) Failing after 2s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 4s Details Python Package Build Test / build (3.13) (push) Failing after 3s Details Test External API and Providers / test-external (venv) (push) Failing after 4s Details Vector IO Integration Tests / test-matrix (push) Failing after 6s Details Unit Tests / unit-tests (3.12) (push) Failing after 4s Details Unit Tests / unit-tests (3.13) (push) Failing after 4s Details API Conformance Tests / check-schema-compatibility (push) Successful in 14s Details UI Tests / ui-tests (22) (push) Successful in 43s Details Pre-commit / pre-commit (push) Successful in 1m35s Details # What does this PR do? Clean up telemetry code since the telemetry API has been remove. - moved telemetry files out of providers to core - removed from Api ## Test Plan ❯ OTEL_SERVICE_NAME=llama_stack OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 uv run llama stack run starter ❯ curl http://localhost:8321/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "model": "openai/gpt-4o-mini", "messages": [ { "role": "user", "content": "Hello!" } ] }' -> verify traces in Grafana CI	2025-10-23 23:13:02 -07:00
Emilio Garcia	943558af36	test(telemetry): Telemetry Tests (#3805 ) Some checks failed SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 0s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Python Package Build Test / build (3.12) (push) Failing after 10s Details Python Package Build Test / build (3.13) (push) Failing after 10s Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 14s Details Unit Tests / unit-tests (3.13) (push) Failing after 11s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 20s Details Unit Tests / unit-tests (3.12) (push) Failing after 16s Details Test External API and Providers / test-external (venv) (push) Failing after 28s Details Vector IO Integration Tests / test-matrix (push) Failing after 30s Details API Conformance Tests / check-schema-compatibility (push) Successful in 38s Details UI Tests / ui-tests (22) (push) Successful in 1m32s Details Pre-commit / pre-commit (push) Successful in 3m16s Details # What does this PR do? Adds a test and a standardized way to build future tests out for telemetry in llama stack. Contributes to https://github.com/llamastack/llama-stack/issues/3806 ## Test Plan This is the test plan 😎	2025-10-17 10:43:33 -07:00

11 commits