llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-03 09:53:45 +00:00

backward_compat	feat: add backward compatibility tests for run.yaml (#3952 )	2025-10-28 21:51:56 -07:00
common	feat(tests): enable MCP tests in server mode (#4146 )	2025-11-13 07:23:23 -08:00
containers	refactor: replace default all-MiniLM-L6-v2 embedding model by nomic-embed-text-v1.5 in Llama Stack (#3183 )	2025-10-14 10:44:20 -04:00
external	feat: split API and provider specs into separate llama-stack-api pkg (#3895 )	2025-11-13 11:51:17 -08:00
integration	feat!: Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation (#4127 )	2025-12-01 10:33:18 -08:00
unit	feat!: Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation (#4127 )	2025-12-01 10:33:18 -08:00
__init__.py	refactor(test): introduce --stack-config and simplify options (#1404 )	2025-03-05 17:02:02 -08:00
README.md	feat(tests): introduce a test "suite" concept to encompass dirs, options (#3339 )	2025-09-05 13:58:49 -07:00

README.md

There are two obvious types of tests:

Type	Location	Purpose
Unit	`tests/unit/`	Fast, isolated component testing
Integration	`tests/integration/`	End-to-end workflows with record-replay

Both have their place. For unit tests, keep mocks to a minimum and rely on "fakes" instead, because mocks are too brittle. In either case, tests must be fast and reliable.
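To make the fake-versus-mock distinction concrete, here is a minimal sketch. `FakeKVStore` and `remember_model` are hypothetical names invented for this example and do not come from the Llama Stack codebase. The point is only that a fake has real (if simplified) behavior, so the test asserts on resulting state rather than on which calls were made.

```python
class FakeKVStore:
    """In-memory stand-in for a key-value store (hypothetical interface)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


def remember_model(store, alias, model_id):
    """Hypothetical code under test: save a model alias and read it back."""
    store.set(f"alias:{alias}", model_id)
    return store.get(f"alias:{alias}")


def test_remember_model():
    store = FakeKVStore()
    assert remember_model(store, "small", "llama3.2:3b") == "llama3.2:3b"
    # The fake really stores data, so we check state, not call counts.
    assert store.get("alias:small") == "llama3.2:3b"
```

A mock-based version would typically assert that `set` was called with specific arguments, which breaks as soon as the implementation changes how it talks to the store.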

Record-replay for integration tests

Testing AI applications end-to-end creates some challenges:

- API costs accumulate quickly during development and CI
- Non-deterministic responses make tests unreliable
- Multiple providers require testing the same logic across different APIs

Our solution: record real API responses once, then replay them for fast, deterministic tests. This is better than mocking because AI APIs have complex response structures and streaming behavior, and mocks can miss edge cases that real APIs exhibit. A single test can also exercise the underlying APIs in several complex ways, which makes it very hard to mock.

This gives you:

- Cost control - No repeated API calls during development
- Speed - Instant test execution with cached responses
- Reliability - Consistent results regardless of external service state
- Provider coverage - Same tests work across OpenAI, Anthropic, local models, etc.
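The record-once, replay-later idea can be sketched in a few lines. This is an illustration only, not Llama Stack's actual recording code: the `RecordReplay` class, the hashing scheme, and the one-JSON-file-per-request layout are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class RecordReplay:
    """Toy record/replay wrapper around a callable API (hypothetical design)."""

    def __init__(self, directory, mode, call_api=None):
        self.directory = Path(directory)
        self.mode = mode          # "record" or "replay"
        self.call_api = call_api  # only needed in record mode

    def _path(self, request):
        # Key each recording by a hash of the canonicalized request.
        canonical = json.dumps(request, sort_keys=True)
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def send(self, request):
        path = self._path(request)
        if self.mode == "record":
            response = self.call_api(request)  # the real (costly) call
            self.directory.mkdir(parents=True, exist_ok=True)
            path.write_text(json.dumps(response))
            return response
        if not path.exists():
            raise LookupError("no recording for this request; re-record it")
        return json.loads(path.read_text())


def demo():
    """Record one response, then replay it without touching the API."""
    calls = []

    def fake_api(request):
        calls.append(request)
        return {"text": "hello"}

    request = {"model": "example", "prompt": "hi"}
    with tempfile.TemporaryDirectory() as tmp:
        recorded = RecordReplay(tmp, "record", fake_api).send(request)
        replayed = RecordReplay(tmp, "replay").send(request)
    return recorded, replayed, len(calls)
```

In replay mode the wrapped API is never called, which is where the cost, speed, and determinism benefits come from. A request without a recording fails loudly instead of silently hitting the network.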

Testing Quick Start

You can run the unit tests with:

uv run --group unit pytest -sv tests/unit/

For running integration tests, you must provide a few things:

- A stack config. This is a pointer to a stack. You have a few ways to point to a stack:
  - `server:<config>` - automatically start a server with the given config (e.g., `server:starter`). This provides one-step testing by auto-starting the server if the port is available, or reusing an existing server if already running.
  - `server:<config>:<port>` - same as above but with a custom port (e.g., `server:starter:8322`)
  - a URL which points to a Llama Stack distribution server
  - a distribution name (e.g., `starter`) or a path to a `run.yaml` file
  - a comma-separated list of `api=provider` pairs, e.g. `inference=fireworks,safety=llama-guard,agents=meta-reference`. This is most useful for testing a single API surface.
- Any API keys you need to use should be set in the environment, or can be passed in with the `--env` option.
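The accepted stack config forms can be told apart mechanically. The sketch below is a hypothetical classifier written for illustration. The function name `classify_stack_config` and its return shape are assumptions, and the real test harness may parse `--stack-config` differently.

```python
def classify_stack_config(value):
    """Classify a --stack-config value into one of the documented forms."""
    if value.startswith("server:"):
        # server:<config> or server:<config>:<port>
        parts = value.split(":")
        port = int(parts[2]) if len(parts) > 2 else None
        return {"kind": "server", "config": parts[1], "port": port}
    if value.startswith(("http://", "https://")):
        # Checked before "=" because URLs may contain query strings.
        return {"kind": "url", "url": value}
    if "=" in value:
        # Comma-separated api=provider pairs.
        pairs = dict(item.split("=", 1) for item in value.split(","))
        return {"kind": "providers", "providers": pairs}
    # Anything else: a distribution name or a path to a run.yaml file.
    return {"kind": "distribution_or_path", "value": value}
```

For example, `classify_stack_config("server:starter:8322")` yields a `server` entry with port `8322`, while `classify_stack_config("inference=fireworks,safety=llama-guard")` yields a `providers` entry mapping each API to its provider.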

You can run the integration tests in replay mode with:

# Run all tests with existing recordings
uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter

Re-recording tests

Local Re-recording (Manual Setup Required)

If you want to re-record tests locally, you can do so with:

LLAMA_STACK_TEST_INFERENCE_MODE=record \
  uv run --group test \
  pytest -sv tests/integration/ --stack-config=starter -k "<appropriate test name>"

This will record new API responses and overwrite the existing recordings.


Be careful when re-recording. CI workflows assume a specific setup for running the replay-mode tests, so you must re-record the tests the same way the CI workflows do. This means:
- you need Ollama running and serving some specific models.
- you are using the `starter` distribution.

Remote Re-recording (Recommended)

For easier re-recording without local setup, use the automated recording workflow:

# Record tests for specific test subdirectories
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents,inference"

# Record with vision tests enabled
./scripts/github/schedule-record-workflow.sh --test-suite vision

# Record with specific provider
./scripts/github/schedule-record-workflow.sh --test-subdirs "agents" --test-provider vllm

This script:

- 🚀 Runs in GitHub Actions - no local Ollama setup required
- 🔍 Auto-detects your branch and associated PR
- 🍴 Works from forks - handles repository context automatically
- ✅ Commits recordings back to your branch

Prerequisites:

- GitHub CLI: brew install gh && gh auth login
- jq: brew install jq
- Your branch pushed to a remote
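Before scheduling a remote recording, it can help to confirm the command-line prerequisites are installed. The helper below is a small sketch, not part of the repository's scripts. The name `missing_tools` is made up for this example, and it only checks that `gh` and `jq` are on the `PATH`, not that `gh` is authenticated or that your branch is pushed.

```python
import shutil


def missing_tools(required=("gh", "jq"), which=shutil.which):
    """Return the required command-line tools that are not found on PATH."""
    return [tool for tool in required if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("gh and jq are both available.")
```

The `which` parameter is injectable only so the check can be exercised in tests without depending on the local machine.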

Supported providers: vllm, ollama

Next Steps

- Integration Testing Guide - Detailed usage and configuration
- Unit Testing Guide - Fast component testing