# What does this PR do?

Fixes: https://github.com/llamastack/llama-stack/issues/3806

- Remove all custom telemetry core tooling
- Remove telemetry that is already captured by automatic instrumentation
- Migrate telemetry to OpenTelemetry libraries to capture the data that matters to Llama Stack but is not covered by automatic instrumentation
- Keep our telemetry implementation simple, maintainable, and standards-based unless we have a clear need to customize or add complexity

## Test Plan

This tracks the telemetry data we currently care about in Llama Stack (no new data), to make sure nothing important got lost in the migration. I run a traffic driver to generate telemetry data for targeted use cases, then verify it in Jaeger, Prometheus, and Grafana using the tools in our `/scripts/telemetry` directory.

### Llama Stack Server Runner

The following shell script runs the llama stack server for quick telemetry testing iteration.

```sh
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_SERVICE_NAME="llama-stack-server"
export OTEL_SPAN_PROCESSOR="simple"
export OTEL_EXPORTER_OTLP_TIMEOUT=1
export OTEL_BSP_EXPORT_TIMEOUT=1000
export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
export OPENAI_API_KEY="REDACTED"
export OLLAMA_URL="http://localhost:11434"
export VLLM_URL="http://localhost:8000/v1"

uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument llama stack run starter
```

### Test Traffic Driver

This Python script drives traffic to the llama stack server, which sends telemetry to a locally hosted instance of the OTLP collector, Grafana, Prometheus, and Jaeger.
```sh
export OTEL_SERVICE_NAME="openai-client"
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318"
export GITHUB_TOKEN="REDACTED"
export MLFLOW_TRACKING_URI="http://127.0.0.1:5001"

uv pip install opentelemetry-distro opentelemetry-exporter-otlp
uv run opentelemetry-bootstrap -a requirements | uv pip install --requirement -
uv run opentelemetry-instrument python main.py
```

```python
import os

import requests
from openai import OpenAI


def main():
    github_token = os.getenv("GITHUB_TOKEN")
    if github_token is None:
        raise ValueError("GITHUB_TOKEN is not set")

    client = OpenAI(
        api_key="fake",
        base_url="http://localhost:8321/v1/",
    )

    # Non-streaming chat completion via the OpenAI provider
    response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
    )
    print("Sync response: ", response.choices[0].message.content)

    # Streaming chat completion with usage reporting enabled
    streaming_response = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        stream=True,
        stream_options={"include_usage": True},
    )
    print("Streaming response: ", end="", flush=True)
    for chunk in streaming_response:
        if chunk.usage is not None:
            print("Usage: ", chunk.usage)
        if chunk.choices and chunk.choices[0].delta is not None:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

    # Local providers: Ollama and vLLM
    ollama_response = client.chat.completions.create(
        model="ollama/llama3.2:3b-instruct-fp16",
        messages=[{"role": "user", "content": "How are you doing today?"}],
    )
    print("Ollama response: ", ollama_response.choices[0].message.content)

    vllm_response = client.chat.completions.create(
        model="vllm/Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": "How are you doing today?"}],
    )
    print("VLLM response: ", vllm_response.choices[0].message.content)

    # Responses API with an MCP tool server: list tools, then trigger a tool call
    responses_list_tools_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "What tools are available?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses list tools response: ", responses_list_tools_response.output_text)

    responses_tool_call_response = client.responses.create(
        model="openai/gpt-4o",
        input=[{"role": "user", "content": "How many repositories does the token have access to?"}],
        tools=[
            {
                "type": "mcp",
                "server_label": "github",
                "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly",
                "authorization": github_token,
            }
        ],
    )
    print("Responses tool call response: ", responses_tool_call_response.output_text)

    # Make the shield call using an HTTP request until the client version error is resolved
    llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY")
    base_url = "http://localhost:8321/v1/"
    shield_id = "llama-guard-ollama"
    shields_url = f"{base_url}safety/run-shield"
    headers = {
        "Authorization": f"Bearer {llama_stack_api_key}",
        "Content-Type": "application/json",
    }
    payload = {
        "shield_id": shield_id,
        "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}],
        "params": {},
    }
    shields_response = requests.post(shields_url, json=payload, headers=headers)
    shields_response.raise_for_status()
    print("risk assessment response: ", shields_response.json())


if __name__ == "__main__":
    main()
```

### Span Data

#### Inference

| Value | Location | Content | Test Cases | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Input Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Output Tokens | Server | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working | None |
| Completion Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt Tokens | Client | Integer count | OpenAI, Ollama, vLLM, streaming, responses | Auto Instrument | Working, no responses | None |
| Prompt | Client | string | Any Inference Provider, responses | Auto Instrument | Working, no responses | None |

#### Safety

| Value | Location | Content | Testing | Handled By | Status | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| [Shield ID](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| [Metadata](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| [Messages](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | JSON string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| [Response](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |
| [Status](ecdfecb9f0/src/llama_stack/core/telemetry/constants.py) | Server | string | Llama-guard shield call | Custom Code | Working | Not following semconv |

#### Remote Tool Listing & Execution

| Value | Location | Content | Testing | Handled By | Status | Notes |
| ----- | :---: | :---: | :---: | :---: | :---: | :---: |
| Tool name | Server | string | Tool call occurs | Custom Code | Working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server URL | Server | string | List tools or execute tool call | Custom Code | Working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| Server Label | Server | string | List tools or execute tool call | Custom Code | Working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |
| mcp_list_tools_id | Server | string | List tools | Custom Code | Working | [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) |

### Metrics

- Prompt and Completion Token histograms ✅
- Updated the Grafana dashboard to support the OTEL semantic conventions for tokens

### Observations

* sqlite spans get orphaned from the completions endpoint
  * Known OTEL issue; the recommended workaround is to disable sqlite instrumentation, since it is double wrapped and already covered by sqlalchemy. This is covered in the documentation:
    ```shell
    export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3"
    ```
* Responses API instrumentation is [missing](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/3436) in OpenTelemetry for OpenAI clients, even with traceloop or openllmetry
  * Upstream issues in opentelemetry-python-contrib
* A span is created for each streaming response, covering every chunk, so very large spans get created, which is not ideal but is the intended behavior
* MCP telemetry needs to be updated to follow semantic conventions. We can probably use a library for this and handle it in a separate issue.

### Updated Grafana Dashboard

<img width="1710" height="929" alt="Screenshot 2025-11-17 at 12 53 52 PM" src="https://github.com/user-attachments/assets/6cd941ad-81b7-47a9-8699-fa7113bbe47a" />

## Status

✅ Everything appears to be working, and the data we expect is being captured in the format we expect.

## Follow Ups

1. Make tool calling spans follow semconv and capture more data (a rough sketch of what this could look like follows below)
   1. Consider using an existing tracing library
2. Make shield spans follow semconv
3. Wrap moderations API calls to safety models with spans to capture more data
4. Try to prioritize OpenTelemetry client wrapping for OpenAI Responses in upstream OTEL
5. This would break the telemetry tests, which are currently disabled. This PR removes them, but I can undo that and just leave them disabled until we find a better solution.
6. Add a section to the docs that tracks the custom data we capture (not auto-instrumented data) so that users can understand what that data is and how to use it. Commit those changes to the OTEL gen_ai SIG if possible as well. Here is an [example](https://opentelemetry.io/docs/specs/semconv/gen-ai/aws-bedrock/) of how Bedrock handles it.
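For follow-ups 1 and 2, here is a minimal sketch of what a semconv-shaped tool-execution span could look like using the plain OpenTelemetry Python API. The tracer name, the MCP attribute keys, and the wrapper function are illustrative assumptions rather than this PR's implementation, and the `gen_ai.*` attribute names should be double-checked against the current (still-evolving) gen-ai semantic conventions.

```python
# Hypothetical sketch only: a tool-execution span shaped after the OTel gen-ai
# "execute_tool" conventions. The tracer name, the MCP attribute keys, and the
# wrapper itself are assumptions for illustration, not the code in this PR.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("llama_stack.tool_runtime")  # assumed instrumentation scope name


def execute_tool_with_span(tool_name, server_label, server_url, tool_fn, **kwargs):
    # Semconv recommends naming the span "execute_tool {tool name}"
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_name)
        # MCP-specific context we already capture today, kept as custom attributes
        span.set_attribute("mcp.server.label", server_label)  # assumed key
        span.set_attribute("mcp.server.url", server_url)      # assumed key
        try:
            return tool_fn(**kwargs)
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```

If we adopt an existing tracing library instead (follow-up 1.1), it would likely replace this kind of hand-rolled wrapper.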
# Starter Distribution
The `llamastack/distribution-starter` distribution is a comprehensive, multi-provider distribution that includes most of the available inference providers in Llama Stack. It's designed to be a one-stop solution for developers who want to experiment with different AI providers without having to configure each one individually.
## Provider Composition
The starter distribution consists of the following provider configurations:
| API | Provider(s) |
|---|---|
| agents | inline::meta-reference |
| datasetio | remote::huggingface, inline::localfs |
| eval | inline::meta-reference |
| files | inline::localfs |
| inference | remote::openai, remote::fireworks, remote::together, remote::ollama, remote::anthropic, remote::gemini, remote::groq, remote::sambanova, remote::vllm, remote::tgi, remote::cerebras, remote::llama-openai-compat, remote::nvidia, remote::hf::serverless, remote::hf::endpoint, inline::sentence-transformers |
| safety | inline::llama-guard |
| scoring | inline::basic, inline::llm-as-judge, inline::braintrust |
| tool_runtime | remote::brave-search, remote::tavily-search, inline::rag-runtime, remote::model-context-protocol |
| vector_io | inline::faiss, inline::sqlite-vec, inline::milvus, remote::chromadb, remote::pgvector |
## Inference Providers
The starter distribution includes a comprehensive set of inference providers:
### Hosted Providers

- OpenAI: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings - provider ID: `openai` - reference documentation: openai
- Fireworks: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings - provider ID: `fireworks` - reference documentation: fireworks
- Together: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings - provider ID: `together` - reference documentation: together
- Anthropic: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings - provider ID: `anthropic` - reference documentation: anthropic
- Gemini: Gemini 1.5, 2.0, 2.5 models and text embeddings - provider ID: `gemini` - reference documentation: gemini
- Groq: Fast Llama models (3.1, 3.2, 3.3, 4 Scout, 4 Maverick) - provider ID: `groq` - reference documentation: groq
- SambaNova: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models - provider ID: `sambanova` - reference documentation: sambanova
- Cerebras: Cerebras AI models - provider ID: `cerebras` - reference documentation: cerebras
- NVIDIA: NVIDIA NIM - provider ID: `nvidia` - reference documentation: nvidia
- HuggingFace: Serverless and endpoint models - provider ID: `hf::serverless` and `hf::endpoint` - reference documentation: huggingface-serverless and huggingface-endpoint
- Bedrock: AWS Bedrock models - provider ID: `bedrock` - reference documentation: bedrock
### Local/Remote Providers

- Ollama: Local Ollama models - provider ID: `ollama` - reference documentation: ollama
- vLLM: Local or remote vLLM server - provider ID: `vllm` - reference documentation: vllm
- TGI: Text Generation Inference server - Dell Enterprise Hub's custom TGI container too (use `DEH_URL`) - provider ID: `tgi` - reference documentation: tgi
- Sentence Transformers: Local embedding models - provider ID: `sentence-transformers` - reference documentation: sentence-transformers
All providers are disabled by default, so you need to enable them by setting the appropriate environment variables.
## Vector IO
The starter distribution includes a comprehensive set of vector IO providers:
- FAISS: Local FAISS vector store - enabled by default - provider ID: `faiss`
- SQLite: Local SQLite vector store - disabled by default - provider ID: `sqlite-vec`
- ChromaDB: Remote ChromaDB vector store - disabled by default - provider ID: `chromadb`
- PGVector: PostgreSQL vector store - disabled by default - provider ID: `pgvector`
- Milvus: Milvus vector store - disabled by default - provider ID: `milvus`
## Environment Variables
The following environment variables can be configured:
### Server Configuration

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
### API Keys for Hosted Providers

- `OPENAI_API_KEY`: OpenAI API key
- `FIREWORKS_API_KEY`: Fireworks API key
- `TOGETHER_API_KEY`: Together API key
- `ANTHROPIC_API_KEY`: Anthropic API key
- `GEMINI_API_KEY`: Google Gemini API key
- `GROQ_API_KEY`: Groq API key
- `SAMBANOVA_API_KEY`: SambaNova API key
- `CEREBRAS_API_KEY`: Cerebras API key
- `LLAMA_API_KEY`: Llama API key
- `NVIDIA_API_KEY`: NVIDIA API key
- `HF_API_TOKEN`: HuggingFace API token
### Local Provider Configuration

- `OLLAMA_URL`: Ollama server URL (default: `http://localhost:11434`)
- `VLLM_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
- `VLLM_MAX_TOKENS`: vLLM max tokens (default: `4096`)
- `VLLM_API_TOKEN`: vLLM API token (default: `fake`)
- `VLLM_TLS_VERIFY`: vLLM TLS verification (default: `true`)
- `TGI_URL`: TGI server URL
### Model Configuration

- `INFERENCE_MODEL`: HuggingFace model for serverless inference
- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name
### Vector Database Configuration

- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
- `ENABLE_SQLITE_VEC`: Enable SQLite vector provider
- `ENABLE_CHROMADB`: Enable ChromaDB provider
- `ENABLE_PGVECTOR`: Enable PGVector provider
- `CHROMADB_URL`: ChromaDB server URL
- `PGVECTOR_HOST`: PGVector host (default: `localhost`)
- `PGVECTOR_PORT`: PGVector port (default: `5432`)
- `PGVECTOR_DB`: PGVector database name
- `PGVECTOR_USER`: PGVector username
- `PGVECTOR_PASSWORD`: PGVector password
### Tool Configuration

- `BRAVE_SEARCH_API_KEY`: Brave Search API key
- `TAVILY_SEARCH_API_KEY`: Tavily Search API key
## Enabling Providers

You can enable specific providers by setting appropriate environment variables. For example:
```shell
# self-hosted
export OLLAMA_URL=http://localhost:11434  # enables the Ollama inference provider
export VLLM_URL=http://localhost:8000/v1  # enables the vLLM inference provider
export TGI_URL=http://localhost:8000/v1   # enables the TGI inference provider

# cloud-hosted requiring API key configuration on the server
export CEREBRAS_API_KEY=your_cerebras_api_key  # enables the Cerebras inference provider
export NVIDIA_API_KEY=your_nvidia_api_key      # enables the NVIDIA inference provider

# vector providers
export MILVUS_URL=http://localhost:19530      # enables the Milvus vector provider
export CHROMADB_URL=http://localhost:8000/v1  # enables the ChromaDB vector provider
export PGVECTOR_DB=llama_stack_db             # enables the PGVector vector provider
```
This distribution comes with a default "llama-guard" shield that can be enabled by setting the `SAFETY_MODEL` environment variable to point to an appropriate Llama Guard model id. Use `llama-stack-client models list` to see the list of available models.
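As a quick check that the shield is wired up, a request like the one in the telemetry traffic driver earlier on this page can be sent to the safety endpoint. This is a minimal sketch: the `shield_id` value and the absence of an `Authorization` header are assumptions that depend on how your stack is configured.

```python
# Minimal sketch: exercising the default llama-guard shield over HTTP.
# The shield_id value and the lack of an Authorization header are assumptions;
# use the ID your shield is registered under and add auth if your server requires it.
import requests

payload = {
    "shield_id": "llama-guard",  # assumed shield ID
    "messages": [{"role": "user", "content": "Teach me how to pick a lock."}],
    "params": {},
}
resp = requests.post("http://localhost:8321/v1/safety/run-shield", json=payload)
resp.raise_for_status()
print(resp.json())
```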
## Running the Distribution

You can run the starter distribution via Docker, or directly in a Conda or venv environment.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
```shell
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e OPENAI_API_KEY=your_openai_key \
  -e FIREWORKS_API_KEY=your_fireworks_key \
  -e TOGETHER_API_KEY=your_together_key \
  llamastack/distribution-starter \
  --port $LLAMA_STACK_PORT
```
The container will run the distribution with a SQLite store by default. This store is used for the following components:
- Metadata store: store metadata about the models, providers, etc.
- Inference store: collection of responses from the inference provider
- Agents store: store agent configurations (sessions, turns, etc.)
- Agents Responses store: store responses from the agents
However, you can use PostgreSQL instead by running the `starter::run-with-postgres-store.yaml` configuration:
```shell
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e OPENAI_API_KEY=your_openai_key \
  -e FIREWORKS_API_KEY=your_fireworks_key \
  -e TOGETHER_API_KEY=your_together_key \
  -e POSTGRES_HOST=your_postgres_host \
  -e POSTGRES_PORT=your_postgres_port \
  -e POSTGRES_DB=your_postgres_db \
  -e POSTGRES_USER=your_postgres_user \
  -e POSTGRES_PASSWORD=your_postgres_password \
  llamastack/distribution-starter \
  starter::run-with-postgres-store.yaml
```
Postgres environment variables:
- `POSTGRES_HOST`: Postgres host (default: `localhost`)
- `POSTGRES_PORT`: Postgres port (default: `5432`)
- `POSTGRES_DB`: Postgres database name (default: `llamastack`)
- `POSTGRES_USER`: Postgres username (default: `llamastack`)
- `POSTGRES_PASSWORD`: Postgres password (default: `llamastack`)
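Whichever store backs the container, a quick way to confirm the server is up is to list models through the OpenAI-compatible endpoint, as the traffic driver earlier on this page does. A minimal sketch, assuming the default port and no API key enforcement:

```python
# Smoke test against a running starter distribution. Assumes the default port
# (8321) and that no real API key is required; adjust for your deployment.
from openai import OpenAI

client = OpenAI(api_key="fake", base_url="http://localhost:8321/v1/")
for model in client.models.list():
    print(model.id)
```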
### Via Conda or venv
Ensure you have configured the starter distribution using the environment variables explained above.
```shell
# Install dependencies for the starter distribution
uv run --with llama-stack llama stack list-deps starter | xargs -L1 uv pip install

# Run the server (with SQLite - default)
uv run --with llama-stack llama stack run starter

# Or run with PostgreSQL
uv run --with llama-stack llama stack run starter::run-with-postgres-store.yaml
```
## Example Usage
Once the distribution is running, you can use any of the available models. Here are some examples:
### Using OpenAI Models

```shell
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id openai/gpt-4o \
  --message "Hello, how are you?"
```
### Using Fireworks Models

```shell
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
  --message "Write a short story about a robot."
```
### Using Local Ollama Models

```shell
# First, make sure Ollama is running and you have a model
ollama run llama3.2:3b

# Then use it through Llama Stack
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id ollama/llama3.2:3b \
  --message "Explain quantum computing in simple terms."
```
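The same calls can be made from Python through the OpenAI-compatible API rather than the CLI. A minimal sketch, assuming the server is on the default port and that the OpenAI and Ollama providers are enabled with the models shown above:

```python
# Python equivalents of the CLI examples above, via the OpenAI-compatible API.
# Assumes the default port and that the referenced providers/models are enabled.
from openai import OpenAI

client = OpenAI(api_key="fake", base_url="http://localhost:8321/v1/")

openai_reply = client.chat.completions.create(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(openai_reply.choices[0].message.content)

ollama_reply = client.chat.completions.create(
    model="ollama/llama3.2:3b",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
)
print(ollama_reply.choices[0].message.content)
```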
## Storage
The starter distribution uses SQLite for local storage of various components:
- Metadata store: `~/.llama/distributions/starter/registry.db`
- Inference store: `~/.llama/distributions/starter/inference_store.db`
- FAISS store: `~/.llama/distributions/starter/faiss_store.db`
- SQLite vector store: `~/.llama/distributions/starter/sqlite_vec.db`
- Files metadata: `~/.llama/distributions/starter/files_metadata.db`
- Agents store: `~/.llama/distributions/starter/agents_store.db`
- Responses store: `~/.llama/distributions/starter/responses_store.db`
- Evaluation store: `~/.llama/distributions/starter/meta_reference_eval.db`
- Dataset I/O stores: Various HuggingFace and local filesystem stores
## Benefits of the Starter Distribution
- Comprehensive Coverage: Includes most popular AI providers in one distribution
- Flexible Configuration: Easy to enable/disable providers based on your needs
- No Local GPU Required: Most providers are cloud-based, making it accessible to developers without high-end hardware
- Easy Migration: Start with hosted providers and gradually move to local ones as needed
- Production Ready: Includes safety and evaluation
- Tool Integration: Comes with web search, RAG, and model context protocol tools
The starter distribution is ideal for developers who want to experiment with different AI providers, build prototypes quickly, or create applications that can work with multiple AI backends.