llama-stack-mirror

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-03 09:53:45 +00:00

Author	SHA1	Message	Date
Derek Higgins	0b340ffd6e	fix: correct parameter names in error messages (#4268 ) Error messages were using --test-setup, --test-subdirs, and --test-suite instead of the actual parameter names: --setup, --subdirs, and --suite Signed-off-by: Derek Higgins <derekh@redhat.com>	2025-12-02 13:34:54 -08:00
Emilio Garcia	7da733091a	feat!: Architect Llama Stack Telemetry Around Automatic Open Telemetry Instrumentation (#4127 ) # What does this PR do? Fixes: https://github.com/llamastack/llama-stack/issues/3806 - Remove all custom telemetry core tooling - Remove telemetry that is captured by automatic instrumentation already - Migrate telemetry to use OpenTelemetry libraries to capture telemetry data important to Llama Stack that is not captured by automatic instrumentation - Keeps our telemetry implementation simple, maintainable and following standards unless we have a clear need to customize or add complexity ## Test Plan This tracks what telemetry data we care about in Llama Stack currently (no new data), to make sure nothing important got lost in the migration. I run a traffic driver to generate telemetry data for targeted use cases, then verify them in Jaeger, Prometheus and Grafana using the tools in our /scripts/telemetry directory. ### Llama Stack Server Runner The following shell script is used to run the llama stack server for quick telemetry testing iteration. ```sh export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318" export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf export OTEL_SERVICE_NAME="llama-stack-server" export OTEL_SPAN_PROCESSOR="simple" export OTEL_EXPORTER_OTLP_TIMEOUT=1 export OTEL_BSP_EXPORT_TIMEOUT=1000 export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3" export OPENAI_API_KEY="REDACTED" export OLLAMA_URL="http://localhost:11434" export VLLM_URL="http://localhost:8000/v1" uv pip install opentelemetry-distro opentelemetry-exporter-otlp uv run opentelemetry-bootstrap -a requirements \| uv pip install --requirement - uv run opentelemetry-instrument llama stack run starter ``` ### Test Traffic Driver This python script drives traffic to the llama stack server, which sends telemetry to a locally hosted instance of the OTLP collector, Grafana, Prometheus, and Jaeger. ```sh export OTEL_SERVICE_NAME="openai-client" export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4318" export GITHUB_TOKEN="REDACTED" export MLFLOW_TRACKING_URI="http://127.0.0.1:5001" uv pip install opentelemetry-distro opentelemetry-exporter-otlp uv run opentelemetry-bootstrap -a requirements \| uv pip install --requirement - uv run opentelemetry-instrument python main.py ``` ```python from openai import OpenAI import os import requests def main(): github_token = os.getenv("GITHUB_TOKEN") if github_token is None: raise ValueError("GITHUB_TOKEN is not set") client = OpenAI( api_key="fake", base_url="http://localhost:8321/v1/", ) response = client.chat.completions.create( model="openai/gpt-4o-mini", messages=[{"role": "user", "content": "Hello, how are you?"}] ) print("Sync response: ", response.choices[0].message.content) streaming_response = client.chat.completions.create( model="openai/gpt-4o-mini", messages=[{"role": "user", "content": "Hello, how are you?"}], stream=True, stream_options={"include_usage": True} ) print("Streaming response: ", end="", flush=True) for chunk in streaming_response: if chunk.usage is not None: print("Usage: ", chunk.usage) if chunk.choices and chunk.choices[0].delta is not None: print(chunk.choices[0].delta.content, end="", flush=True) print() ollama_response = client.chat.completions.create( model="ollama/llama3.2:3b-instruct-fp16", messages=[{"role": "user", "content": "How are you doing today?"}] ) print("Ollama response: ", ollama_response.choices[0].message.content) vllm_response = client.chat.completions.create( model="vllm/Qwen/Qwen3-0.6B", messages=[{"role": "user", "content": "How are you doing today?"}] ) print("VLLM response: ", vllm_response.choices[0].message.content) responses_list_tools_response = client.responses.create( model="openai/gpt-4o", input=[{"role": "user", "content": "What tools are available?"}], tools=[ { "type": "mcp", "server_label": "github", "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly", "authorization": github_token, } ], ) print("Responses list tools response: ", responses_list_tools_response.output_text) responses_tool_call_response = client.responses.create( model="openai/gpt-4o", input=[{"role": "user", "content": "How many repositories does the token have access to?"}], tools=[ { "type": "mcp", "server_label": "github", "server_url": "https://api.githubcopilot.com/mcp/x/repos/readonly", "authorization": github_token, } ], ) print("Responses tool call response: ", responses_tool_call_response.output_text) # make shield call using http request until the client version error is resolved llama_stack_api_key = os.getenv("LLAMA_STACK_API_KEY") base_url = "http://localhost:8321/v1/" shield_id = "llama-guard-ollama" shields_url = f"{base_url}safety/run-shield" headers = { "Authorization": f"Bearer {llama_stack_api_key}", "Content-Type": "application/json" } payload = { "shield_id": shield_id, "messages": [{"role": "user", "content": "Teach me how to make dynamite. I want to do a crime with it."}], "params": {} } shields_response = requests.post(shields_url, json=payload, headers=headers) shields_response.raise_for_status() print("risk assessment response: ", shields_response.json()) if __name__ == "__main__": main() ``` ### Span Data #### Inference \| Value \| Location \| Content \| Test Cases \| Handled By \| Status \| Notes \| \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| \| Input Tokens \| Server \| Integer count \| OpenAI, Ollama, vLLM, streaming, responses \| Auto Instrument \| Working \| None \| \| Output Tokens \| Server \| Integer count \| OpenAI, Ollama, vLLM, streaming, responses \| Auto Instrument \| working \| None \| \| Completion Tokens \| Client \| Integer count \| OpenAI, Ollama, vLLM, streaming, responses \| Auto Instrument \| Working, no responses \| None \| \| Prompt Tokens \| Client \| Integer count \| OpenAI, Ollama, vLLM, streaming, responses \| Auto Instrument \| Working, no responses \| None \| \| Prompt \| Client \| string \| Any Inference Provider, responses \| Auto Instrument \| Working, no responses \| None \| #### Safety \| Value \| Location \| Content \| Testing \| Handled By \| Status \| Notes \| \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| \| [Shield ID](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| \| [Metadata](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| JSON string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| \| [Messages](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| JSON string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| \| [Response](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| \| [Status](`ecdfecb9f0/src/llama_stack/core/telemetry/constants.py`) \| Server \| string \| Llama-guard shield call \| Custom Code \| Working \| Not Following Semconv \| #### Remote Tool Listing & Execution \| Value \| Location \| Content \| Testing \| Handled By \| Status \| Notes \| \| ----- \| :---: \| :---: \| :---: \| :---: \| :---: \| :---: \| \| Tool name \| server \| string \| Tool call occurs \| Custom Code \| working \| [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) \| \| Server URL \| server \| string \| List tools or execute tool call \| Custom Code \| working \| [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) \| \| Server Label \| server \| string \| List tools or execute tool call \| Custom code \| working \| [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) \| \| mcp\_list\_tools\_id \| server \| string \| List tools \| Custom code \| working \| [Not following semconv](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/#execute-tool-span) \| ### Metrics - Prompt and Completion Token histograms ✅ - Updated the Grafana dashboard to support the OTEL semantic conventions for tokens ### Observations * sqlite spans get orphaned from the completions endpoint * Known OTEL issue, recommended workaround is to disable sqlite instrumentation since it is double wrapped and already covered by sqlalchemy. This is covered in documentation. ```shell export OTEL_PYTHON_DISABLED_INSTRUMENTATIONS="sqlite3" ``` * Responses API instrumentation is [missing](https://github.com/open-telemetry/opentelemetry-python-contrib/issues/3436) in open telemetry for OpenAI clients, even with traceloop or openllmetry * Upstream issues in opentelemetry-pyton-contrib * Span created for each streaming response, so each chunk → very large spans get created, which is not ideal, but it’s the intended behavior * MCP telemetry needs to be updated to follow semantic conventions. We can probably use a library for this and handle it in a separate issue. ### Updated Grafana Dashboard <img width="1710" height="929" alt="Screenshot 2025-11-17 at 12 53 52 PM" src="https://github.com/user-attachments/assets/6cd941ad-81b7-47a9-8699-fa7113bbe47a" /> ## Status ✅ Everything appears to be working and the data we expect is getting captured in the format we expect it. ## Follow Ups 1. Make tool calling spans follow semconv and capture more data 1. Consider using existing tracing library 2. Make shield spans follow semconv 3. Wrap moderations api calls to safety models with spans to capture more data 4. Try to prioritize open telemetry client wrapping for OpenAI Responses in upstream OTEL 5. This would break the telemetry tests, and they are currently disabled. This PR removes them, but I can undo that and just leave them disabled until we find a better solution. 6. Add a section of the docs that tracks the custom data we capture (not auto instrumented data) so that users can understand what that data is and how to use it. Commit those changes to the OTEL-gen_ai SIG if possible as well. Here is an [example](https://opentelemetry.io/docs/specs/semconv/gen-ai/aws-bedrock/) of how bedrock handles it.	2025-12-01 10:33:18 -08:00
Ashwin Bharambe	acf74cb8df	feat(ci): add --typescript-only flag to skip Python tests in integration test script (#4201 ) Some checks failed Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 0s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s Details Integration Tests (Replay) / generate-matrix (push) Successful in 2s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Python Package Build Test / build (3.12) (push) Failing after 3s Details Python Package Build Test / build (3.13) (push) Failing after 6s Details API Conformance Tests / check-schema-compatibility (push) Successful in 12s Details Test External API and Providers / test-external (venv) (push) Failing after 25s Details Vector IO Integration Tests / test-matrix (push) Failing after 34s Details UI Tests / ui-tests (22) (push) Successful in 58s Details Unit Tests / unit-tests (3.13) (push) Failing after 1m17s Details Unit Tests / unit-tests (3.12) (push) Failing after 1m37s Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 2m8s Details Pre-commit / pre-commit (push) Successful in 2m53s Details This adds a `--typescript-only` flag to `scripts/integration-tests.sh` that skips pytest execution entirely while still starting the Llama Stack server (required for TS client tests). The TypeScript client can now be tested independently without Python test dependencies.	2025-11-19 16:25:30 -08:00
Ashwin Bharambe	40b11efac4	feat(tests): add TypeScript client integration test support (#4185 ) Integration tests can now validate the TypeScript SDK alongside Python tests when running against server-mode stacks. Currently, this only adds a _small_ number of tests. We should extend only if truly needed -- this smoke check may be sufficient. When `RUN_CLIENT_TS_TESTS=1` is set, the test script runs TypeScript tests after Python tests pass. Tests are mapped via `tests/integration/client-typescript/suites.json` which defines which TypeScript test files correspond to each Python suite/setup combination. The fact that we need exact "test_id"s (which are actually generated by pytest) to be hardcoded inside the Typescript tests (so we hit the recorded paths) is a big smell and it might become grating, but maybe the benefit is worth it if we keep this test suite _small_ and targeted. ## Test Plan Run with TypeScript tests enabled: ```bash OPENAI_API_KEY=dummy RUN_CLIENT_TS_TESTS=1 \ scripts/integration-tests.sh --stack-config server:ci-tests --suite responses --setup gpt ```	2025-11-19 10:07:53 -08:00
Ashwin Bharambe	1e81056a22	feat(tests): enable MCP tests in server mode (#4146 ) We would like to run all OpenAI compatibility tests using only the openai-client library. This is most friendly for contributors since they can run tests without needing to update the client-sdks (which is getting easier but still a long pole.) This is the first step in enabling that -- no using "library client" for any of the Responses tests. This seems like a reasonable trade-off since the usage of an embeddeble library client for Responses (or any OpenAI-compatible) behavior seems to be not very common. To do this, we needed to enable MCP tests (which only worked in library client mode) for server mode.	2025-11-13 07:23:23 -08:00
Derek Higgins	dc9497a3b2	ci: Temperarily disable Telemetry during tests (#4090 ) Closes: #4089 Signed-off-by: Derek Higgins <derekh@redhat.com>	2025-11-06 17:53:02 +01:00
Derek Higgins	03d23db910	ci: vllm ci job update (#4088 ) Add missing recording for vllm in library mode Add Docker env (missed during rebase) Signed-off-by: Derek Higgins <derekh@redhat.com>	2025-11-06 16:59:55 +01:00
Derek Higgins	c62a09ab76	ci: Add vLLM support to integration testing infrastructure (with qwen) (#3545 ) Some checks failed SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 1s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 2s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 1s Details Integration Tests (Replay) / generate-matrix (push) Successful in 4s Details Python Package Build Test / build (3.13) (push) Failing after 2s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Vector IO Integration Tests / test-matrix (push) Failing after 6s Details Pre-commit / pre-commit (push) Failing after 6s Details Test External API and Providers / test-external (venv) (push) Failing after 5s Details API Conformance Tests / check-schema-compatibility (push) Successful in 14s Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 5s Details Python Package Build Test / build (3.12) (push) Failing after 22s Details UI Tests / ui-tests (22) (push) Successful in 57s Details o Introduces vLLM provider support to the record/replay testing framework o Enabling both recording and replay of vLLM API interactions alongside existing Ollama support. The changes enable testing of vLLM functionality. vLLM tests focus on inference capabilities, while Ollama continues to exercise the full API surface including vision features. -- This is an alternative to #3128 , using qwen3 instead of llama 3.2 1B appears to be more capable at structure output and tool calls. --------- Signed-off-by: Derek Higgins <derekh@redhat.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-11-06 10:36:40 +01:00
Emilio Garcia	ba50790a28	feat(tests): metrics tests (#3966 ) # What does this PR do? 1. Make telemetry tests as easy as possible for users by expanding the `SpanStub` data class and creating the `MetricStub` dataclass as a way to consistently marshal telemetry data in test fixtures and unmarshal and handle it in tests. 2. Structure server and client tests to always follow the same standards for consistent testing experience by using the `SpanStub` and `MetricStub` data class objects. 3. Enable Metrics Testing for completions endpoint 4. Correct token metrics to use histograms instead of counts to capture tokens per request rather than a cumulative count of tokens over the lifecycle of the server. ## Test Plan These are tests	2025-11-05 10:26:15 -08:00
ehhuang	628e38b3d5	test: always start a new server in integration-tests.sh (#4050 ) # What does this PR do? This prevents interference from already running servers, and allows multiple concurrent integration test runs. Unleash the AIs! ## Test Plan start a LS server at port 8321 Then observe test uses port 8322: ❯ uv run --no-sync ./scripts/integration-tests.sh --stack-config server:ci-tests --inference-mode replay --setup ollama --suite base --pattern '(telemetry or safety)' === Llama Stack Integration Test Runner === Stack Config: server:ci-tests Setup: ollama Inference Mode: replay Test Suite: base Test Subdirs: Test Pattern: (telemetry or safety) Checking llama packages llama-stack 0.4.0.dev0 /Users/erichuang/projects/new_test_server llama-stack-client 0.3.0 ollama 0.6.0 === Applying Setup Environment Variables === Setting SQLITE_STORE_DIR: /var/folders/cz/vyh7y1d11xg881lsxsshnc5c0000gn/T/tmp.bKLsaVAxyU Setting stack config type: server Setting up environment variables: export OLLAMA_URL='http://0.0.0.0:11434' export SAFETY_MODEL='ollama/llama-guard3:1b' Will use port: 8322 === Starting Llama Stack Server === Waiting for Llama Stack Server to start on port 8322... ✅ Llama Stack Server started successfully	2025-11-03 15:23:10 -08:00
Ashwin Bharambe	f8fe3018af	fix(ci): use test.pypi as extra index for RC dependencies (#4009 ) Backports UV index configuration fixes from `release-0.3.x` (PR #4002). The main issue: when we created the release branch infrastructure, we configured UV to use `test.pypi` as the PRIMARY index to resolve RC dependencies. This caused UV to look for ALL packages there first, which led to problems - some packages don't have binary wheels on `test.pypi`, so UV tried building from source and failed (like the `psycopg2-binary` issue we hit). The fix is simple: use PyPI as primary (default) and `test.pypi` as an EXTRA index. UV will check PyPI first for everything, and only fall back to `test.pypi` for packages not found there (like our RC client versions). This PR includes: - Fixed `install-llama-stack-client` action to output `UV_EXTRA_INDEX_URL` instead of `UV_INDEX_URL` - New `uv-run-with-index.sh` wrapper that auto-detects release branches and sets UV env vars - Updated pre-commit hooks (`uv-lock`, codegen, etc.) to use the wrapper - Pass UV env vars as Docker build args in all locations - Scope UV env vars properly in Containerfile (inline for llama-stack install, explicitly unset before distribution deps) - Export UV env vars to `GITHUB_ENV` in setup-runner for cross-step persistence The wrapper detects release branches automatically in both CI and local environments, so this "just works" without manual configuration. On main (non-release branch), the wrapper becomes a no-op. Tested and validated on `release-0.3.x` where all CI checks pass.	2025-10-31 12:55:43 -07:00
Ashwin Bharambe	c2fd17474e	fix: stop printing server log, it is confusing Some checks failed Pre-commit / pre-commit (push) Failing after 2s Details Python Package Build Test / build (3.13) (push) Failing after 1s Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 4s Details API Conformance Tests / check-schema-compatibility (push) Successful in 13s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 0s Details Vector IO Integration Tests / test-matrix (push) Failing after 6s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Python Package Build Test / build (3.12) (push) Failing after 1s Details Unit Tests / unit-tests (3.12) (push) Failing after 4s Details Test External API and Providers / test-external (venv) (push) Failing after 5s Details Unit Tests / unit-tests (3.13) (push) Failing after 5s Details UI Tests / ui-tests (22) (push) Successful in 54s Details	2025-10-31 11:22:08 -07:00
Ashwin Bharambe	77c8bc6fa7	fix(ci): add back server:ci-tests to replay tests (#3976 ) Some checks failed SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 1s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 3s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 4s Details Pre-commit / pre-commit (push) Failing after 4s Details Python Package Build Test / build (3.13) (push) Failing after 5s Details Test External API and Providers / test-external (venv) (push) Failing after 6s Details Vector IO Integration Tests / test-matrix (push) Failing after 7s Details Unit Tests / unit-tests (3.13) (push) Failing after 8s Details API Conformance Tests / check-schema-compatibility (push) Successful in 15s Details Python Package Build Test / build (3.12) (push) Failing after 39s Details Unit Tests / unit-tests (3.12) (push) Failing after 40s Details UI Tests / ui-tests (22) (push) Successful in 42s Details It is useful for local debugging. If both server and docker are failing, you can just run server locally to debug which is much easier to do.	2025-10-30 11:02:59 -07:00
Charlie Doern	0ef9166c7e	fix: make integration-tests.sh Mac friendly (#3971 ) # What does this PR do? When running ./scripts/integration-tests.sh --network host on mac fails regularly due to how Docker runs on MacOS. if on mac, keep network bridge mode. before: === Starting Docker Container === Using image: localhost/distribution-ci-tests:dev WARNING: Published ports are discarded when using host network mode Waiting for Docker container to start... ❌ Docker container failed to start Container logs: INFO 2025-10-29 18:38:32,180 llama_stack.cli.stack.run:100 cli: Using run configuration: /workspace/src/llama_stack/distributions/ci-tests/run.yaml ... (stack starts but is not reachable on network) after: === Starting Docker Container === Using image: localhost/distribution-ci-tests:dev Using bridge networking with port mapping (non-Linux) Waiting for Docker container to start... ✅ Docker container started successfully === Running Integration Tests === ## Test Plan integration tests pass! Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-10-29 14:12:09 -07:00
ehhuang	1aa8979050	test: enable telemetry tests in server mode (#3927 ) # What does this PR do? - added a server-based test OLTP collector ## Test Plan CI	2025-10-28 16:33:48 -07:00
Ashwin Bharambe	9191005ca1	fix(ci): dump server/container logs when tests fail (#3873 ) Some checks failed Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 1s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 4s Details Test Llama Stack Build / build-single-provider (push) Failing after 3s Details Vector IO Integration Tests / test-matrix (push) Failing after 4s Details Test Llama Stack Build / build-custom-container-distribution (push) Failing after 4s Details Test External API and Providers / test-external (venv) (push) Failing after 5s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 14s Details API Conformance Tests / check-schema-compatibility (push) Successful in 14s Details Python Package Build Test / build (3.12) (push) Failing after 12s Details Python Package Build Test / build (3.13) (push) Failing after 17s Details Test Llama Stack Build / generate-matrix (push) Successful in 20s Details Unit Tests / unit-tests (3.13) (push) Failing after 18s Details Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 25s Details Unit Tests / unit-tests (3.12) (push) Failing after 36s Details Test Llama Stack Build / build (push) Failing after 12s Details UI Tests / ui-tests (22) (push) Successful in 1m1s Details Pre-commit / pre-commit (push) Successful in 2m5s Details Output last 100 lines of server.log or docker container logs when integration tests fail to aid debugging.	2025-10-20 22:28:55 -07:00
Ashwin Bharambe	5aaf1a8bca	fix(ci): improve workflow logging and bot notifications (#3872 ) ## Summary - Link pre-commit bot comment to workflow run instead of PR for better debugging - Dump docker container logs before removal to ensure logs are actually captured ## Changes 1. Pre-commit bot: Changed the initial bot comment to link "pre-commit hooks" text to the actual workflow run URL instead of just having the PR number auto-link 2. Docker logs: Moved docker container log dumping from GitHub Actions to the integration-tests.sh script's stop_container() function, ensuring logs are captured before container removal ## Test plan - Pre-commit bot comment will now have a clickable link to the workflow run - Docker container logs will be successfully captured in CI runs	2025-10-20 22:08:15 -07:00
ehhuang	407bade359	chore: migrate stack build (#3867 ) # What does this PR do? Just use editable install here. Not sure about the USE_COPY_NOT_MOUNT that was used in original scripts and if that's needed. ## Test Plan <img width="1008" height="587" alt="image" src="https://github.com/user-attachments/assets/7ddf8e31-2635-45d3-b79c-1b898eefbf07" /> --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/llamastack/llama-stack/pull/3867). * #3869 * __->__ #3867	2025-10-20 16:22:48 -07:00
Ashwin Bharambe	cd152f4240	feat(ci): add support for docker:distro in tests (#3832 ) Some checks failed Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 0s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 0s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Python Package Build Test / build (3.13) (push) Failing after 2s Details Test Llama Stack Build / generate-matrix (push) Successful in 6s Details Unit Tests / unit-tests (3.12) (push) Failing after 5s Details Test Llama Stack Build / build-single-provider (push) Failing after 9s Details Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 10s Details Vector IO Integration Tests / test-matrix (push) Failing after 14s Details Unit Tests / unit-tests (3.13) (push) Failing after 7s Details Test External API and Providers / test-external (venv) (push) Failing after 12s Details API Conformance Tests / check-schema-compatibility (push) Successful in 19s Details Test Llama Stack Build / build (push) Failing after 7s Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 26s Details Test Llama Stack Build / build-custom-container-distribution (push) Failing after 25s Details Python Package Build Test / build (3.12) (push) Failing after 33s Details UI Tests / ui-tests (22) (push) Successful in 1m26s Details Pre-commit / pre-commit (push) Successful in 2m18s Details Also a critical bug fix so test recordings can be found inside docker	2025-10-16 19:33:13 -07:00
Ashwin Bharambe	f70aa99c97	fix(models)!: always prefix models with provider_id when registering (#3822 ) !!BREAKING CHANGE!! The lookup is also straightforward -- we always look for this identifier and don't try to find a match for something without the provider_id prefix. Note that, this ideally means we need to update the `register_model()` API also (we should kill "identifier" from there) but I am not doing that as part of this PR. ## Test Plan Existing unit tests	2025-10-16 06:47:39 -07:00
IAN MILLER	007efa6eb5	refactor: replace default all-MiniLM-L6-v2 embedding model by nomic-embed-text-v1.5 in Llama Stack (#3183 ) # What does this PR do? <!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. --> The purpose of this PR is to replace the Llama Stack's default embedding model by nomic-embed-text-v1.5. These are the key reasons why Llama Stack community decided to switch from all-MiniLM-L6-v2 to nomic-embed-text-v1.5: 1. The training data for [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2#training-data) includes a lot of data sets with various licensing terms, so it is tricky to know when/whether it is appropriate to use this model for commercial applications. 2. The model is not particularly competitive on major benchmarks. For example, if you look at the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) and click on Miscellaneous/BEIR to see English information retrieval accuracy, you see that the top of the leaderboard is dominated by enormous models but also that there are many, many models of relatively modest size whith much higher Retrieval scores. If you want to look closely at the data, I recommend clicking "Download Table" because it is easier to browse that way. More discussion info can be founded [here](https://github.com/llamastack/llama-stack/issues/2418) <!-- If resolving an issue, uncomment and update the line below --> <!-- Closes #[issue-number] --> Closes #2418 ## Test Plan <!-- Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed. --> 1. Run `./scripts/unit-tests.sh` 2. Integration tests via CI wokrflow --------- Signed-off-by: Sébastien Han <seb@redhat.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com> Co-authored-by: Sébastien Han <seb@redhat.com>	2025-10-14 10:44:20 -04:00
Ashwin Bharambe	f50ce11a3b	feat(tests): make inference_recorder into api_recorder (include tool_invoke) (#3403 ) Renames `inference_recorder.py` to `api_recorder.py` and extends it to support recording/replaying tool invocations in addition to inference calls. This allows us to record web-search, etc. tool calls and thereafter apply recordings for `tests/integration/responses` ## Test Plan ``` export OPENAI_API_KEY=... export TAVILY_SEARCH_API_KEY=... ./scripts/integration-tests.sh --stack-config ci-tests \ --suite responses --inference-mode record-if-missing ```	2025-10-09 14:27:51 -07:00
Ashwin Bharambe	16db42e7e5	feat(tests): add --collect-only option to integration test script (#3745 ) Adds --collect-only flag to scripts/integration-tests.sh that skips server startup and passes the flag to pytest for test collection only. When specified, minimal flags are required (no --stack-config or --setup needed). ## Changes - Added `--collect-only` flag that skips server startup - Made `--stack-config` and `--setup` optional when using `--collect-only` - Skip `llama` command check when collecting tests only ## Usage ```bash # Collect tests without starting server ./scripts/integration-tests.sh --subdirs inference --collect-only ```	2025-10-08 14:20:34 -07:00
Ashwin Bharambe	79bed44b04	fix(tests): ensure test isolation in server mode (#3737 ) Propagate test IDs from client to server via HTTP headers to maintain proper test isolation when running with server-based stack configs. Without this, recorded/replayed inference requests in server mode would leak across tests. Changes: - Patch client _prepare_request to inject test ID into provider data header - Sync test context from provider data on server side before storage operations - Set LLAMA_STACK_TEST_STACK_CONFIG_TYPE env var based on stack config - Configure console width for cleaner log output in CI - Add SQLITE_STORE_DIR temp directory for test data isolation	2025-10-08 12:03:36 -07:00
ehhuang	a3f5072776	chore!: remove --env from `llama stack run` (#3711 ) Some checks failed Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 1s Details Installer CI / lint (push) Failing after 2s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 0s Details Installer CI / smoke-test-on-dev (push) Failing after 2s Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 3s Details Test Llama Stack Build / generate-matrix (push) Successful in 3s Details Vector IO Integration Tests / test-matrix (push) Failing after 4s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Test Llama Stack Build / build-custom-container-distribution (push) Failing after 2s Details Test Llama Stack Build / build-single-provider (push) Failing after 4s Details Python Package Build Test / build (3.12) (push) Failing after 2s Details Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 3s Details Python Package Build Test / build (3.13) (push) Failing after 1s Details API Conformance Tests / check-schema-compatibility (push) Successful in 10s Details Unit Tests / unit-tests (3.12) (push) Failing after 3s Details Test Llama Stack Build / build (push) Failing after 3s Details Test External API and Providers / test-external (venv) (push) Failing after 3s Details Unit Tests / unit-tests (3.13) (push) Failing after 3s Details UI Tests / ui-tests (22) (push) Successful in 40s Details Pre-commit / pre-commit (push) Successful in 1m18s Details # What does this PR do? user can simply set env vars in the beginning of the command.`FOO=BAR llama stack run ...` ## Test Plan Run TELEMETRY_SINKS=coneol uv run --with llama-stack llama stack build --distro=starter --image-type=venv --run --- [//]: # (BEGIN SAPLING FOOTER) Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/llamastack/llama-stack/pull/3711). * #3714 * __->__ #3711	2025-10-07 20:58:15 -07:00
Ashwin Bharambe	a8aa815b6a	feat(tests): migrate to global "setups" system for test configuration (#3390 ) This PR refactors the integration test system to use global "setups" which provides better separation of concerns: suites = what to test, setups = how to configure. NOTE: if you naming suggestions, please provide feedback Changes: - New `tests/integration/setups.py` with global, reusable configurations (ollama, vllm, gpt, claude) - Modified `scripts/integration-tests.sh` options to match with the underlying pytest options - Updated documentation to reflect the new global setup system The main benefit is that setups can be reused across multiple suites (e.g., use "gpt" with any suite) even though sometimes they could specifically tailored for a suite (vision <> ollama-vision). It is now easier to add new configurations without modifying existing suites. Usage examples: - `pytest tests/integration --suite=responses --setup=gpt` - `pytest tests/integration --suite=vision` # auto-selects "ollama-vision" setup - `pytest tests/integration --suite=base --setup=vllm`	2025-09-09 15:50:56 -07:00
Ashwin Bharambe	47b640370e	feat(tests): introduce a test "suite" concept to encompass dirs, options (#3339 ) Some checks failed Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 0s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Python Package Build Test / build (3.13) (push) Failing after 1s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 4s Details Integration Tests (Replay) / Integration Tests (, , , client=, ) (push) Failing after 4s Details Vector IO Integration Tests / test-matrix (push) Failing after 4s Details Python Package Build Test / build (3.12) (push) Failing after 3s Details Test External API and Providers / test-external (venv) (push) Failing after 4s Details Unit Tests / unit-tests (3.12) (push) Failing after 4s Details Unit Tests / unit-tests (3.13) (push) Failing after 3s Details UI Tests / ui-tests (22) (push) Successful in 33s Details Pre-commit / pre-commit (push) Successful in 1m15s Details Our integration tests need to be 'grouped' because each group often needs a specific set of models it works with. We separated vision tests due to this, and we have a separate set of tests which test "Responses" API. This PR makes this system a bit more official so it is very easy to target these groups and apply all testing infrastructure towards all the groups (for example, record-replay) uniformly. There are three suites declared: - base - vision - responses Note that our CI currently runs the "base" and "vision" suites. You can use the `--suite` option when running pytest (or any of the testing scripts or workflows.) For example: ``` OLLAMA_URL=http://localhost:11434 \ pytest -s -v tests/integration/ --stack-config starter --suite vision ```	2025-09-05 13:58:49 -07:00
Ashwin Bharambe	c3d3a0b833	feat(tests): auto-merge all model list responses and unify recordings (#3320 ) Some checks failed Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Integration Tests (Replay) / Integration Tests (, , , client=, vision=) (push) Failing after 3s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 4s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 7s Details Update ReadTheDocs / update-readthedocs (push) Failing after 3s Details Test External API and Providers / test-external (venv) (push) Failing after 5s Details Vector IO Integration Tests / test-matrix (push) Failing after 7s Details Python Package Build Test / build (3.13) (push) Failing after 8s Details Python Package Build Test / build (3.12) (push) Failing after 8s Details Unit Tests / unit-tests (3.13) (push) Failing after 14s Details Unit Tests / unit-tests (3.12) (push) Failing after 14s Details UI Tests / ui-tests (22) (push) Successful in 1m7s Details Pre-commit / pre-commit (push) Successful in 2m34s Details One needed to specify record-replay related environment variables for running integration tests. We could not use defaults because integration tests could be run against Ollama instances which could be running different models. For example, text vs vision tests needed separate instances of Ollama because a single instance typically cannot serve both of these models if you assume the standard CI worker configuration on Github. As a result, `client.list()` as returned by the Ollama client would be different between these runs and we'd end up overwriting responses. This PR "solves" it by adding a small amount of complexity -- we store model list responses specially, keyed by the hashes of the models they return. At replay time, we merge all of them and pretend that we have the union of all models available. ## Test Plan Re-recorded all the tests using `scripts/integration-tests.sh --inference-mode record`, including the vision tests.	2025-09-03 11:33:03 -07:00
Ashwin Bharambe	eb07a0f86a	fix(ci, tests): ensure uv environments in CI are kosher, record tests (#3193 ) Some checks failed Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 21s Details Test Llama Stack Build / build-single-provider (push) Failing after 23s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 28s Details Test Llama Stack Build / generate-matrix (push) Successful in 25s Details Python Package Build Test / build (3.13) (push) Failing after 25s Details Test Llama Stack Build / build-custom-container-distribution (push) Failing after 34s Details Integration Tests (Replay) / Integration Tests (, , , client=, vision=) (push) Failing after 37s Details Test External API and Providers / test-external (venv) (push) Failing after 33s Details Unit Tests / unit-tests (3.13) (push) Failing after 33s Details Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 38s Details Python Package Build Test / build (3.12) (push) Failing after 1m0s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 1m4s Details Unit Tests / unit-tests (3.12) (push) Failing after 59s Details Test Llama Stack Build / build (push) Failing after 50s Details Vector IO Integration Tests / test-matrix (push) Failing after 1m48s Details UI Tests / ui-tests (22) (push) Successful in 2m12s Details Pre-commit / pre-commit (push) Successful in 2m41s Details I started this PR trying to unbreak a newly broken test `test_agent_name`. This test was broken all along but did not show up because during testing we were pulling the "non-updated" llama stack client. See this comment: https://github.com/llamastack/llama-stack/pull/3119#discussion_r2270988205 While fixing this, I encountered a large amount of badness in our CI workflow definitions. - We weren't passing `LLAMA_STACK_DIR` or `LLAMA_STACK_CLIENT_DIR` overrides to `llama stack build` at all in some cases. - Even when we did, we used `uv run` liberally. The first thing `uv run` does is "syncs" the project environment. This means, it is going to undo any mutations we might have done ourselves. But we make many mutations in our CI runners to these environments. The most important of which is why `llama stack build` where we install distro dependencies. As a result, when you tried to run the integration tests, you would see old, strange versions. ## Test Plan Re-record using: ``` sh scripts/integration-tests.sh --stack-config ci-tests \ --provider ollama --test-pattern test_agent_name --inference-mode record ``` Then re-run with `--inference-mode replay`. But: Eventually, this test turned out to be quite flaky for telemetry reasons. I haven't investigated it for now and just disabled it sadly since we have a release to push out.	2025-08-18 17:02:24 -07:00
Ashwin Bharambe	27d6becfd0	fix(misc): pin openai dependency to < 1.100.0 (#3192 ) This OpenAI client release `0843a11164` ends up breaking litellm `169a17400f/litellm/types/llms/openai.py (L40)` Update the dependency pin. Also make the imports a bit more defensive anyhow if something else during `llama stack build` ends up moving openai to a previous version. ## Test Plan Run pre-release script integration tests.	2025-08-18 12:20:50 -07:00
Ashwin Bharambe	f4cecaade9	chore(ci): dont run llama stack server always (#3188 ) Sometimes the server has already been started (e.g., via docker). Just a convenience here so we can reuse this script more.	2025-08-18 10:11:55 -07:00
Ashwin Bharambe	f4ccdee200	fix(ci): skip batches directory for library client testing	2025-08-15 15:30:03 -07:00
Ashwin Bharambe	0e8bb94bf3	feat(ci): make recording workflow simpler, more parameterizable (#3169 ) Some checks failed SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 1s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Python Package Build Test / build (3.13) (push) Failing after 4s Details Integration Tests (Replay) / Integration Tests (, , , client=, vision=) (push) Failing after 7s Details Python Package Build Test / build (3.12) (push) Failing after 12s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 14s Details Update ReadTheDocs / update-readthedocs (push) Failing after 12s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 17s Details Test External API and Providers / test-external (venv) (push) Failing after 15s Details Vector IO Integration Tests / test-matrix (push) Failing after 28s Details Unit Tests / unit-tests (3.12) (push) Failing after 27s Details Unit Tests / unit-tests (3.13) (push) Failing after 51s Details Pre-commit / pre-commit (push) Successful in 2m6s Details # What does this PR do? Recording tests has become a nightmare. This is the first part of making that process simpler by making it _less_ automatic. I tried to be too clever earlier. It simplifies the record-integration-tests workflow to use workflow dispatch inputs instead of PR labels. No more opaque stuff. Just go to the GitHub UI and run the workflow with inputs. I will soon add a helper script for this also. Other things to aid re-running just the small set of things you need to re-record: - Replaces the `test-types` JSON array parameter with a more intuitive `test-subdirs` comma-separated list. The whole JSON array crap was for matrix. - Adds a new `test-pattern` parameter to allow filtering tests using pytest's `-k` option ## Test Plan Note that this PR is in a fork not the source repository. - Replay tests on this PR are green - Manually [ran](`1699856292`) the replay workflow with a test-subdir and test-pattern filter, worked - Manually [ran](`4819508034`) the record workflow with a simple pattern, it has worked and updated _this_ PR. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>	2025-08-15 14:47:20 -07:00
ehhuang	bb6b6041d6	chore: fix: integration tests failures marked as successful (#3039 )	2025-08-04 17:06:28 -07:00
ehhuang	f2eee4e417	chore: create integration-tests script (#3016 ) Some checks failed Integration Tests (Replay) / discover-tests (push) Successful in 5s Details Test External Providers Installed via Module / test-external-providers-from-module (venv) (push) Has been skipped Details Vector IO Integration Tests / test-matrix (3.12, remote::pgvector) (push) Failing after 12s Details Vector IO Integration Tests / test-matrix (3.12, remote::chromadb) (push) Failing after 30s Details Python Package Build Test / build (3.13) (push) Failing after 24s Details Vector IO Integration Tests / test-matrix (3.13, inline::milvus) (push) Failing after 28s Details Integration Tests (Replay) / run-replay-mode-tests (push) Failing after 19s Details Unit Tests / unit-tests (3.13) (push) Failing after 23s Details Test External API and Providers / test-external (venv) (push) Failing after 25s Details Vector IO Integration Tests / test-matrix (3.12, inline::faiss) (push) Failing after 36s Details Vector IO Integration Tests / test-matrix (3.12, remote::qdrant) (push) Failing after 36s Details Unit Tests / unit-tests (3.12) (push) Failing after 27s Details SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 40s Details Python Package Build Test / build (3.12) (push) Failing after 33s Details Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 44s Details Vector IO Integration Tests / test-matrix (3.13, remote::weaviate) (push) Failing after 37s Details Vector IO Integration Tests / test-matrix (3.12, inline::milvus) (push) Failing after 44s Details Vector IO Integration Tests / test-matrix (3.13, remote::chromadb) (push) Failing after 39s Details Vector IO Integration Tests / test-matrix (3.12, remote::weaviate) (push) Failing after 43s Details SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 49s Details Vector IO Integration Tests / test-matrix (3.13, inline::faiss) (push) Failing after 44s Details Vector IO Integration Tests / test-matrix (3.13, remote::pgvector) (push) Failing after 42s Details Vector IO Integration Tests / test-matrix (3.12, inline::sqlite-vec) (push) Failing after 46s Details Vector IO Integration Tests / test-matrix (3.13, remote::qdrant) (push) Failing after 58s Details Vector IO Integration Tests / test-matrix (3.13, inline::sqlite-vec) (push) Failing after 1m0s Details Pre-commit / pre-commit (push) Successful in 2m22s Details	2025-08-01 17:38:49 -07:00

35 commits