Commit graph

17 commits

Author SHA1 Message Date
Roy Belio
bd4a77ee2d
Merge branch 'main' into feat/gunicorn-production-server 2025-11-27 14:26:00 +02:00
Charlie Doern
aac494c5ba
fix: bind to proper default hosts (#4232)
# What does this PR do?

We used to have `host = config.server.host or ["::", "0.0.0.0"]`, but
now we only bind to `host = config.server.host or "0.0.0.0"`.

Revert to the old logic; this allows us to curl
http://localhost:8321/v1/models on Fedora, which defaults to using IPv6.
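For illustration, a minimal sketch of the restored dual-stack default (only `config.server.host` comes from this PR; the helper name is ours):

```
def resolve_bind_hosts(configured_host: str | list[str] | None) -> list[str]:
    """Dual-stack default: bind both IPv6 and IPv4 when no host is configured."""
    host = configured_host or ["::", "0.0.0.0"]
    return host if isinstance(host, list) else [host]

assert resolve_bind_hosts(None) == ["::", "0.0.0.0"]
assert resolve_bind_hosts("127.0.0.1") == ["127.0.0.1"]
```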


resolves #4210

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-11-26 06:16:28 -05:00
Roy Belio
893d49c59e
Merge branch 'main' into feat/gunicorn-production-server 2025-11-24 12:08:57 +02:00
r-bit-rry
8fb237b6fb adding warning 2025-11-17 11:53:12 +02:00
ehhuang
95b0493fae
chore: move src/llama_stack/ui to src/llama_stack_ui (#4068)
# What does this PR do?
This better separates UI from backend code, which was often a point of
confusion for our beloved AI friends.


## Test Plan
CI
2025-11-04 15:21:49 -08:00
Roy Belio
241e189fee refactor: address PR feedback - improve naming, error handling, and documentation
Address all feedback from PR #3962:

**Code Quality Improvements:**
- Rename `_uvicorn_run` → `_run_server` for accurate method naming
- Refactor error handling: move Gunicorn fallback logic from `_run_with_gunicorn` to caller
- Update comments to reflect both Uvicorn and Gunicorn behavior
- Update test mock from `_uvicorn_run` to `_run_server`

**Environment Variable:**
- Change `LLAMA_STACK_DISABLE_GUNICORN` → `LLAMA_STACK_ENABLE_GUNICORN`
- More intuitive positive logic (no double negatives)
- Defaults to `true` on Unix systems
- Clearer log messages distinguishing platform limitations vs explicit disable
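
A sketch of the resulting toggle (env var name and Unix default from this commit; parsing details assumed):

```
import os
import sys

def gunicorn_enabled() -> bool:
    """Positive logic: defaults to true on Unix, always false on Windows."""
    if sys.platform == "win32":
        return False
    return os.environ.get("LLAMA_STACK_ENABLE_GUNICORN", "true").lower() == "true"
```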

**Documentation:**
- Remove unnecessary `uv sync --group unit --group test` from user docs
- Clarify SQLite limitations: "SQLite only allows one writer at a time"
- Accurate explanation: WAL mode enables concurrent reads but writes are serialized
- Strong recommendation for PostgreSQL in production with high traffic

**Architecture:**
- Better separation of concerns: `_run_with_gunicorn` just executes, caller handles fallback
- Exceptions propagate to caller for centralized decision making

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-04 16:29:47 +02:00
Ashwin Bharambe
b728307427
Merge branch 'main' into feat/gunicorn-production-server 2025-11-03 17:39:30 -08:00
Charlie Doern
30f8921240
fix: generate provider config when using --providers (#4044)
# What does this PR do?

Call the `sample_run_config` method for providers that have it when
generating a run config with `llama stack run --providers`. This
propagates API keys into the generated config.
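
A hedged sketch of that behavior (only `sample_run_config` is named in this PR; the surrounding code is illustrative):

```
def generate_provider_config(provider_cls) -> dict:
    """Prefer the provider's sample config so fields like api_key propagate."""
    if hasattr(provider_cls, "sample_run_config"):
        return provider_cls.sample_run_config()
    return {}
```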

resolves #4032


## Test Plan

A new unit test checks the output of `--providers` to ensure
`api_key` is present in the config.

manual testing:

```
╰─ llama stack list-deps --providers=inference=remote::openai --format uv | sh
Using Python 3.12.11 environment at: venv
Audited 7 packages in 8ms

╰─ llama stack run --providers=inference=remote::openai
INFO     2025-11-03 14:33:02,094 llama_stack.cli.stack.run:161 cli: Writing generated config to:
         /Users/charliedoern/.llama/distributions/providers-run/run.yaml
INFO     2025-11-03 14:33:02,096 llama_stack.cli.stack.run:169 cli: Using run configuration:
         /Users/charliedoern/.llama/distributions/providers-run/run.yaml
INFO     2025-11-03 14:33:02,099 llama_stack.cli.stack.run:228 cli: HTTPS enabled with certificates:
           Key: None
           Cert: None
INFO     2025-11-03 14:33:02,099 llama_stack.cli.stack.run:230 cli: Listening on 0.0.0.0:8321
INFO     2025-11-03 14:33:02,145 llama_stack.core.server.server:513 core::server: Run configuration:
INFO     2025-11-03 14:33:02,146 llama_stack.core.server.server:516 core::server: apis:
         - inference
         image_name: providers-run
         providers:
           inference:
           - config:
               api_key: '********'
               base_url: https://api.openai.com/v1
             provider_id: openai
             provider_type: remote::openai
         registered_resources:
           benchmarks: []
           datasets: []
           models: []
           scoring_fns: []
           shields: []
           tool_groups: []
           vector_stores: []
         server:
           port: 8321
           workers: 1
         storage:
           backends:
             kv_default:
               db_path: /Users/charliedoern/.llama/distributions/providers-run/kvstore.db
               type: kv_sqlite
             sql_default:
               db_path: /Users/charliedoern/.llama/distributions/providers-run/sql_store.db
               type: sql_sqlite
           stores:
             conversations:
               backend: sql_default
               table_name: openai_conversations
             inference:
               backend: sql_default
               max_write_queue_size: 10000
               num_writers: 4
               table_name: inference_store
             metadata:
               backend: kv_default
               namespace: registry
             prompts:
               backend: kv_default
               namespace: prompts
         telemetry:
           enabled: false
         version: 2

INFO     2025-11-03 14:33:02,299 llama_stack.providers.utils.inference.inference_store:74 inference: Write queue
         disabled for SQLite to avoid concurrency issues
INFO     2025-11-03 14:33:05,272 llama_stack.providers.utils.inference.openai_mixin:439 providers::utils:
         OpenAIInferenceAdapter.list_provider_model_ids() returned 105 models
INFO     2025-11-03 14:33:05,368 uvicorn.error:84 uncategorized: Started server process [69109]
INFO     2025-11-03 14:33:05,369 uvicorn.error:48 uncategorized: Waiting for application startup.
INFO     2025-11-03 14:33:05,370 llama_stack.core.server.server:172 core::server: Starting up Llama Stack server
         (version: 0.3.0)
INFO     2025-11-03 14:33:05,370 llama_stack.core.stack:495 core: starting registry refresh task
INFO     2025-11-03 14:33:05,370 uvicorn.error:62 uncategorized: Application startup complete.
INFO     2025-11-03 14:33:05,371 uvicorn.error:216 uncategorized: Uvicorn running on http://0.0.0.0:8321 (Press CTRL+C
         to quit)
INFO     2025-11-03 14:34:19,242 uvicorn.access:473 uncategorized: 127.0.0.1:63102 - "POST /v1/chat/completions
         HTTP/1.1" 200
```

client:

```
curl http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
 "model": "openai/gpt-5",
 "messages": [
     {"role": "user", "content": "What is 1 + 2"}
 ]
}'
{"id":"...","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"3","refusal":null,"role":"assistant","annotations":[],"audio":null,"function_call":null,"tool_calls":null}}],"created":1762198455,"model":"openai/gpt-5","object":"chat.completion","service_tier":"default","system_fingerprint":null,"usage":{"completion_tokens":10,"prompt_tokens":13,"total_tokens":23,"completion_tokens_details":{"accepted_prediction_tokens":0,"audio_tokens":0,"reasoning_tokens":0,"rejected_prediction_tokens":0},"prompt_tokens_details":{"audio_tokens":0,"cached_tokens":0}}}%
```

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-11-03 11:37:58 -08:00
Roy Belio
47bd994824
Merge branch 'main' into feat/gunicorn-production-server 2025-11-02 16:13:15 +02:00
Roy Belio
4a75f10758
Update src/llama_stack/cli/stack/run.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-02 16:10:52 +02:00
Charlie Doern
93401836b7
feat: llama stack run --providers (#3989)
# What does this PR do?

`llama stack run --providers` takes a list of providers in the format
api1=provider1,api2=provider2.

This allows users to run with a simple list of providers.

Given the architecture of `create_app`, this run config needs to be
written to disk; use ~/.llama/distributions/providers-run/run.yaml each
time for consistency.
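
As a sketch of the expected parsing (helper name hypothetical), the `api=provider` pairs split naturally on `,` and `=`:

```
def parse_provider_spec(spec: str) -> dict[str, str]:
    """Turn 'api1=provider1,api2=provider2' into {'api1': 'provider1', ...}."""
    pairs = (item.split("=", 1) for item in spec.split(",") if item)
    return {api.strip(): provider.strip() for api, provider in pairs}

assert parse_provider_spec("inference=remote::openai") == {"inference": "remote::openai"}
```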

resolves #3956

## Test Plan

New unit tests verify the `--providers` behavior.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-10-31 16:21:32 -07:00
Doug Edgar
e8cd8508b5
fix: handle missing external_providers_dir (#3974)
# What does this PR do?
This PR fixes the handling of the `external_providers_dir` configuration
field to align with its ongoing deprecation in favor of the provider
`module` specification approach.

It addresses the issue in #3950, where using the default provided
run.yaml config resulted in the `external_providers_dir` parameter being
set to the literal string `None`, crashing the llama-stack server on
startup.
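
A minimal sketch of the kind of guard this implies (the exact patch may differ):

```
from pathlib import Path

def parse_external_providers_dir(raw: str | None) -> Path | None:
    """Treat missing values and the literal string 'None' as unset."""
    if raw in (None, "None", ""):
        return None
    return Path(raw)
```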

Closes #3950 

## Test Plan

- Built a new container image from `podman build . -f
containers/Containerfile --build-arg DISTRO_NAME=starter --tag
llama-stack:starter`
- Tested it locally with `podman run -it localhost/llama-stack:starter`
- Tested it on an OpenShift 4.19 cluster, deployed via the
llama-stack-k8s-operator.

Signed-off-by: Doug Edgar <dedgar@redhat.com>
2025-10-30 17:01:31 -07:00
ehhuang
0e384a55a1
feat: support workers in run config (#3992)
# What does this PR do?


## Test Plan
Set `workers: 4` in run.yaml. Start the server and observe the logs
multiple times.
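
Presumably the config value feeds straight into the server launch; a minimal sketch, with the app import string assumed:

```
import uvicorn

# workers > 1 requires an import string rather than an app object.
uvicorn.run(
    "llama_stack.core.server.server:app",  # hypothetical import path
    host="0.0.0.0",
    port=8321,
    workers=4,
)
```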
2025-10-30 16:34:12 -07:00
Roy Belio
3e1d0060c1 fix: disable Gunicorn in telemetry tests to fix multi-process telemetry collection
Telemetry tests use an OTLP collector that expects single-process
telemetry spans. Gunicorn's multi-process architecture spawns multiple
workers, each with separate telemetry instrumentation, preventing the
test collector from capturing all spans.

This commit adds LLAMA_STACK_DISABLE_GUNICORN environment variable
support and sets it in telemetry test configuration to ensure
single-process Uvicorn is used during tests while maintaining
production multi-process behavior.
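
A sketch of how the test configuration presumably applies it (exact wiring assumed):

```
import os

# In the telemetry test setup: force single-process Uvicorn so the OTLP
# collector captures every span from a single process.
os.environ.setdefault("LLAMA_STACK_DISABLE_GUNICORN", "1")
```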

Fixes failing tests:
- test_streaming_chunk_count
- test_telemetry_format_completeness
2025-10-30 18:01:47 +02:00
Roy Belio
e72583cd9c feat(cli): use gunicorn to manage server workers on unix systems
Implement Gunicorn + Uvicorn deployment strategy for Unix systems to provide
multi-process parallelism and high-concurrency async request handling.

Key Features:
- Platform detection: Uses Gunicorn on Unix (Linux/macOS), falls back to
  Uvicorn on Windows
- Worker management: Auto-calculates workers as (2 * CPU cores) + 1 with
  env var overrides (GUNICORN_WORKERS, WEB_CONCURRENCY); see the sketch
  after this list
- Production optimizations:
  * Worker recycling (--max-requests, --max-requests-jitter) prevents memory leaks
  * Configurable worker connections (default: 1000 per worker)
  * Connection keepalive for improved performance
  * Automatic log level mapping from Python logging to Gunicorn
  * Optional --preload for memory efficiency (disabled by default)
- IPv6 support: Proper bind address formatting for IPv6 addresses
- SSL/TLS: Passes through certificate configuration from uvicorn_config
- Comprehensive logging: Reports workers, capacity, and configuration details
- Graceful fallback: Falls back to Uvicorn if Gunicorn not installed
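
A sketch of the worker auto-calculation described above (override precedence assumed):

```
import multiprocessing
import os

def default_worker_count() -> int:
    """(2 * CPU cores) + 1, overridable via GUNICORN_WORKERS or WEB_CONCURRENCY."""
    for var in ("GUNICORN_WORKERS", "WEB_CONCURRENCY"):
        if value := os.environ.get(var):
            return int(value)
    return 2 * multiprocessing.cpu_count() + 1
```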

Configuration via Environment Variables:
- GUNICORN_WORKERS / WEB_CONCURRENCY: Override worker count
- GUNICORN_WORKER_CONNECTIONS: Concurrent connections per worker
- GUNICORN_TIMEOUT: Worker timeout (default: 120s for async workers)
- GUNICORN_KEEPALIVE: Connection keepalive (default: 5s)
- GUNICORN_MAX_REQUESTS: Worker recycling interval (default: 10000)
- GUNICORN_MAX_REQUESTS_JITTER: Randomize restart timing (default: 1000)
- GUNICORN_PRELOAD: Enable app preloading for production (default: false)
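
Assembled into a command line, the mapping might look like this (flag set and defaults from this commit; assembly details assumed):

```
import os

def gunicorn_cmd(bind: str, workers: int) -> list[str]:
    """Map the env vars above onto Gunicorn flags."""
    env = os.environ.get
    return [
        "gunicorn",
        "--bind", bind,
        "--workers", str(workers),
        "--worker-class", "uvicorn.workers.UvicornWorker",
        "--worker-connections", env("GUNICORN_WORKER_CONNECTIONS", "1000"),
        "--timeout", env("GUNICORN_TIMEOUT", "120"),
        "--keep-alive", env("GUNICORN_KEEPALIVE", "5"),
        "--max-requests", env("GUNICORN_MAX_REQUESTS", "10000"),
        "--max-requests-jitter", env("GUNICORN_MAX_REQUESTS_JITTER", "1000"),
    ]
```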

Based on best practices from:
- DeepWiki analysis of encode/uvicorn and benoitc/gunicorn repositories
- Medium article: "Mastering Gunicorn and Uvicorn: The Right Way to Deploy
  FastAPI Applications"

Fixes:
- Avoids worker multiplication anti-pattern (nested workers)
- Proper IPv6 bind address formatting ([::]:port); sketched below
- Correct Gunicorn parameter names (--keep-alive vs --keepalive)
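
The IPv6 fix amounts to bracketing literal IPv6 hosts so host:port parses unambiguously; a minimal sketch:

```
def format_bind(host: str, port: int) -> str:
    """Bracket IPv6 literals for Gunicorn's --bind argument."""
    return f"[{host}]:{port}" if ":" in host else f"{host}:{port}"

assert format_bind("::", 8321) == "[::]:8321"
assert format_bind("0.0.0.0", 8321) == "0.0.0.0:8321"
```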

Dependencies:
- Added gunicorn>=23.0.0 to pyproject.toml

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-29 17:09:17 +02:00
Ashwin Bharambe
1d385b5b75
fix(mypy): resolve OpenAI SDK and provider type issues (#3936)
## Summary
- Fix OpenAI SDK NotGiven/Omit type mismatches in embeddings calls
- Fix incorrect OpenAIChatCompletionChunk import in vllm provider
- Refactor to avoid type:ignore comments by using conditional kwargs

## Changes
**openai_mixin.py (9 errors fixed):**
- Build kwargs conditionally for embeddings.create() to avoid
NotGiven/Omit mismatch
- Only include parameters when they have actual values (not None)

**gemini.py (9 errors fixed):**
- Apply same conditional kwargs pattern
- Add missing Any import

**vllm.py (2 errors fixed):**
- Use correct OpenAIChatCompletionChunk from llama_stack.apis.inference
- Remove incorrect alias from openai package

## Technical Notes
The OpenAI SDK has a type system quirk where `NOT_GIVEN` has type
`NotGiven` but parameter signatures expect `Omit`. By only passing
parameters with actual values, we avoid this mismatch entirely without
needing `# type: ignore` comments.
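
A hedged sketch of that pattern (parameter names illustrative, not the exact diff):

```
from typing import Any

def create_embeddings(client, model: str, texts: list[str],
                      dimensions: int | None = None, user: str | None = None):
    """Only pass parameters with real values; unset ones are omitted
    entirely, so the SDK's NotGiven/Omit defaults never clash."""
    kwargs: dict[str, Any] = {"model": model, "input": texts}
    if dimensions is not None:
        kwargs["dimensions"] = dimensions
    if user is not None:
        kwargs["user"] = user
    return client.embeddings.create(**kwargs)
```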

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-10-28 10:54:29 -07:00
Ashwin Bharambe
471b1b248b
chore(package): migrate to src/ layout (#3920)
Migrates package structure to src/ layout following Python packaging
best practices.

All code moved from `llama_stack/` to `src/llama_stack/`. Public API
unchanged - imports remain `import llama_stack.*`.

Updated build configs, pre-commit hooks, scripts, and GitHub workflows
accordingly. All hooks pass, package builds cleanly.

**Developer note**: Reinstall after pulling: `pip install -e .`
2025-10-27 12:02:21 -07:00
Renamed from llama_stack/cli/stack/run.py