refactor: address PR feedback - improve naming, error handling, and documentation

Address all feedback from PR #3962:

**Code Quality Improvements:**
- Rename `_uvicorn_run` → `_run_server` for accurate method naming
- Refactor error handling: move Gunicorn fallback logic from `_run_with_gunicorn` to caller
- Update comments to reflect both Uvicorn and Gunicorn behavior
- Update test mock from `_uvicorn_run` to `_run_server`

**Environment Variable:**
- Change `LLAMA_STACK_DISABLE_GUNICORN` → `LLAMA_STACK_ENABLE_GUNICORN`
- More intuitive positive logic (no double negatives)
- Defaults to `true` on Unix systems
- Clearer log messages distinguishing platform limitations vs explicit disable
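For reference, the check now reduces to one positive condition; a minimal sketch of that logic (it mirrors the `run.py` change in the diff below, nothing new is introduced):

```python
import os
import sys

# Positive logic: Gunicorn is used unless explicitly disabled, and only on Unix platforms.
enable_gunicorn = (
    os.getenv("LLAMA_STACK_ENABLE_GUNICORN", "true").lower() == "true"
    and sys.platform in ("linux", "darwin")
)
```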

**Documentation:**
- Remove unnecessary `uv sync --group unit --group test` from user docs
- Clarify SQLite limitations: "SQLite only allows one writer at a time"
- Accurate explanation: WAL mode enables concurrent reads but writes are serialized
- Strong recommendation for PostgreSQL in production with high traffic
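For illustration, a minimal `sqlite3` sketch of the WAL-plus-busy-timeout behavior the docs describe (the database path is hypothetical; the real configuration lives in the stack's storage layer, not in user code):

```python
import sqlite3

conn = sqlite3.connect("llama_stack.db")   # hypothetical path, for illustration only
conn.execute("PRAGMA journal_mode=WAL;")   # readers no longer block behind the writer
conn.execute("PRAGMA busy_timeout=5000;")  # wait up to 5 seconds for the write lock instead of erroring
# Even with WAL, only one connection can write at a time, so concurrent writers are serialized.
```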

**Architecture:**
- Better separation of concerns: `_run_with_gunicorn` just executes, caller handles fallback
- Exceptions propagate to caller for centralized decision making
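The resulting control flow, sketched with simplified signatures (the real methods take host/port/config arguments and log via the stack logger; see the `run.py` diff below):

```python
import subprocess

def _run_with_gunicorn(cmd: list[str]) -> None:
    """Just execute Gunicorn; any failure propagates to the caller."""
    subprocess.run(cmd, check=True)

def _run_server(cmd: list[str]) -> None:
    try:
        _run_with_gunicorn(cmd)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # Centralized fallback decision: the real code logs and starts single-process Uvicorn here.
        print(f"Gunicorn unavailable or failed to start: {exc}")
```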

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Roy Belio 2025-11-04 16:22:12 +02:00
parent 9ff881a28a
commit 241e189fee
10 changed files with 75 additions and 63 deletions


@@ -90,7 +90,8 @@ On Unix-based systems (Linux, macOS), the server automatically uses Gunicorn wit
 **Important Notes**:
 - On Windows, the server automatically falls back to single-process Uvicorn.
-- **Database Race Condition**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions (e.g., "table already exists" errors) as multiple workers simultaneously attempt to create database tables. To avoid this issue in production, set `GUNICORN_PRELOAD=true` and ensure all dependencies are installed with `uv sync --group unit --group test`.
+- **Database Race Condition**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions (e.g., "table already exists" errors) as multiple workers simultaneously attempt to create database tables. To avoid this issue in production, set `GUNICORN_PRELOAD=true`.
+- **SQLite with Multiple Workers**: SQLite works with Gunicorn's multi-process mode for development and low-to-moderate traffic scenarios. The system automatically enables WAL (Write-Ahead Logging) mode and sets a 5-second busy timeout. However, **SQLite only allows one writer at a time** - even with WAL mode, write operations from multiple workers are serialized, causing workers to wait for database locks under concurrent write load. **For production deployments with high traffic or multiple workers, we strongly recommend using PostgreSQL or another production-grade database** for true concurrent write performance.
 **Example production configuration:**
 ```bash


@@ -44,7 +44,9 @@ Configure Gunicorn behavior using environment variables:
 - `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
 - `GUNICORN_PRELOAD`: Preload app before forking workers for memory efficiency (default: `true`, as set in `run.py` line 264)
-**Important**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions. To avoid this, set `GUNICORN_PRELOAD=true` and install all dependencies with `uv sync --group unit --group test`.
+**Important Notes**:
+- When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions. To avoid this, set `GUNICORN_PRELOAD=true`.
+- **SQLite with Multiple Workers**: SQLite works with Gunicorn's multi-process mode for development and low-to-moderate traffic scenarios. The system automatically enables WAL (Write-Ahead Logging) mode and sets a 5-second busy timeout. However, **SQLite only allows one writer at a time** - even with WAL mode, write operations from multiple workers are serialized, causing workers to wait for database locks under concurrent write load. **For production deployments with high traffic or multiple workers, we strongly recommend using PostgreSQL or another production-grade database** for true concurrent write performance.
 **Example production configuration:**
 ```bash


@@ -1,7 +1,7 @@
 ---
 description: "Agents
 APIs for creating and interacting with agentic systems."
 sidebar_label: Agents
 title: Agents
 ---
@@ -12,6 +12,6 @@ title: Agents
 Agents
 APIs for creating and interacting with agentic systems.
 This section contains documentation for all available providers for the **agents** API.


@@ -1,14 +1,14 @@
 ---
 description: "The Batches API enables efficient processing of multiple requests in a single operation,
 particularly useful for processing large datasets, batch evaluation workflows, and
 cost-effective inference at scale.
 The API is designed to allow use of openai client libraries for seamless integration.
 This API provides the following extensions:
 - idempotent batch creation
 Note: This API is currently under active development and may undergo changes."
 sidebar_label: Batches
 title: Batches
 ---
@@ -18,14 +18,14 @@ title: Batches
 ## Overview
 The Batches API enables efficient processing of multiple requests in a single operation,
 particularly useful for processing large datasets, batch evaluation workflows, and
 cost-effective inference at scale.
 The API is designed to allow use of openai client libraries for seamless integration.
 This API provides the following extensions:
 - idempotent batch creation
 Note: This API is currently under active development and may undergo changes.
 This section contains documentation for all available providers for the **batches** API.


@@ -1,7 +1,7 @@
 ---
 description: "Evaluations
 Llama Stack Evaluation API for running evaluations on model and agent candidates."
 sidebar_label: Eval
 title: Eval
 ---
@@ -12,6 +12,6 @@ title: Eval
 Evaluations
 Llama Stack Evaluation API for running evaluations on model and agent candidates.
 This section contains documentation for all available providers for the **eval** API.


@@ -1,7 +1,7 @@
 ---
 description: "Files
 This API is used to upload documents that can be used with other Llama Stack APIs."
 sidebar_label: Files
 title: Files
 ---
@@ -12,6 +12,6 @@ title: Files
 Files
 This API is used to upload documents that can be used with other Llama Stack APIs.
 This section contains documentation for all available providers for the **files** API.


@@ -1,12 +1,12 @@
 ---
 description: "Inference
 Llama Stack Inference API for generating completions, chat completions, and embeddings.
 This API provides the raw interface to the underlying models. Three kinds of models are supported:
 - LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.
 - Embedding models: these models generate embeddings to be used for semantic search.
 - Rerank models: these models reorder the documents based on their relevance to a query."
 sidebar_label: Inference
 title: Inference
 ---
@@ -17,11 +17,11 @@ title: Inference
 Inference
 Llama Stack Inference API for generating completions, chat completions, and embeddings.
 This API provides the raw interface to the underlying models. Three kinds of models are supported:
 - LLM models: these models generate "raw" and "chat" (conversational) completions.
 - Embedding models: these models generate embeddings to be used for semantic search.
 - Rerank models: these models reorder the documents based on their relevance to a query.
 This section contains documentation for all available providers for the **inference** API.


@@ -1,7 +1,7 @@
 ---
 description: "Safety
 OpenAI-compatible Moderations API."
 sidebar_label: Safety
 title: Safety
 ---
@@ -12,6 +12,6 @@ title: Safety
 Safety
 OpenAI-compatible Moderations API.
 This section contains documentation for all available providers for the **safety** API.


@@ -181,9 +181,15 @@ class StackRun(Subcommand):
         except AttributeError as e:
             self.parser.error(f"failed to parse config file '{config_file}':\n {e}")

-        self._uvicorn_run(config_file, args)
+        self._run_server(config_file, args)

-    def _uvicorn_run(self, config_file: Path | None, args: argparse.Namespace) -> None:
+    def _run_server(self, config_file: Path | None, args: argparse.Namespace) -> None:
+        """
+        Run the Llama Stack server using either Gunicorn (on Unix systems) or Uvicorn (on Windows or when disabled).
+
+        On Unix systems (Linux/macOS), defaults to Gunicorn with Uvicorn workers for production-grade multi-process
+        performance. Falls back to single-process Uvicorn on Windows or when LLAMA_STACK_ENABLE_GUNICORN=false.
+        """
         if not config_file:
             self.parser.error("Config file is required")
@@ -229,27 +235,37 @@ class StackRun(Subcommand):
         logger.info(f"Listening on {host}:{port}")

-        # We need to catch KeyboardInterrupt because uvicorn's signal handling
-        # re-raises SIGINT signals using signal.raise_signal(), which Python
-        # converts to KeyboardInterrupt. Without this catch, we'd get a confusing
-        # stack trace when using Ctrl+C or kill -2 (SIGINT).
-        # SIGTERM (kill -15) works fine without this because Python doesn't
-        # have a default handler for it.
-        #
-        # Another approach would be to ignore SIGINT entirely - let uvicorn handle it through its own
-        # signal handling but this is quite intrusive and not worth the effort.
+        # We need to catch KeyboardInterrupt because both Uvicorn and Gunicorn's signal handling
+        # can raise SIGINT signals, which Python converts to KeyboardInterrupt. Without this catch,
+        # we'd get a confusing stack trace when using Ctrl+C or kill -2 (SIGINT).
+        # SIGTERM (kill -15) works fine without this because Python doesn't have a default handler for it.
         try:
-            # Check if Gunicorn should be disabled (for testing or debugging)
-            disable_gunicorn = os.getenv("LLAMA_STACK_DISABLE_GUNICORN", "false").lower() == "true"
+            # Check if Gunicorn should be enabled
+            # Default to true on Unix systems, can be disabled via environment variable
+            enable_gunicorn = os.getenv("LLAMA_STACK_ENABLE_GUNICORN", "true").lower() == "true" and sys.platform in (
+                "linux",
+                "darwin",
+            )

-            if not disable_gunicorn and sys.platform in ("linux", "darwin"):
+            if enable_gunicorn:
                 # On Unix-like systems, use Gunicorn with Uvicorn workers for production-grade performance
-                self._run_with_gunicorn(host, port, uvicorn_config)
+                try:
+                    self._run_with_gunicorn(host, port, uvicorn_config)
+                except (FileNotFoundError, subprocess.CalledProcessError) as e:
+                    # Gunicorn not available or failed to start - fall back to Uvicorn
+                    logger.warning(f"Gunicorn unavailable or failed to start: {e}")
+                    logger.info("Falling back to single-process Uvicorn server...")
+                    uvicorn.run("llama_stack.core.server.server:create_app", **uvicorn_config)  # type: ignore[arg-type]
             else:
-                # On other systems (e.g., Windows), fall back to Uvicorn directly
-                # Also used when LLAMA_STACK_DISABLE_GUNICORN=true (for tests)
-                if disable_gunicorn:
-                    logger.info("Gunicorn disabled via LLAMA_STACK_DISABLE_GUNICORN environment variable")
+                # Fall back to Uvicorn for:
+                # - Windows systems (Gunicorn not supported)
+                # - Unix systems with LLAMA_STACK_ENABLE_GUNICORN=false (for testing/debugging)
+                if sys.platform not in ("linux", "darwin"):
+                    logger.info("Using single-process Uvicorn server (Gunicorn not supported on this platform)")
+                else:
+                    logger.info(
+                        "Using single-process Uvicorn server (Gunicorn disabled via LLAMA_STACK_ENABLE_GUNICORN=false)"
+                    )
                 uvicorn.run("llama_stack.core.server.server:create_app", **uvicorn_config)  # type: ignore[arg-type]
         except (KeyboardInterrupt, SystemExit):
             logger.info("Received interrupt signal, shutting down gracefully...")
@@ -359,16 +375,9 @@ class StackRun(Subcommand):
         logger.info(f"Worker recycling: every {max_requests}±{max_requests_jitter} requests (prevents memory leaks)")
         logger.info(f"Total concurrent capacity: {num_workers * worker_connections} connections")

-        try:
-            # Execute the Gunicorn command
-            subprocess.run(gunicorn_command, check=True)
-        except FileNotFoundError:
-            logger.error("Error: 'gunicorn' command not found. Please ensure Gunicorn is installed.")
-            logger.error("Falling back to Uvicorn...")
-            uvicorn.run("llama_stack.core.server.server:create_app", **uvicorn_config)  # type: ignore[arg-type]
-        except subprocess.CalledProcessError as e:
-            logger.error(f"Failed to start Gunicorn server. Error: {e}")
-            sys.exit(1)
+        # Execute the Gunicorn command
+        # If Gunicorn is not found or fails to start, raise the exception for the caller to handle
+        subprocess.run(gunicorn_command, check=True)

     def _start_ui_development_server(self, stack_server_port: int):
         logger.info("Attempting to start UI development server...")


@@ -295,8 +295,8 @@ def test_providers_flag_generates_config_with_api_keys():
         enable_ui=False,
     )

-    # Mock _uvicorn_run to prevent starting a server
-    with patch.object(stack_run, "_uvicorn_run"):
+    # Mock _run_server to prevent starting a server
+    with patch.object(stack_run, "_run_server"):
         stack_run._run_stack_run_cmd(args)

     # Read the generated config file