Roy Belio 2025-12-03 01:04:09 +00:00 committed by GitHub
commit 343cb3248c
8 changed files with 233 additions and 14 deletions

View file

@@ -255,6 +255,8 @@ server:
  cors: true  # Optional: Enable CORS (dev mode) or full config object
```
**Production Server**: On Unix-based systems (Linux, macOS), Llama Stack automatically uses Gunicorn with Uvicorn workers for production-grade multi-process performance. The server behavior can be customized using environment variables (e.g., `GUNICORN_WORKERS`, `GUNICORN_WORKER_CONNECTIONS`). See [Starting a Llama Stack Server](./starting_llama_stack_server#production-server-configuration-unixlinuxmacos) for complete configuration details.
### CORS Configuration

CORS (Cross-Origin Resource Sharing) can be configured in two ways:

View file

@@ -75,6 +75,32 @@ The following environment variables can be configured:
### Server Configuration

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
### Production Server Configuration (Unix/Linux/macOS only)
On Unix-based systems (Linux, macOS), the server automatically uses Gunicorn with Uvicorn workers for production-grade performance. The following environment variables control Gunicorn behavior:
- `GUNICORN_WORKERS` or `WEB_CONCURRENCY`: Number of worker processes (default: `(2 * CPU cores) + 1`)
- `GUNICORN_WORKER_CONNECTIONS`: Max concurrent connections per worker (default: `1000`)
- `GUNICORN_TIMEOUT`: Worker timeout in seconds (default: `120`)
- `GUNICORN_KEEPALIVE`: Connection keepalive in seconds (default: `5`)
- `GUNICORN_MAX_REQUESTS`: Restart workers after N requests to prevent memory leaks (default: `10000`)
- `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
- `GUNICORN_PRELOAD`: Preload app before forking workers for memory efficiency (default: `true`)
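The worker count resolves in a fixed order: an explicit `GUNICORN_WORKERS` wins, then `WEB_CONCURRENCY`, then the CPU-based formula. A minimal Python sketch mirroring the resolution order used by the CLI code in this commit (illustrative, not the exact implementation):

```python
import multiprocessing
import os

# GUNICORN_WORKERS takes precedence, then WEB_CONCURRENCY,
# then the (2 * CPU cores) + 1 formula.
default_workers = (multiprocessing.cpu_count() * 2) + 1
num_workers = int(os.getenv("GUNICORN_WORKERS") or os.getenv("WEB_CONCURRENCY") or default_workers)

print(f"Would start {num_workers} workers")
```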
**Important Notes**:
- On Windows, the server automatically falls back to single-process Uvicorn.
- **Database Race Condition**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions (e.g., "table already exists" errors) as multiple workers simultaneously attempt to create database tables. To avoid this issue in production, set `GUNICORN_PRELOAD=true`.
- **SQLite with Multiple Workers**: SQLite works with Gunicorn's multi-process mode for development and low-to-moderate traffic scenarios. The system automatically enables WAL (Write-Ahead Logging) mode and sets a 5-second busy timeout. However, **SQLite only allows one writer at a time** - even with WAL mode, write operations from multiple workers are serialized, causing workers to wait for database locks under concurrent write load. **For production deployments with high traffic or multiple workers, we strongly recommend using PostgreSQL or another production-grade database** for true concurrent write performance.
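To make the SQLite note concrete, here is a minimal `sqlite3`-level sketch of the two settings described above (WAL mode and the 5-second busy timeout). This is illustrative only; Llama Stack applies equivalent settings internally through its own storage layer, and the database path here is hypothetical:

```python
import sqlite3

# Illustrative: the same two settings the system enables automatically.
conn = sqlite3.connect("llama_stack.db")  # hypothetical path

# WAL mode lets readers proceed while one writer appends to the log.
conn.execute("PRAGMA journal_mode=WAL")

# A 5-second busy timeout makes a blocked writer wait for the lock
# instead of failing immediately with "database is locked".
conn.execute("PRAGMA busy_timeout=5000")

conn.close()
```

Even with both settings, only one worker can write at a time, which is why PostgreSQL is recommended under concurrent write load.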
**Example production configuration:**
```bash
export GUNICORN_WORKERS=8 # 8 worker processes
export GUNICORN_WORKER_CONNECTIONS=1500 # 12,000 total concurrent capacity
export GUNICORN_PRELOAD=true # Enable for production
llama stack run starter
```
### API Keys for Hosted Providers

- `OPENAI_API_KEY`: OpenAI API key
- `FIREWORKS_API_KEY`: Fireworks API key

View file

@@ -23,6 +23,41 @@ Another simple way to start interacting with Llama Stack is to just spin up a co
If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally, see the [Kubernetes Deployment Guide](../deploying/kubernetes_deployment) for more details.
## Production Server Configuration (Unix/Linux/macOS)
On Unix-based systems (Linux, macOS), Llama Stack automatically uses **Gunicorn with Uvicorn workers** for production-grade multi-process performance. This provides:
- **Multi-process concurrency**: Automatically scales to `(2 × CPU cores) + 1` workers
- **Worker recycling**: Prevents memory leaks by restarting workers periodically
- **High throughput**: Tested at 698+ requests/second with sub-millisecond response times
- **Graceful degradation**: Automatically falls back to single-process Uvicorn on Windows
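The "graceful degradation" bullet above comes down to a platform-and-environment check; a sketch mirroring the logic added to the CLI in this commit:

```python
import os
import sys

# Gunicorn is used only on Linux/macOS, and only when not explicitly
# disabled via LLAMA_STACK_ENABLE_GUNICORN=false.
enable_gunicorn = (
    os.getenv("LLAMA_STACK_ENABLE_GUNICORN", "true").lower() == "true"
    and sys.platform in ("linux", "darwin")
)
```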
### Configuration
Configure Gunicorn behavior using environment variables:
- `GUNICORN_WORKERS` or `WEB_CONCURRENCY`: Number of worker processes (default: `(2 * CPU cores) + 1`)
- `GUNICORN_WORKER_CONNECTIONS`: Max concurrent connections per worker (default: `1000`)
- `GUNICORN_TIMEOUT`: Worker timeout in seconds (default: `120`)
- `GUNICORN_KEEPALIVE`: Connection keepalive in seconds (default: `5`)
- `GUNICORN_MAX_REQUESTS`: Restart workers after N requests to prevent memory leaks (default: `10000`)
- `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
- `GUNICORN_PRELOAD`: Preload app before forking workers for memory efficiency (default: `true`)
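These variables are read once at startup and translated directly into Gunicorn command-line flags. A partial sketch of that translation, mirroring the CLI code in this commit (not the complete flag set):

```python
import os

# Each GUNICORN_* variable maps to one Gunicorn flag.
flags = [
    "--timeout", os.getenv("GUNICORN_TIMEOUT", "120"),
    "--keep-alive", os.getenv("GUNICORN_KEEPALIVE", "5"),
    "--max-requests", os.getenv("GUNICORN_MAX_REQUESTS", "10000"),
    "--max-requests-jitter", os.getenv("GUNICORN_MAX_REQUESTS_JITTER", "1000"),
    "--worker-connections", os.getenv("GUNICORN_WORKER_CONNECTIONS", "1000"),
]
```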
**Important Notes**:
- When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions. To avoid this, set `GUNICORN_PRELOAD=true`.
- **SQLite with Multiple Workers**: SQLite works with Gunicorn's multi-process mode for development and low-to-moderate traffic scenarios. The system automatically enables WAL (Write-Ahead Logging) mode and sets a 5-second busy timeout. However, **SQLite only allows one writer at a time** - even with WAL mode, write operations from multiple workers are serialized, causing workers to wait for database locks under concurrent write load. **For production deployments with high traffic or multiple workers, we strongly recommend using PostgreSQL or another production-grade database** for true concurrent write performance.
**Example production configuration:**
```bash
export GUNICORN_WORKERS=8 # 8 worker processes
export GUNICORN_WORKER_CONNECTIONS=1500 # 12,000 total concurrent capacity
export GUNICORN_PRELOAD=true # Enable for production
llama stack run starter
```
For more details on distribution-specific configuration, see the [Starter Distribution](./self_hosted_distro/starter) or [NVIDIA Distribution](./self_hosted_distro/nvidia) documentation.
## Configure logging

Control log output via environment variables before starting the server.

View file

@@ -44,6 +44,7 @@ dependencies = [
    "h11>=0.16.0",
    "python-multipart>=0.0.20",  # For fastapi Form
    "uvicorn>=0.34.0",  # server
    "gunicorn>=23.0.0",  # production server for Unix systems
    "opentelemetry-sdk>=1.30.0",  # server
    "opentelemetry-exporter-otlp-proto-http>=1.30.0",  # server
    "aiosqlite>=0.21.0",  # server - for metadata store

View file

@@ -181,9 +181,15 @@ class StackRun(Subcommand):
        except AttributeError as e:
            self.parser.error(f"failed to parse config file '{config_file}':\n {e}")

        self._run_server(config_file, args)

    def _run_server(self, config_file: Path | None, args: argparse.Namespace) -> None:
        """
        Run the Llama Stack server using either Gunicorn (on Unix systems) or Uvicorn (on Windows or when disabled).

        On Unix systems (Linux/macOS), defaults to Gunicorn with Uvicorn workers for production-grade multi-process
        performance. Falls back to single-process Uvicorn on Windows or when LLAMA_STACK_ENABLE_GUNICORN=false.
        """
        if not config_file:
            self.parser.error("Config file is required")
@@ -229,20 +235,154 @@ class StackRun(Subcommand):
        logger.info(f"Listening on {host}:{port}")

        # We need to catch KeyboardInterrupt because both Uvicorn and Gunicorn's signal handling
        # can raise SIGINT signals, which Python converts to KeyboardInterrupt. Without this catch,
        # we'd get a confusing stack trace when using Ctrl+C or kill -2 (SIGINT).
        # SIGTERM (kill -15) works fine without this because Python doesn't have a default handler for it.
        try:
            # Check if Gunicorn should be enabled
            # Default to true on Unix systems, can be disabled via environment variable
            enable_gunicorn = os.getenv("LLAMA_STACK_ENABLE_GUNICORN", "true").lower() == "true" and sys.platform in (
                "linux",
                "darwin",
            )

            if enable_gunicorn:
                # On Unix-like systems, use Gunicorn with Uvicorn workers for production-grade performance
                try:
                    self._run_with_gunicorn(host, port, uvicorn_config)
                except (FileNotFoundError, subprocess.CalledProcessError) as e:
                    # Gunicorn not available or failed to start - fall back to Uvicorn
                    logger.warning(f"Gunicorn unavailable or failed to start: {e}")
                    logger.info("Falling back to single-process Uvicorn server...")
                    uvicorn.run("llama_stack.core.server.server:create_app", **uvicorn_config)  # type: ignore[arg-type]
            else:
                # Fall back to Uvicorn for:
                # - Windows systems (Gunicorn not supported)
                # - Unix systems with LLAMA_STACK_ENABLE_GUNICORN=false (for testing/debugging)
                if sys.platform not in ("linux", "darwin"):
                    logger.info("Using single-process Uvicorn server (Gunicorn not supported on this platform)")
                else:
                    logger.info(
                        "Using single-process Uvicorn server (Gunicorn disabled via LLAMA_STACK_ENABLE_GUNICORN=false)"
                    )
                uvicorn.run("llama_stack.core.server.server:create_app", **uvicorn_config)  # type: ignore[arg-type]
        except (KeyboardInterrupt, SystemExit):
            logger.info("Received interrupt signal, shutting down gracefully...")
    def _run_with_gunicorn(self, host: str | list[str], port: int, uvicorn_config: dict) -> None:
        """
        Run the server using Gunicorn with Uvicorn workers.

        This provides production-grade multi-process performance on Unix systems.
        """
        import logging  # allow-direct-logging
        import multiprocessing

        # Calculate number of workers: (2 * CPU cores) + 1 is a common formula
        # Can be overridden by WEB_CONCURRENCY or GUNICORN_WORKERS environment variable
        default_workers = (multiprocessing.cpu_count() * 2) + 1
        num_workers = int(os.getenv("GUNICORN_WORKERS") or os.getenv("WEB_CONCURRENCY") or default_workers)

        # Handle host configuration - Gunicorn expects a single bind address
        # Uvicorn can accept a list of hosts, but Gunicorn binds to one address
        bind_host = host[0] if isinstance(host, list) else host
        # IPv6 addresses need to be wrapped in brackets
        if ":" in bind_host and not bind_host.startswith("["):
            bind_address = f"[{bind_host}]:{port}"
        else:
            bind_address = f"{bind_host}:{port}"

        # Map Python logging level to Gunicorn log level string (from uvicorn_config)
        log_level_map = {
            logging.CRITICAL: "critical",
            logging.ERROR: "error",
            logging.WARNING: "warning",
            logging.INFO: "info",
            logging.DEBUG: "debug",
        }
        log_level = uvicorn_config.get("log_level", logging.INFO)
        gunicorn_log_level = log_level_map.get(log_level, "info")

        # Worker timeout - longer for async workers, configurable via env var
        timeout = int(os.getenv("GUNICORN_TIMEOUT", "120"))
        # Worker connections - concurrent connections per worker
        worker_connections = int(os.getenv("GUNICORN_WORKER_CONNECTIONS", "1000"))
        # Worker recycling to prevent memory leaks
        max_requests = int(os.getenv("GUNICORN_MAX_REQUESTS", "10000"))
        max_requests_jitter = int(os.getenv("GUNICORN_MAX_REQUESTS_JITTER", "1000"))
        # Keep-alive for connection reuse
        keepalive = int(os.getenv("GUNICORN_KEEPALIVE", "5"))

        # Build Gunicorn command
        gunicorn_command = [
            "gunicorn",
            "-k",
            "uvicorn.workers.UvicornWorker",
            "--workers",
            str(num_workers),
            "--worker-connections",
            str(worker_connections),
            "--bind",
            bind_address,
            "--timeout",
            str(timeout),
            "--keep-alive",
            str(keepalive),
            "--max-requests",
            str(max_requests),
            "--max-requests-jitter",
            str(max_requests_jitter),
            "--log-level",
            gunicorn_log_level,
            "--access-logfile",
            "-",  # Log to stdout
            "--error-logfile",
            "-",  # Log to stderr
        ]
        # Preload app for memory efficiency (enabled by default; set GUNICORN_PRELOAD=false
        # to disable if preloading causes import issues). Preloading also avoids database
        # initialization races between workers.
        if os.getenv("GUNICORN_PRELOAD", "true").lower() == "true":
            gunicorn_command.append("--preload")

        # Add SSL configuration if present (from uvicorn_config)
        if uvicorn_config.get("ssl_keyfile") and uvicorn_config.get("ssl_certfile"):
            gunicorn_command.extend(
                [
                    "--keyfile",
                    uvicorn_config["ssl_keyfile"],
                    "--certfile",
                    uvicorn_config["ssl_certfile"],
                ]
            )
        if uvicorn_config.get("ssl_ca_certs"):
            gunicorn_command.extend(["--ca-certs", uvicorn_config["ssl_ca_certs"]])

        # Add the application
        gunicorn_command.append("llama_stack.core.server.server:create_app")

        # Log comprehensive configuration
        logger.info(f"Starting Gunicorn server with {num_workers} workers on {bind_address}...")
        logger.info("Using Uvicorn workers for ASGI application support")
        logger.info(
            f"Configuration: {worker_connections} connections/worker, {timeout}s timeout, {keepalive}s keepalive"
        )
        logger.info(f"Worker recycling: every {max_requests}±{max_requests_jitter} requests (prevents memory leaks)")
        logger.info(f"Total concurrent capacity: {num_workers * worker_connections} connections")

        # Warn if using SQLite with multiple workers
        if num_workers > 1 and os.getenv("SQLITE_STORE_DIR"):
            logger.warning("SQLite detected with multiple Gunicorn workers - writes will be serialized.")

        # Execute the Gunicorn command
        # If Gunicorn is not found or fails to start, raise the exception for the caller to handle
        subprocess.run(gunicorn_command, check=True)
    def _start_ui_development_server(self, stack_server_port: int):
        logger.info("Attempting to start UI development server...")

        # Check if npm is available

View file

@@ -18,6 +18,7 @@ from tests.integration.telemetry.collectors import InMemoryTelemetryManager, Otl
# TODO: Fix this to work with Automatic Instrumentation
@pytest.fixture(scope="session")
def telemetry_test_collector():
    # Stack mode is set by integration-tests.sh based on STACK_CONFIG
    stack_mode = os.environ.get("LLAMA_STACK_TEST_STACK_CONFIG_TYPE", "library_client")

    if stack_mode == "server":

View file

@@ -292,8 +292,8 @@ def test_providers_flag_generates_config_with_api_keys():
        enable_ui=False,
    )

    # Mock _run_server to prevent starting a server
    with patch.object(stack_run, "_run_server"):
        stack_run._run_stack_run_cmd(args)

    # Read the generated config file

uv.lock generated
View file

@@ -1419,6 +1419,18 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/5a/96/44759eca966720d0f3e1b105c43f8ad4590c97bf8eb3cd489656e9590baa/grpcio-1.67.1-cp313-cp313-win_amd64.whl", hash = "sha256:fa0c739ad8b1996bd24823950e3cb5152ae91fca1c09cc791190bf1627ffefba", size = 4346042, upload-time = "2024-10-29T06:25:21.939Z" },
]
[[package]]
name = "gunicorn"
version = "23.0.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "packaging" },
]
sdist = { url = "https://files.pythonhosted.org/packages/34/72/9614c465dc206155d93eff0ca20d42e1e35afc533971379482de953521a4/gunicorn-23.0.0.tar.gz", hash = "sha256:f014447a0101dc57e294f6c18ca6b40227a4c90e9bdb586042628030cba004ec", size = 375031, upload-time = "2024-08-10T20:25:27.378Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/cb/7d/6dac2a6e1eba33ee43f318edbed4ff29151a49b5d37f080aad1e6469bca4/gunicorn-23.0.0-py3-none-any.whl", hash = "sha256:ec400d38950de4dfd418cff8328b2c8faed0edb0d517d3394e457c317908ca4d", size = 85029, upload-time = "2024-08-10T20:25:24.996Z" },
]
[[package]]
name = "h11"
version = "0.16.0"
@@ -1998,6 +2010,7 @@ dependencies = [
    { name = "asyncpg" },
    { name = "fastapi" },
    { name = "fire" },
    { name = "gunicorn" },
    { name = "h11" },
    { name = "httpx" },
    { name = "jinja2" },
@@ -2149,6 +2162,7 @@ requires-dist = [
    { name = "asyncpg" },
    { name = "fastapi", specifier = ">=0.115.0,<1.0" },
    { name = "fire" },
    { name = "gunicorn", specifier = ">=23.0.0" },
    { name = "h11", specifier = ">=0.16.0" },
    { name = "httpx" },
    { name = "jinja2", specifier = ">=3.1.6" },