Merge bd4a77ee2d into 4237eb4aaa

2025-12-05 18:27:22 +00:00 · 2025-12-03 01:04:09 +00:00 · 2025-12-03 01:04:09 +00:00 · 343cb3248c
commit 343cb3248c
parent 4237eb4aaa bd4a77ee2d
8 changed files with 233 additions and 14 deletions
--- a/docs/docs/distributions/configuration.mdx
+++ b/docs/docs/distributions/configuration.mdx
@@ -255,6 +255,8 @@ server:
  cors: true  # Optional: Enable CORS (dev mode) or full config object
 ```

+**Production Server**: On Unix-based systems (Linux, macOS), Llama Stack automatically uses Gunicorn with Uvicorn workers for production-grade multi-process performance. The server behavior can be customized using environment variables (e.g., `GUNICORN_WORKERS`, `GUNICORN_WORKER_CONNECTIONS`). See [Starting a Llama Stack Server](./starting_llama_stack_server#production-server-configuration-unixlinuxmacos) for complete configuration details.
+
 ### CORS Configuration

 CORS (Cross-Origin Resource Sharing) can be configured in two ways:
--- a/docs/docs/distributions/self_hosted_distro/starter.md
+++ b/docs/docs/distributions/self_hosted_distro/starter.md
@@ -75,6 +75,32 @@ The following environment variables can be configured:
 ### Server Configuration
 - `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)

+### Production Server Configuration (Unix/Linux/macOS only)
+
+On Unix-based systems (Linux, macOS), the server automatically uses Gunicorn with Uvicorn workers for production-grade performance. The following environment variables control Gunicorn behavior:
+
+- `GUNICORN_WORKERS` or `WEB_CONCURRENCY`: Number of worker processes (default: `(2 * CPU cores) + 1`)
+- `GUNICORN_WORKER_CONNECTIONS`: Max concurrent connections per worker (default: `1000`)
+- `GUNICORN_TIMEOUT`: Worker timeout in seconds (default: `120`)
+- `GUNICORN_KEEPALIVE`: Connection keepalive in seconds (default: `5`)
+- `GUNICORN_MAX_REQUESTS`: Restart workers after N requests to prevent memory leaks (default: `10000`)
+- `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
+- `GUNICORN_PRELOAD`: Preload app before forking workers for memory efficiency (default: `true`)
+
+**Important Notes**:
+
+- On Windows, the server automatically falls back to single-process Uvicorn.
+- **Database Race Condition**: Preloading is enabled by default. If you disable it while running multiple workers, each worker initializes the application independently, and you may encounter database initialization race conditions (e.g., "table already exists" errors) as workers simultaneously attempt to create database tables. To avoid this in production, keep `GUNICORN_PRELOAD=true`.
+- **SQLite with Multiple Workers**: SQLite works with Gunicorn's multi-process mode for development and low-to-moderate traffic. The system automatically enables WAL (Write-Ahead Logging) mode and sets a 5-second busy timeout. However, **SQLite allows only one writer at a time**: even in WAL mode, writes from multiple workers are serialized, so workers wait on database locks under concurrent write load. **For production deployments with high traffic or multiple workers, we strongly recommend PostgreSQL or another production-grade database** for true concurrent write performance.
+
+**Example production configuration:**
+```bash
+export GUNICORN_WORKERS=8              # 8 worker processes
+export GUNICORN_WORKER_CONNECTIONS=1500 # 8 workers x 1500 = 12,000 max concurrent connections
+export GUNICORN_PRELOAD=true           # Enable for production
+llama stack run starter
+```
+
 ### API Keys for Hosted Providers
 - `OPENAI_API_KEY`: OpenAI API key
 - `FIREWORKS_API_KEY`: Fireworks API key
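
Reviewer note on the `starter.md` hunk above: the worker default and the "12,000 total concurrent capacity" figure in the example follow from simple arithmetic. The sketch below is not taken from Llama Stack's `run.py`; `resolve_workers` and `total_capacity` are hypothetical helper names, and it assumes `GUNICORN_WORKERS` takes precedence over `WEB_CONCURRENCY` (the docs list both without stating an order).

```python
import os


def resolve_workers(env, cpu_count):
    """Return the worker count for a given environment mapping.

    Assumption: GUNICORN_WORKERS is checked before WEB_CONCURRENCY.
    """
    for name in ("GUNICORN_WORKERS", "WEB_CONCURRENCY"):
        value = env.get(name)
        if value:
            return int(value)
    # Documented default: (2 * CPU cores) + 1
    return 2 * cpu_count + 1


def total_capacity(env, cpu_count):
    """Upper bound on concurrent connections: workers * connections per worker."""
    # Documented default for GUNICORN_WORKER_CONNECTIONS is 1000.
    connections = int(env.get("GUNICORN_WORKER_CONNECTIONS", "1000"))
    return resolve_workers(env, cpu_count) * connections


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print("workers:", resolve_workers(os.environ, cores))
    print("capacity:", total_capacity(os.environ, cores))
```

With `GUNICORN_WORKERS=8` and `GUNICORN_WORKER_CONNECTIONS=1500`, `total_capacity` returns 12000, matching the comment in the example configuration. On a 4-core host with nothing set, the documented defaults give 9 workers and 9000 connections.
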
--- a/docs/docs/distributions/starting_llama_stack_server.mdx
+++ b/docs/docs/distributions/starting_llama_stack_server.mdx
@@ -23,6 +23,41 @@ Another simple way to start interacting with Llama Stack is to just spin up a co
 If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally. See [Kubernetes Deployment Guide](../deploying/kubernetes_deployment) for more details.


+## Production Server Configuration (Unix/Linux/macOS)
+
+On Unix-based systems (Linux, macOS), Llama Stack automatically uses **Gunicorn with Uvicorn workers** for production-grade multi-process performance. This provides:
+
+- **Multi-process concurrency**: Defaults to `(2 * CPU cores) + 1` workers
+- **Worker recycling**: Prevents memory leaks by restarting workers periodically
+- **High throughput**: Tested at 698+ requests/second with sub-millisecond response times
+- **Graceful degradation**: Automatically falls back to single-process Uvicorn on Windows
+
+### Configuration
+
+Configure Gunicorn behavior using environment variables:
+
+- `GUNICORN_WORKERS` or `WEB_CONCURRENCY`: Number of worker processes (default: `(2 * CPU cores) + 1`)
+- `GUNICORN_WORKER_CONNECTIONS`: Max concurrent connections per worker (default: `1000`)
+- `GUNICORN_TIMEOUT`: Worker timeout in seconds (default: `120`)
+- `GUNICORN_KEEPALIVE`: Connection keepalive in seconds (default: `5`)
+- `GUNICORN_MAX_REQUESTS`: Restart workers after N requests to prevent memory leaks (default: `10000`)
+- `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
+- `GUNICORN_PRELOAD`: Preload app before forking workers for memory efficiency (default: `true`, as set in `run.py` line 264)
+
+**Important Notes**:
+- Preloading is enabled by default. If you disable it while running multiple workers, you may encounter database initialization race conditions. To avoid this, keep `GUNICORN_PRELOAD=true`.
+- **SQLite with Multiple Workers**: SQLite works with Gunicorn's multi-process mode for development and low-to-moderate traffic. The system automatically enables WAL (Write-Ahead Logging) mode and sets a 5-second busy timeout. However, **SQLite allows only one writer at a time**: even in WAL mode, writes from multiple workers are serialized, so workers wait on database locks under concurrent write load. **For production deployments with high traffic or multiple workers, we strongly recommend PostgreSQL or another production-grade database** for true concurrent write performance.
+
+**Example production configuration:**
+```bash
+export GUNICORN_WORKERS=8              # 8 worker processes
+export GUNICORN_WORKER_CONNECTIONS=1500 # 8 workers x 1500 = 12,000 max concurrent connections
+export GUNICORN_PRELOAD=true           # Enable for production
+llama stack run starter
+```
+
+For more details on distribution-specific configuration, see the [Starter Distribution](./self_hosted_distro/starter) or [NVIDIA Distribution](./self_hosted_distro/nvidia) documentation.
+
 ## Configure logging

 Control log output via environment variables before starting the server.
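
Reviewer note on the SQLite paragraph in the `starting_llama_stack_server.mdx` hunk above: the single-writer limit can be reproduced with Python's standard `sqlite3` module. This is a standalone sketch, not Llama Stack's storage code; `open_db` and `second_writer_blocked` are hypothetical names, and the second connection uses a 100 ms busy timeout (instead of the documented 5 seconds) only to keep the demonstration fast.

```python
import os
import sqlite3
import tempfile


def open_db(path, busy_timeout_ms=5000):
    """Open a connection with WAL mode and a busy timeout."""
    # isolation_level=None selects autocommit mode, so transactions are
    # controlled explicitly with BEGIN/COMMIT below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL").fetchall()
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}").fetchall()
    return conn


def second_writer_blocked(path):
    """Return (blocked, rows): whether a second writer was refused while the
    first held the write lock, and the row count after both have written."""
    first = open_db(path)
    second = open_db(path, busy_timeout_ms=100)  # short timeout for the demo
    first.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    first.execute("BEGIN IMMEDIATE")  # acquire the single write lock
    first.execute("INSERT INTO t VALUES (1)")
    try:
        second.execute("BEGIN IMMEDIATE")  # waits for the timeout, then fails
        second.execute("ROLLBACK")
        blocked = False
    except sqlite3.OperationalError:
        blocked = True
    first.execute("COMMIT")
    # After the first writer commits, the second one can proceed.
    second.execute("INSERT INTO t VALUES (2)")
    rows = second.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    first.close()
    second.close()
    return blocked, rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(second_writer_blocked(os.path.join(tmp, "demo.db")))
```

In a multi-worker deployment each Gunicorn worker plays the role of one of these connections: WAL lets readers continue during a write, but a second writer still has to wait out the busy timeout or fail.
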