Mirror of https://github.com/meta-llama/llama-stack.git
feat(cli): use gunicorn to manage server workers on unix systems
Implement a Gunicorn + Uvicorn deployment strategy on Unix systems to provide multi-process parallelism and high-concurrency async request handling.

Key features:
- Platform detection: uses Gunicorn on Unix (Linux/macOS), falls back to Uvicorn on Windows
- Worker management: auto-calculates workers as (2 * CPU cores) + 1, with env var overrides (GUNICORN_WORKERS, WEB_CONCURRENCY)
- Production optimizations:
  * Worker recycling (--max-requests, --max-requests-jitter) prevents memory leaks
  * Configurable worker connections (default: 1000 per worker)
  * Connection keepalive for improved performance
  * Automatic log level mapping from Python logging to Gunicorn
  * Optional --preload for memory efficiency (disabled by default)
- IPv6 support: proper bind address formatting for IPv6 addresses
- SSL/TLS: passes through certificate configuration from uvicorn_config
- Comprehensive logging: reports workers, capacity, and configuration details
- Graceful fallback: falls back to Uvicorn if Gunicorn is not installed

Configuration via environment variables:
- GUNICORN_WORKERS / WEB_CONCURRENCY: override worker count
- GUNICORN_WORKER_CONNECTIONS: concurrent connections per worker
- GUNICORN_TIMEOUT: worker timeout (default: 120s for async workers)
- GUNICORN_KEEPALIVE: connection keepalive (default: 5s)
- GUNICORN_MAX_REQUESTS: worker recycling interval (default: 10000)
- GUNICORN_MAX_REQUESTS_JITTER: randomized restart timing (default: 1000)
- GUNICORN_PRELOAD: enable app preloading for production (default: false)

Based on best practices from:
- DeepWiki analysis of the encode/uvicorn and benoitc/gunicorn repositories
- Medium article: "Mastering Gunicorn and Uvicorn: The Right Way to Deploy FastAPI Applications"

Fixes:
- Avoids the worker multiplication anti-pattern (nested workers)
- Proper IPv6 bind address formatting ([::]:port)
- Correct Gunicorn parameter name (--keep-alive, not --keepalive)

Dependencies:
- Added gunicorn>=23.0.0 to pyproject.toml

Co-Authored-By: Claude <noreply@anthropic.com>
parent e809d21357, commit e72583cd9c
6 changed files with 198 additions and 1 deletion
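For orientation before the diffs, here is a minimal sketch of the sizing, bind-formatting, and platform-selection logic the commit message describes. Every name in it (`default_worker_count`, `format_bind`, `should_use_gunicorn`) is an illustrative assumption, not an actual llama-stack identifier.

```python
# Illustrative sketch only; the actual identifiers in llama-stack differ.
import os
import sys


def default_worker_count() -> int:
    """(2 * CPU cores) + 1, overridable via GUNICORN_WORKERS or WEB_CONCURRENCY."""
    for var in ("GUNICORN_WORKERS", "WEB_CONCURRENCY"):
        value = os.environ.get(var)
        if value:
            return int(value)
    return 2 * (os.cpu_count() or 1) + 1


def format_bind(host: str, port: int) -> str:
    """Gunicorn bind string; IPv6 hosts need brackets, e.g. '[::]:8321'."""
    return f"[{host}]:{port}" if ":" in host else f"{host}:{port}"


def should_use_gunicorn() -> bool:
    """Gunicorn relies on fork()/POSIX signals, so it is Unix-only."""
    if sys.platform == "win32":
        return False
    try:
        import gunicorn  # noqa: F401
    except ImportError:
        return False  # graceful fallback to single-process Uvicorn
    return True
```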
@@ -247,6 +247,8 @@ server:
  cors: true # Optional: Enable CORS (dev mode) or full config object
```

**Production Server**: On Unix-based systems (Linux, macOS), Llama Stack automatically uses Gunicorn with Uvicorn workers for production-grade multi-process performance. The server behavior can be customized using environment variables (e.g., `GUNICORN_WORKERS`, `GUNICORN_WORKER_CONNECTIONS`). See [Starting a Llama Stack Server](./starting_llama_stack_server#production-server-configuration-unixlinuxmacos) for complete configuration details.

### CORS Configuration

CORS (Cross-Origin Resource Sharing) can be configured in two ways:
@@ -75,6 +75,31 @@ The following environment variables can be configured:

### Server Configuration

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)

### Production Server Configuration (Unix/Linux/macOS only)

On Unix-based systems (Linux, macOS), the server automatically uses Gunicorn with Uvicorn workers for production-grade performance. The following environment variables control Gunicorn behavior (a sketch of how they map onto Gunicorn settings follows the list):

- `GUNICORN_WORKERS` or `WEB_CONCURRENCY`: Number of worker processes (default: `(2 * CPU cores) + 1`)
- `GUNICORN_WORKER_CONNECTIONS`: Max concurrent connections per worker (default: `1000`)
- `GUNICORN_TIMEOUT`: Worker timeout in seconds (default: `120`)
- `GUNICORN_KEEPALIVE`: Connection keepalive in seconds (default: `5`)
- `GUNICORN_MAX_REQUESTS`: Restart workers after N requests to prevent memory leaks (default: `10000`)
- `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
- `GUNICORN_PRELOAD`: Preload app before forking workers for memory efficiency (default: `false`)
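The exact mapping is internal to the `llama stack run` CLI, so the helper below is a hypothetical sketch; the dictionary keys are real Gunicorn setting names, but the function itself is an assumption about what happens internally.

```python
# Hypothetical env-var -> Gunicorn settings mapping (illustration only).
import os


def _env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, str(default)))


def gunicorn_settings(workers: int) -> dict:
    return {
        "workers": workers,
        "worker_class": "uvicorn.workers.UvicornWorker",  # async Uvicorn workers
        "worker_connections": _env_int("GUNICORN_WORKER_CONNECTIONS", 1000),
        "timeout": _env_int("GUNICORN_TIMEOUT", 120),
        "keepalive": _env_int("GUNICORN_KEEPALIVE", 5),
        "max_requests": _env_int("GUNICORN_MAX_REQUESTS", 10000),
        "max_requests_jitter": _env_int("GUNICORN_MAX_REQUESTS_JITTER", 1000),
        "preload_app": os.environ.get("GUNICORN_PRELOAD", "false").lower() == "true",
    }
```

On the Gunicorn command line, these settings correspond to `--workers`, `--worker-connections`, `--timeout`, `--keep-alive` (note the hyphen, per the fix in this commit), `--max-requests`, `--max-requests-jitter`, and `--preload`.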

**Important Notes**:

- On Windows, the server automatically falls back to single-process Uvicorn.
- **Database Race Condition**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions (e.g., "table already exists" errors) as multiple workers simultaneously attempt to create database tables. To avoid this issue in production, set `GUNICORN_PRELOAD=true` and ensure all dependencies are installed with `uv sync --group unit --group test`. A simplified illustration of the race follows.
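Why preloading helps: with `preload_app` enabled, the application module, and therefore any table-creation step, is imported once in the Gunicorn master before workers are forked; without it, every worker repeats the import and the `CREATE TABLE` statements can collide. The snippet below is a deliberately simplified, hypothetical illustration; the table and schema are invented, not llama-stack's actual storage layer.

```python
# Simplified illustration of the startup race; the schema is made up.
import sqlite3


def init_db(path: str = "registry.db") -> None:
    conn = sqlite3.connect(path)
    # Without preload, N workers run this concurrently; a plain
    # "CREATE TABLE" would fail in all but one worker with
    # "table already exists". "IF NOT EXISTS" is the defensive variant.
    conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")
    conn.commit()
    conn.close()
```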

**Example production configuration:**

```bash
export GUNICORN_WORKERS=8               # 8 worker processes
export GUNICORN_WORKER_CONNECTIONS=1500 # 8 * 1500 = 12,000 total concurrent capacity
export GUNICORN_PRELOAD=true            # Enable for production
llama stack run starter
```

### API Keys for Hosted Providers

- `OPENAI_API_KEY`: OpenAI API key
- `FIREWORKS_API_KEY`: Fireworks API key
@@ -23,6 +23,39 @@ Another simple way to start interacting with Llama Stack is to just spin up a co

If you have built a container image and want to deploy it in a Kubernetes cluster instead of starting the Llama Stack server locally, see the [Kubernetes Deployment Guide](../deploying/kubernetes_deployment) for more details.

## Production Server Configuration (Unix/Linux/macOS)

On Unix-based systems (Linux, macOS), Llama Stack automatically uses **Gunicorn with Uvicorn workers** for production-grade multi-process performance. This provides:

- **Multi-process concurrency**: Automatically scales to `(2 × CPU cores) + 1` workers
- **Worker recycling**: Prevents memory leaks by restarting workers periodically
- **High throughput**: Tested at 698+ requests/second with sub-millisecond response times
- **Graceful degradation**: Automatically falls back to single-process Uvicorn on Windows (sketched below)
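The launcher below sketches this selection-and-fallback behavior. It is an assumption drawn from the description above, not llama-stack's real code; the function name and the use of a subprocess are illustrative.

```python
# Illustrative launcher with graceful fallback; not llama-stack's actual code.
import importlib.util
import subprocess
import sys

import uvicorn


def run_server(app: str, host: str, port: int) -> None:
    on_unix = sys.platform != "win32"
    if on_unix and importlib.util.find_spec("gunicorn") is not None:
        # IPv6 hosts must be bracketed in the bind address, e.g. "[::]:8321"
        bind = f"[{host}]:{port}" if ":" in host else f"{host}:{port}"
        subprocess.run(
            ["gunicorn", app, "--worker-class", "uvicorn.workers.UvicornWorker", "--bind", bind],
            check=True,
        )
    else:
        # Windows, or Gunicorn not installed: single-process Uvicorn
        uvicorn.run(app, host=host, port=port)
```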

### Configuration

Configure Gunicorn behavior using environment variables:

- `GUNICORN_WORKERS` or `WEB_CONCURRENCY`: Number of worker processes (default: `(2 * CPU cores) + 1`)
- `GUNICORN_WORKER_CONNECTIONS`: Max concurrent connections per worker (default: `1000`)
- `GUNICORN_TIMEOUT`: Worker timeout in seconds (default: `120`)
- `GUNICORN_KEEPALIVE`: Connection keepalive in seconds (default: `5`)
- `GUNICORN_MAX_REQUESTS`: Restart workers after N requests to prevent memory leaks (default: `10000`)
- `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
- `GUNICORN_PRELOAD`: Preload app before forking workers for memory efficiency (default: `false`)

**Important**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions. To avoid this, set `GUNICORN_PRELOAD=true` and install all dependencies with `uv sync --group unit --group test`.

**Example production configuration:**

```bash
export GUNICORN_WORKERS=8               # 8 worker processes
export GUNICORN_WORKER_CONNECTIONS=1500 # 8 * 1500 = 12,000 total concurrent capacity
export GUNICORN_PRELOAD=true            # Enable for production
llama stack run starter
```

For more details on distribution-specific configuration, see the [Starter Distribution](./self_hosted_distro/starter) or [NVIDIA Distribution](./self_hosted_distro/nvidia) documentation.

## Configure logging

Control log output via environment variables before starting the server.