refactor: address PR feedback - improve naming, error handling, and documentation

Address all feedback from PR #3962:

**Code Quality Improvements:**
- Rename `_uvicorn_run` → `_run_server` for accurate method naming
- Refactor error handling: move Gunicorn fallback logic from `_run_with_gunicorn` to caller
- Update comments to reflect both Uvicorn and Gunicorn behavior
- Update test mock from `_uvicorn_run` to `_run_server`

**Environment Variable:**
- Change `LLAMA_STACK_DISABLE_GUNICORN` → `LLAMA_STACK_ENABLE_GUNICORN`
- More intuitive positive logic (no double negatives); see the sketch after this list
- Defaults to `true` on Unix systems
- Clearer log messages distinguishing platform limitations vs explicit disable
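
A minimal sketch of the positive-logic check, assuming the flag is read straight from the environment (the helper name and exact parsing are assumptions, not the code in this PR):

```python
import os
import sys


def gunicorn_enabled() -> bool:
    """Hypothetical helper: positive logic, defaulting to true on Unix."""
    if sys.platform == "win32":
        # Platform limitation: Gunicorn is unavailable on Windows. This is
        # distinct from an explicit LLAMA_STACK_ENABLE_GUNICORN=false, so it
        # can be logged differently.
        return False
    return os.environ.get("LLAMA_STACK_ENABLE_GUNICORN", "true").lower() == "true"
```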

**Documentation:**
- Remove unnecessary `uv sync --group unit --group test` from user docs
- Clarify SQLite limitations: "SQLite only allows one writer at a time"
- Accurate explanation: WAL mode enables concurrent reads but writes are serialized
- Strong recommendation for PostgreSQL in production with high traffic

**Architecture:**
- Better separation of concerns: `_run_with_gunicorn` just executes, caller handles fallback
- Exceptions propagate to the caller for centralized decision making; see the sketch after this list
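
Illustrative only: a sketch of the separation of concerns described above, reusing `gunicorn_enabled()` from the earlier sketch. Only `_run_with_gunicorn` and `_run_server` are names from this PR; the class and caller are assumptions:

```python
import logging

logger = logging.getLogger(__name__)


class ServerLauncher:
    """Hypothetical wrapper around the two run paths."""

    def _run_with_gunicorn(self) -> None:
        raise NotImplementedError  # just executes; errors propagate to the caller

    def _run_server(self) -> None:
        raise NotImplementedError  # single-process Uvicorn path

    def start(self) -> None:
        if gunicorn_enabled():
            try:
                self._run_with_gunicorn()
                return
            except Exception as exc:
                # The fallback decision now lives in the caller,
                # not inside _run_with_gunicorn itself.
                logger.warning("Gunicorn failed (%s); falling back to Uvicorn", exc)
        self._run_server()
```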

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: Roy Belio, 2025-11-04 16:22:12 +02:00
parent 9ff881a28a
commit 241e189fee
10 changed files with 75 additions and 63 deletions


@@ -90,7 +90,8 @@ On Unix-based systems (Linux, macOS), the server automatically uses Gunicorn wit
 **Important Notes**:
 - On Windows, the server automatically falls back to single-process Uvicorn.
-- **Database Race Condition**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions (e.g., "table already exists" errors) as multiple workers simultaneously attempt to create database tables. To avoid this issue in production, set `GUNICORN_PRELOAD=true` and ensure all dependencies are installed with `uv sync --group unit --group test`.
+- **Database Race Condition**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions (e.g., "table already exists" errors) as multiple workers simultaneously attempt to create database tables. To avoid this issue in production, set `GUNICORN_PRELOAD=true`.
+- **SQLite with Multiple Workers**: SQLite works with Gunicorn's multi-process mode for development and low-to-moderate traffic scenarios. The system automatically enables WAL (Write-Ahead Logging) mode and sets a 5-second busy timeout. However, **SQLite only allows one writer at a time** - even with WAL mode, write operations from multiple workers are serialized, causing workers to wait for database locks under concurrent write load. **For production deployments with high traffic or multiple workers, we strongly recommend using PostgreSQL or another production-grade database** for true concurrent write performance.
 **Example production configuration:**
 ```bash
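
The WAL mode and 5-second busy timeout documented in the note above map onto two SQLite pragmas. A minimal sketch with Python's built-in `sqlite3` module (the helper and database path are illustrative, not Llama Stack's actual code):

```python
import sqlite3


def open_connection(db_path: str = "llama_stack.db") -> sqlite3.Connection:
    """Hypothetical helper mirroring the documented SQLite settings."""
    conn = sqlite3.connect(db_path)
    # WAL mode: readers no longer block on the single writer.
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to 5 seconds for the write lock instead of failing
    # immediately with "database is locked".
    conn.execute("PRAGMA busy_timeout=5000")
    return conn
```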


@@ -44,7 +44,9 @@ Configure Gunicorn behavior using environment variables:
 - `GUNICORN_MAX_REQUESTS_JITTER`: Randomize worker restart timing (default: `1000`)
 - `GUNICORN_PRELOAD`: Preload app before forking workers for memory efficiency (default: `true`, as set in `run.py` line 264)
-**Important**: When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions. To avoid this, set `GUNICORN_PRELOAD=true` and install all dependencies with `uv sync --group unit --group test`.
+**Important Notes**:
+- When using multiple workers without `GUNICORN_PRELOAD=true`, you may encounter database initialization race conditions. To avoid this, set `GUNICORN_PRELOAD=true`.
+- **SQLite with Multiple Workers**: SQLite works with Gunicorn's multi-process mode for development and low-to-moderate traffic scenarios. The system automatically enables WAL (Write-Ahead Logging) mode and sets a 5-second busy timeout. However, **SQLite only allows one writer at a time** - even with WAL mode, write operations from multiple workers are serialized, causing workers to wait for database locks under concurrent write load. **For production deployments with high traffic or multiple workers, we strongly recommend using PostgreSQL or another production-grade database** for true concurrent write performance.
 **Example production configuration:**
 ```bash
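
To see the single-writer limitation in action, here is a small self-contained demo (file and table names are hypothetical) where threads stand in for Gunicorn workers; all inserts succeed, but they are serialized on the write lock:

```python
import sqlite3
import threading

DB_PATH = "demo.db"  # illustrative path

# One-time setup, analogous to what a preloaded master process would do.
setup = sqlite3.connect(DB_PATH)
setup.execute("PRAGMA journal_mode=WAL")  # persistent; applies to all connections
setup.execute("CREATE TABLE IF NOT EXISTS events (worker INTEGER)")
setup.commit()
setup.close()


def worker(worker_id: int) -> None:
    conn = sqlite3.connect(DB_PATH)
    conn.execute("PRAGMA busy_timeout=5000")  # wait for the lock, don't error
    with conn:  # even under WAL, only one writer commits at a time
        conn.execute("INSERT INTO events (worker) VALUES (?)", (worker_id,))
    conn.close()


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```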


@@ -1,7 +1,7 @@
 ---
 description: "Agents
-APIs for creating and interacting with agentic systems."
+APIs for creating and interacting with agentic systems."
 sidebar_label: Agents
 title: Agents
 ---
@@ -12,6 +12,6 @@ title: Agents
 Agents
-APIs for creating and interacting with agentic systems.
+APIs for creating and interacting with agentic systems.
 This section contains documentation for all available providers for the **agents** API.


@@ -1,14 +1,14 @@
 ---
 description: "The Batches API enables efficient processing of multiple requests in a single operation,
-particularly useful for processing large datasets, batch evaluation workflows, and
-cost-effective inference at scale.
+particularly useful for processing large datasets, batch evaluation workflows, and
+cost-effective inference at scale.
-The API is designed to allow use of openai client libraries for seamless integration.
+The API is designed to allow use of openai client libraries for seamless integration.
-This API provides the following extensions:
-- idempotent batch creation
+This API provides the following extensions:
+- idempotent batch creation
-Note: This API is currently under active development and may undergo changes."
+Note: This API is currently under active development and may undergo changes."
 sidebar_label: Batches
 title: Batches
 ---
@@ -18,14 +18,14 @@ title: Batches
 ## Overview
 The Batches API enables efficient processing of multiple requests in a single operation,
-particularly useful for processing large datasets, batch evaluation workflows, and
-cost-effective inference at scale.
+particularly useful for processing large datasets, batch evaluation workflows, and
+cost-effective inference at scale.
-The API is designed to allow use of openai client libraries for seamless integration.
+The API is designed to allow use of openai client libraries for seamless integration.
-This API provides the following extensions:
-- idempotent batch creation
+This API provides the following extensions:
+- idempotent batch creation
-Note: This API is currently under active development and may undergo changes.
+Note: This API is currently under active development and may undergo changes.
 This section contains documentation for all available providers for the **batches** API.
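
Since the description says the API is designed for openai client libraries, a hedged usage sketch (the base URL, API key, and file id are assumptions, not values from these docs):

```python
from openai import OpenAI

# Point the standard OpenAI client at a Llama Stack server (URL assumed).
client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

batch = client.batches.create(
    input_file_id="file-abc123",        # a previously uploaded JSONL file
    endpoint="/v1/chat/completions",    # endpoint each batch line targets
    completion_window="24h",
)
print(batch.id, batch.status)
```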


@@ -1,7 +1,7 @@
 ---
 description: "Evaluations
-Llama Stack Evaluation API for running evaluations on model and agent candidates."
+Llama Stack Evaluation API for running evaluations on model and agent candidates."
 sidebar_label: Eval
 title: Eval
 ---
@@ -12,6 +12,6 @@ title: Eval
 Evaluations
-Llama Stack Evaluation API for running evaluations on model and agent candidates.
+Llama Stack Evaluation API for running evaluations on model and agent candidates.
 This section contains documentation for all available providers for the **eval** API.


@@ -1,7 +1,7 @@
 ---
 description: "Files
-This API is used to upload documents that can be used with other Llama Stack APIs."
+This API is used to upload documents that can be used with other Llama Stack APIs."
 sidebar_label: Files
 title: Files
 ---
@@ -12,6 +12,6 @@ title: Files
 Files
-This API is used to upload documents that can be used with other Llama Stack APIs.
+This API is used to upload documents that can be used with other Llama Stack APIs.
 This section contains documentation for all available providers for the **files** API.


@@ -1,12 +1,12 @@
 ---
 description: "Inference
-Llama Stack Inference API for generating completions, chat completions, and embeddings.
+Llama Stack Inference API for generating completions, chat completions, and embeddings.
-This API provides the raw interface to the underlying models. Three kinds of models are supported:
-- LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.
-- Embedding models: these models generate embeddings to be used for semantic search.
-- Rerank models: these models reorder the documents based on their relevance to a query."
+This API provides the raw interface to the underlying models. Three kinds of models are supported:
+- LLM models: these models generate \"raw\" and \"chat\" (conversational) completions.
+- Embedding models: these models generate embeddings to be used for semantic search.
+- Rerank models: these models reorder the documents based on their relevance to a query."
 sidebar_label: Inference
 title: Inference
 ---
@@ -17,11 +17,11 @@ title: Inference
 Inference
-Llama Stack Inference API for generating completions, chat completions, and embeddings.
+Llama Stack Inference API for generating completions, chat completions, and embeddings.
-This API provides the raw interface to the underlying models. Three kinds of models are supported:
-- LLM models: these models generate "raw" and "chat" (conversational) completions.
-- Embedding models: these models generate embeddings to be used for semantic search.
-- Rerank models: these models reorder the documents based on their relevance to a query.
+This API provides the raw interface to the underlying models. Three kinds of models are supported:
+- LLM models: these models generate "raw" and "chat" (conversational) completions.
+- Embedding models: these models generate embeddings to be used for semantic search.
+- Rerank models: these models reorder the documents based on their relevance to a query.
 This section contains documentation for all available providers for the **inference** API.
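
Two of the three model kinds above map onto familiar OpenAI-style calls; a hedged sketch against a Llama Stack server (base URL, API key, and model names are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

# LLM model: a "chat" (conversational) completion.
chat = client.chat.completions.create(
    model="example-llm",
    messages=[{"role": "user", "content": "Say hello."}],
)

# Embedding model: vectors to be used for semantic search.
emb = client.embeddings.create(model="example-embedding", input="hello world")

# (Rerank models are the third kind; the standard OpenAI client has no
# rerank call, so they are omitted from this sketch.)
print(chat.choices[0].message.content, len(emb.data[0].embedding))
```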


@@ -1,7 +1,7 @@
 ---
 description: "Safety
-OpenAI-compatible Moderations API."
+OpenAI-compatible Moderations API."
 sidebar_label: Safety
 title: Safety
 ---
@@ -12,6 +12,6 @@ title: Safety
 Safety
-OpenAI-compatible Moderations API.
+OpenAI-compatible Moderations API.
 This section contains documentation for all available providers for the **safety** API.
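
Because the page describes an OpenAI-compatible Moderations API, a hedged sketch using the standard client (base URL, API key, and model are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1", api_key="none")

# OpenAI-compatible moderations call routed to a safety provider.
result = client.moderations.create(
    model="example-safety-model",  # placeholder; provider-specific
    input="Is this content safe?",
)
print(result.results[0].flagged)
```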