# inline::vllm

## Description

vLLM inference provider for high-performance model serving with PagedAttention and continuous batching.

## Configuration

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `tensor_parallel_size` | `<class 'int'>` | No | 1 | Number of tensor parallel replicas (number of GPUs to use). |
| `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
| `max_model_len` | `<class 'int'>` | No | 4096 | Maximum context length to use during serving. |
| `max_num_seqs` | `<class 'int'>` | No | 4 | Maximum parallel batch size for generation. |
| `enforce_eager` | `<class 'bool'>` | No | False | Whether to use eager mode for inference (otherwise CUDA graphs are used). |
| `gpu_memory_utilization` | `<class 'float'>` | No | 0.3 | How much GPU memory will be allocated when this provider has finished loading, including memory that was already allocated before loading. |
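For orientation, the sketch below shows how these fields might sit under an `inline::vllm` provider entry in a distribution's `run.yaml`. The surrounding `providers` layout, the `provider_id` value, and the concrete numbers are illustrative assumptions, not part of this generated reference.

```yaml
# Hypothetical run.yaml fragment (layout, provider_id, and values assumed for illustration)
providers:
  inference:
  - provider_id: vllm-inline
    provider_type: inline::vllm
    config:
      tensor_parallel_size: 2       # shard the model across two GPUs
      max_tokens: 4096              # maximum number of tokens to generate
      max_model_len: 4096           # maximum context length to serve
      max_num_seqs: 4               # maximum parallel batch size
      enforce_eager: false          # leave CUDA graphs enabled
      gpu_memory_utilization: 0.8   # share of GPU memory the provider may allocate
```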

## Sample Configuration

```yaml
tensor_parallel_size: ${env.TENSOR_PARALLEL_SIZE:=1}
max_tokens: ${env.MAX_TOKENS:=4096}
max_model_len: ${env.MAX_MODEL_LEN:=4096}
max_num_seqs: ${env.MAX_NUM_SEQS:=4}
enforce_eager: ${env.ENFORCE_EAGER:=False}
gpu_memory_utilization: ${env.GPU_MEMORY_UTILIZATION:=0.3}
```
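
The `${env.VAR:=default}` references are environment-variable substitutions: the value after `:=` is used when the variable is unset, so the defaults above can be overridden at launch time without editing the file. As a sketch, exporting only `TENSOR_PARALLEL_SIZE=2` and `GPU_MEMORY_UTILIZATION=0.5` (hypothetical values) would resolve the sample to:

```yaml
# Illustrative resolved values with TENSOR_PARALLEL_SIZE=2 and
# GPU_MEMORY_UTILIZATION=0.5 exported; all other variables left unset
tensor_parallel_size: 2
max_tokens: 4096
max_model_len: 4096
max_num_seqs: 4
enforce_eager: False
gpu_memory_utilization: 0.5
```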