# inline::vllm

## Description
vLLM inference provider for high-performance model serving with PagedAttention and continuous batching.
## Configuration
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `tensor_parallel_size` | `int` | No | 1 | Number of tensor parallel replicas (number of GPUs to use). |
| `max_tokens` | `int` | No | 4096 | Maximum number of tokens to generate per request. |
| `max_model_len` | `int` | No | 4096 | Maximum context length to use during serving. |
| `max_num_seqs` | `int` | No | 4 | Maximum parallel batch size for generation. |
| `enforce_eager` | `bool` | No | False | Whether to use eager mode for inference (otherwise CUDA graphs are used). |
| `gpu_memory_utilization` | `float` | No | 0.3 | Fraction of GPU memory to be allocated once this provider has finished loading, including memory that was already allocated before loading. |
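These settings map closely onto vLLM's own engine arguments. The sketch below is an illustration of what the knobs control rather than the provider's actual wiring: it passes the same values directly to vLLM's `LLM` and `SamplingParams` classes, and the model name is a placeholder.

```python
from vllm import LLM, SamplingParams

# Illustrative only: the inline::vllm provider configures the engine internally.
# The model name is a placeholder, not something this provider requires here.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,       # tensor_parallel_size
    max_model_len=4096,           # max_model_len
    max_num_seqs=4,               # max_num_seqs
    enforce_eager=False,          # enforce_eager (False -> CUDA graphs)
    gpu_memory_utilization=0.3,   # gpu_memory_utilization
)

# max_tokens corresponds to the per-request generation cap.
params = SamplingParams(max_tokens=4096)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```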
## Sample Configuration

```yaml
tensor_parallel_size: ${env.TENSOR_PARALLEL_SIZE:=1}
max_tokens: ${env.MAX_TOKENS:=4096}
max_model_len: ${env.MAX_MODEL_LEN:=4096}
max_num_seqs: ${env.MAX_NUM_SEQS:=4}
enforce_eager: ${env.ENFORCE_EAGER:=False}
gpu_memory_utilization: ${env.GPU_MEMORY_UTILIZATION:=0.3}
```
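In a distribution's run config, this block sits under the inference providers. The entry below is a sketch of that placement; the `provider_id` value is chosen arbitrarily.

```yaml
providers:
  inference:
  - provider_id: vllm
    provider_type: inline::vllm
    config:
      tensor_parallel_size: ${env.TENSOR_PARALLEL_SIZE:=1}
      max_tokens: ${env.MAX_TOKENS:=4096}
      max_model_len: ${env.MAX_MODEL_LEN:=4096}
      max_num_seqs: ${env.MAX_NUM_SEQS:=4}
      enforce_eager: ${env.ENFORCE_EAGER:=False}
      gpu_memory_utilization: ${env.GPU_MEMORY_UTILIZATION:=0.3}
```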