# What does this PR do?
This PR updates the inline vLLM inference provider in several
significant ways:
* Models are now attached at run time to instances of the provider via
the `.../models` API instead of hard-coding the model's full name into
the provider's YAML configuration.
* The provider now supports models other than Meta Llama models. Any model
that vLLM supports can be loaded by passing its Hugging Face coordinates in
the `provider_model_id` field. Custom fine-tuned versions of Meta Llama
models can be loaded by specifying a path on local disk in
`provider_model_id`.
* To implement full chat completions support, including tool calling and
constrained decoding, the provider now routes the `chat_completions` API
to a captive instance of vLLM's OpenAI-compatible server (i.e. one that is
called directly in-process, not via HTTPS).
* The `logprobs` parameter and the completions API are now supported as well.
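
As a rough illustration of the run-time model attachment described above, the sketch below builds the kind of request body one might send to the `.../models` API. The helper `build_model_registration`, the default provider ID `"vllm"`, the fallback of `provider_model_id` to `model_id`, and the local path are all assumptions made for this example; only the `provider_model_id` field name and the Hugging Face coordinates come from this PR description.

```python
from typing import Optional


def build_model_registration(
    model_id: str,
    provider_model_id: Optional[str] = None,
    provider_id: str = "vllm",  # assumed provider ID, adjust to your run config
) -> dict:
    """Build a hypothetical request body for the `.../models` registration API."""
    return {
        "model_id": model_id,
        "provider_id": provider_id,
        # Assumption: fall back to the public name when no provider-specific
        # ID is given.
        "provider_model_id": provider_model_id or model_id,
    }


# A model referenced by its Hugging Face coordinates.
hf_model = build_model_registration(
    model_id="meta-llama/Llama-3.2-3B-Instruct",
)

# A custom fine-tune stored on local disk (the path is a made-up example).
local_model = build_model_registration(
    model_id="my-finetuned-llama",
    provider_model_id="/models/my-finetuned-llama",
)
```

The two payloads differ only in `provider_model_id`: one holds Hugging Face coordinates and the other a local path, mirroring the two loading modes listed above.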
## Test Plan
Existing tests in
`llama_stack/providers/tests/inference/test_text_inference.py` have good
coverage of the new functionality. These tests can be invoked as
follows:
```
cd llama-stack && pytest \
    -vvv \
    llama_stack/providers/tests/inference/test_text_inference.py \
    --providers inference=vllm \
    --inference-model meta-llama/Llama-3.2-3B-Instruct
====================================== test session starts ======================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items                                                                               
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%]
=========================== 9 passed, 13 warnings in 97.18s (0:01:37) ===========================
```
## Before submitting
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>

```python
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.

from pydantic import BaseModel, Field

from llama_stack.schema_utils import json_schema_type


@json_schema_type
class VLLMConfig(BaseModel):
    """Configuration for the vLLM inference provider.

    Note that the model name is no longer part of this static configuration.
    You can bind an instance of this provider to a specific model with the
    ``models.register()`` API call."""

    tensor_parallel_size: int = Field(
        default=1,
        description="Number of tensor parallel replicas (number of GPUs to use).",
    )
    max_tokens: int = Field(
        default=4096,
        description="Maximum number of tokens to generate.",
    )
    max_model_len: int = Field(default=4096, description="Maximum context length to use during serving.")
    max_num_seqs: int = Field(default=4, description="Maximum parallel batch size for generation.")
    enforce_eager: bool = Field(
        default=False,
        description="Whether to use eager mode for inference (otherwise cuda graphs are used).",
    )
    gpu_memory_utilization: float = Field(
        default=0.3,
        description=(
            "How much GPU memory will be allocated when this provider has finished "
            "loading, including memory that was already allocated before loading."
        ),
    )

    @classmethod
    def sample_run_config(cls):
        return {
            "tensor_parallel_size": "${env.TENSOR_PARALLEL_SIZE:1}",
            "max_tokens": "${env.MAX_TOKENS:4096}",
            "max_model_len": "${env.MAX_MODEL_LEN:4096}",
            "max_num_seqs": "${env.MAX_NUM_SEQS:4}",
            "enforce_eager": "${env.ENFORCE_EAGER:False}",
            "gpu_memory_utilization": "${env.GPU_MEMORY_UTILIZATION:0.3}",
        }
```
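
The values returned by `sample_run_config` use `${env.NAME:default}` placeholders. The sketch below shows one way such placeholders could be expanded, assuming the syntax means "use the environment variable `NAME` if it is set, otherwise the default after the colon". The helper `resolve_env_placeholders` and its regular expression are hypothetical and written for this example; the actual substitution logic lives elsewhere in the stack and may behave differently.

```python
import os
import re

# Matches placeholders of the form ${env.NAME:default} (assumed syntax).
_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve_env_placeholders(config: dict, environ=None) -> dict:
    """Replace each ${env.NAME:default} with environ[NAME] or the default."""
    env = os.environ if environ is None else environ

    def substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return env.get(name, default)

    # Only string values can contain placeholders; others pass through.
    return {
        key: _PLACEHOLDER.sub(substitute, value) if isinstance(value, str) else value
        for key, value in config.items()
    }


sample = {
    "tensor_parallel_size": "${env.TENSOR_PARALLEL_SIZE:1}",
    "gpu_memory_utilization": "${env.GPU_MEMORY_UTILIZATION:0.3}",
}

# Override one setting and let the other fall back to its default.
resolved = resolve_env_placeholders(sample, environ={"TENSOR_PARALLEL_SIZE": "2"})
print(resolved)
```

In this sketch the resolved values remain strings, so they would still need to be converted to the field types declared on `VLLMConfig` before use.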