llama-stack-mirror/llama_stack/providers/remote/inference/vllm
Matthew Farrellee 2be869b3ef
fix(dev): fix vllm inference recording (await models.list) (#3524)
# What does this PR do?

fix inference recording for vLLM

closes #3523 

## Test Plan

```
$ ./scripts/integration-tests.sh --stack-config server:ci-tests --setup vllm --subdirs inference --inference-mode record --pattern test_text_chat_completion_non_streaming

=== Llama Stack Integration Test Runner ===
Stack Config: server:ci-tests
Setup: vllm
Inference Mode: record
Test Suite: base
Test Subdirs: inference
Test Pattern: test_text_chat_completion_non_streaming

...

=== Applying Setup Environment Variables ===
Setting up environment variables:
export VLLM_URL='http://localhost:8000/v1'

=== Starting Llama Stack Server ===
Waiting for Llama Stack Server to start...
 Llama Stack Server started successfully

=== Running Integration Tests ===
Test subdirs to run: inference
Added test files from inference: 6 files

=== Running all collected tests in a single pytest command ===
Total test files: 6
+ pytest -s -v tests/integration/inference/test_openai_completion.py tests/integration/inference/test_batch_inference.py tests/integration/inference/test_openai_embeddings.py tests/integration/inference/test_text_inference.py tests/integration/inference/test_vision_inference.py tests/integration/inference/test_embedding.py --stack-config=server:ci-tests --inference-mode=record -k 'not( builtin_tool or safety_with_image or code_interpreter or test_rag or test_inference_store_tool_calls ) and test_text_chat_completion_non_streaming' --setup=vllm --color=yes --capture=tee-sys
INFO     2025-09-23 10:35:36,662 tests.integration.conftest:86 tests: Applying setup 'vllm'                                                           
======================================================= test session starts =======================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0 -- .../.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.12.11', 'Platform': 'Linux-6.16.7-200.fc42.x86_64-x86_64-with-glibc2.41', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'html': '4.1.1', 'anyio': '4.9.0', 'timeout': '2.4.0', 'cov': '6.2.1', 'asyncio': '1.1.0', 'nbval': '0.11.0', 'socket': '0.7.0', 'json-report': '1.5.0', 'metadata': '3.1.1'}}
rootdir: ...
configfile: pyproject.toml
plugins: html-4.1.1, anyio-4.9.0, timeout-2.4.0, cov-6.2.1, asyncio-1.1.0, nbval-0.11.0, socket-0.7.0, json-report-1.5.0, metadata-3.1.1
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 97 items / 95 deselected / 2 selected                                                                                   

tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01] 
instantiating llama_stack_client
Port 8321 is already in use, assuming server is already running...
llama_stack_client instantiated in 0.044s
PASSED [ 50%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_02] PASSED [100%]

====================================================== slowest 10 durations =======================================================
1.62s call     tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_02]
0.93s call     tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01]
0.62s setup    tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01]

(3 durations < 0.005s hidden.  Use -vv to show these durations.)
========================================== 2 passed, 95 deselected, 6 warnings in 3.26s ===========================================
+ exit_code=0
+ set +x
 All tests completed successfully
```

```
$ git status
...
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	tests/integration/recordings/responses/032f8c5a1289.json
	tests/integration/recordings/responses/c42baf6a3700.json
	tests/integration/recordings/responses/models-bd032f995f2a-fb68f5a6.json
...
```
2025-09-23 12:56:33 -04:00
..
__init__.py feat: Add dynamic authentication token forwarding support for vLLM (#3388) 2025-09-18 11:13:55 +02:00
config.py feat(registry): make the Stack query providers for model listing (#2862) 2025-07-24 10:39:53 -07:00
vllm.py fix(dev): fix vllm inference recording (await models.list) (#3524) 2025-09-23 12:56:33 -04:00