fix(dev): fix vllm inference recording (await models.list) (#3524)

mirror of https://github.com/meta-llama/llama-stack.git synced 2025-12-04 02:03:44 +00:00

# What does this PR do?

fix inference recording for vLLM

closes #3523 

## Test Plan

```
$ ./scripts/integration-tests.sh --stack-config server:ci-tests --setup vllm --subdirs inference --inference-mode record --pattern test_text_chat_completion_non_streaming

=== Llama Stack Integration Test Runner ===
Stack Config: server:ci-tests
Setup: vllm
Inference Mode: record
Test Suite: base
Test Subdirs: inference
Test Pattern: test_text_chat_completion_non_streaming

...

=== Applying Setup Environment Variables ===
Setting up environment variables:
export VLLM_URL='http://localhost:8000/v1'

=== Starting Llama Stack Server ===
Waiting for Llama Stack Server to start...
✅ Llama Stack Server started successfully

=== Running Integration Tests ===
Test subdirs to run: inference
Added test files from inference: 6 files

=== Running all collected tests in a single pytest command ===
Total test files: 6
+ pytest -s -v tests/integration/inference/test_openai_completion.py tests/integration/inference/test_batch_inference.py tests/integration/inference/test_openai_embeddings.py tests/integration/inference/test_text_inference.py tests/integration/inference/test_vision_inference.py tests/integration/inference/test_embedding.py --stack-config=server:ci-tests --inference-mode=record -k 'not( builtin_tool or safety_with_image or code_interpreter or test_rag or test_inference_store_tool_calls ) and test_text_chat_completion_non_streaming' --setup=vllm --color=yes --capture=tee-sys
INFO     2025-09-23 10:35:36,662 tests.integration.conftest:86 tests: Applying setup 'vllm'                                                           
======================================================= test session starts =======================================================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0 -- .../.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.12.11', 'Platform': 'Linux-6.16.7-200.fc42.x86_64-x86_64-with-glibc2.41', 'Packages': {'pytest': '8.4.2', 'pluggy': '1.6.0'}, 'Plugins': {'html': '4.1.1', 'anyio': '4.9.0', 'timeout': '2.4.0', 'cov': '6.2.1', 'asyncio': '1.1.0', 'nbval': '0.11.0', 'socket': '0.7.0', 'json-report': '1.5.0', 'metadata': '3.1.1'}}
rootdir: ...
configfile: pyproject.toml
plugins: html-4.1.1, anyio-4.9.0, timeout-2.4.0, cov-6.2.1, asyncio-1.1.0, nbval-0.11.0, socket-0.7.0, json-report-1.5.0, metadata-3.1.1
asyncio: mode=Mode.AUTO, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 97 items / 95 deselected / 2 selected                                                                                   

tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01] 
instantiating llama_stack_client
Port 8321 is already in use, assuming server is already running...
llama_stack_client instantiated in 0.044s
PASSED [ 50%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_02] PASSED [100%]

====================================================== slowest 10 durations =======================================================
1.62s call     tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_02]
0.93s call     tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01]
0.62s setup    tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=vllm/Qwen/Qwen3-0.6B-inference:chat_completion:non_streaming_01]

(3 durations < 0.005s hidden.  Use -vv to show these durations.)
========================================== 2 passed, 95 deselected, 6 warnings in 3.26s ===========================================
+ exit_code=0
+ set +x
✅ All tests completed successfully
```

```
$ git status
...
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	tests/integration/recordings/responses/032f8c5a1289.json
	tests/integration/recordings/responses/c42baf6a3700.json
	tests/integration/recordings/responses/models-bd032f995f2a-fb68f5a6.json
...
```

This commit is contained in:

Matthew Farrellee

2025-09-23 12:56:33 -04:00

• committed by

GitHub

parent 62e0aef7bc

commit 2be869b3ef

No known key found for this signature in database

GPG key ID: B5690EEEBB952194

2 changed files with 2 additions and 2 deletions

									
										2

llama_stack/providers/remote/inference/vllm/vllm.py
									
										View file
										
					@ -504,7 +504,7 @@ class VLLMInferenceAdapter(OpenAIMixin, LiteLLMOpenAIMixin, Inference, ModelsPro

					        except ValueError:

					        except ValueError:

					            pass  # Ignore statically unknown model, will check live listing

					            pass  # Ignore statically unknown model, will check live listing

					        try:

					        try:

					            res = await self.client.models.list()

					            res = self.client.models.list()

					        except APIConnectionError as e:

					        except APIConnectionError as e:

					            raise ValueError(

					            raise ValueError(

					                f"Failed to connect to vLLM at {self.config.url}. Please check if vLLM is running and accessible at that URL."

					                f"Failed to connect to vLLM at {self.config.url}. Please check if vLLM is running and accessible at that URL."

									
										2

tests/unit/providers/inference/test_remote_vllm.py
									
										View file
										
					@ -62,7 +62,7 @@ from llama_stack.providers.remote.inference.vllm.vllm import (

					@pytest.fixture(scope="module")

					@pytest.fixture(scope="module")

					def mock_openai_models_list():

					def mock_openai_models_list():

					    with patch("openai.resources.models.AsyncModels.list", new_callable=AsyncMock) as mock_list:

					    with patch("openai.resources.models.AsyncModels.list") as mock_list:

					        yield mock_list

					        yield mock_list

Rows
Columns

fix(dev): fix vllm inference recording (await models.list) (#3524)

2 llama_stack/providers/remote/inference/vllm/vllm.py Unescape Escape View file

2 tests/unit/providers/inference/test_remote_vllm.py Unescape Escape View file

2

llama_stack/providers/remote/inference/vllm/vllm.py

View file

2

tests/unit/providers/inference/test_remote_vllm.py

View file