fix: Gracefully handle no choices in remote vLLM response (#1424)

# What does this PR do?

This gracefully handles the case where the vLLM server responds to a
completion request with no choices, which can happen in certain vLLM
error situations. Previously, we'd error out with a stack trace about a
list index being out of range. Now, we log a warning to the user and skip
past any chunks with an empty choices list.

A specific example of the type of stack trace this fixes:

```
  File "/app/llama-stack-source/llama_stack/providers/remote/inference/vllm/vllm.py", line 170, in _process_vllm_chat_completion_stream_response
    choice = chunk.choices[0]
             ~~~~~~~~~~~~~^^^
IndexError: list index out of range
```

Now, instead of erroring out with that stack trace, we log a warning
that vLLM failed to generate any completions and alert the user to check
the vLLM server logs for details.
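
Conceptually, the fix boils down to the guard sketched below. This is a simplified, hypothetical standalone version (the helper name `_skip_empty_chunks` is made up for illustration); the real check lives inside `_process_vllm_chat_completion_stream_response`, shown in the diff at the end of this page.

```
# Minimal sketch of the new behavior, not the provider's actual code:
# skip chunks with an empty choices list instead of indexing into them.
import logging
from typing import AsyncGenerator

from openai.types.chat.chat_completion_chunk import ChatCompletionChunk

log = logging.getLogger(__name__)


async def _skip_empty_chunks(  # hypothetical helper, for illustration only
    stream: AsyncGenerator[ChatCompletionChunk, None],
) -> AsyncGenerator[ChatCompletionChunk, None]:
    async for chunk in stream:
        if not chunk.choices:
            # Previously, `chunk.choices[0]` raised IndexError here.
            log.warning("vLLM returned no completions - check the vLLM server logs for details.")
            continue
        yield chunk
```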

This is related to #1277 and addresses the stack trace shown in that
issue, although it does not, in and of itself, change the functional
behavior of vLLM tool calling.

## Test Plan

As part of this fix, I added new unit tests that trigger this same error
and verify it no longer happens. That test is
`test_process_vllm_chat_completion_stream_response_no_choices` in the
new `tests/unit/providers/inference/test_remote_vllm.py`. I also added a
couple more tests to trigger and verify the last couple of remote
vLLM provider bug fixes - specifically a test for #1236 (builtin tool
calling) and #1325 (vLLM <= v0.6.3).
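
For reference, here is a rough sketch of what the empty-choices unit test looks like; the actual test in `tests/unit/providers/inference/test_remote_vllm.py` may differ in fixtures and assertions, and it assumes `pytest-asyncio` is available.

```
import pytest
from openai.types.chat.chat_completion_chunk import ChatCompletionChunk

from llama_stack.providers.remote.inference.vllm.vllm import (
    _process_vllm_chat_completion_stream_response,
)


@pytest.mark.asyncio
async def test_process_vllm_chat_completion_stream_response_no_choices():
    async def mock_stream():
        # A chunk with an empty choices list, as vLLM can emit in error situations.
        yield ChatCompletionChunk(
            id="chatcmpl-123",
            created=1,
            model="foo-model",
            object="chat.completion.chunk",
            choices=[],
        )

    # Should log a warning and yield nothing, rather than raising IndexError.
    chunks = [c async for c in _process_vllm_chat_completion_stream_response(mock_stream())]
    assert chunks == []
```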

This required fixing the signature of
`_process_vllm_chat_completion_stream_response` to accept the actual
type of chunks it was getting passed - specifically changing from our
openai_compat `OpenAICompatCompletionResponse` to
`openai.types.chat.chat_completion_chunk.ChatCompletionChunk`. It was
not actually getting passed `OpenAICompatCompletionResponse` objects
before, and was using attributes that didn't exist on those objects. So,
the signature now matches the type of object it's actually passed.
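
To illustrate (field values are made up), here is roughly what one of those chunks looks like when built from the `openai` client types, along with the kinds of attributes the stream processor reads from it:

```
from openai.types.chat.chat_completion_chunk import (
    ChatCompletionChunk,
    Choice,
    ChoiceDelta,
)

# Illustrative chunk; the values here are invented for the example.
chunk = ChatCompletionChunk(
    id="chatcmpl-123",
    created=1709664000,
    model="meta-llama/Llama-3.2-3B-Instruct",
    object="chat.completion.chunk",
    choices=[Choice(index=0, delta=ChoiceDelta(content="Hello"), finish_reason=None)],
)

# The stream processor reads attributes like these, which exist on
# ChatCompletionChunk but not on OpenAICompatCompletionResponse.
print(chunk.choices[0].delta.content, chunk.choices[0].finish_reason)
```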

Run these new unit tests like this:

```
pytest tests/unit/providers/inference/test_remote_vllm.py
```

Additionally, I ensured the existing `test_text_inference.py` tests
passed via:

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v tests/integration/inference/test_text_inference.py \
--inference-model "meta-llama/Llama-3.2-3B-Instruct" \
--vision-inference-model ""
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>

Changes to `llama_stack/providers/remote/inference/vllm/vllm.py`:

```
@@ -8,6 +8,9 @@ import logging
 from typing import AsyncGenerator, List, Optional, Union
 
 from openai import OpenAI
+from openai.types.chat.chat_completion_chunk import (
+    ChatCompletionChunk as OpenAIChatCompletionChunk,
+)
 
 from llama_stack.apis.common.content_types import (
     InterleavedContent,
@@ -49,7 +52,6 @@ from llama_stack.providers.utils.inference.model_registry import (
     build_hf_repo_model_entry,
 )
 from llama_stack.providers.utils.inference.openai_compat import (
-    OpenAICompatCompletionResponse,
     UnparseableToolCall,
     convert_message_to_openai_dict,
     convert_tool_call,
@@ -155,11 +157,14 @@ def _convert_to_vllm_finish_reason(finish_reason: str) -> StopReason:
 
 
 async def _process_vllm_chat_completion_stream_response(
-    stream: AsyncGenerator[OpenAICompatCompletionResponse, None],
+    stream: AsyncGenerator[OpenAIChatCompletionChunk, None],
 ) -> AsyncGenerator:
     event_type = ChatCompletionResponseEventType.start
     tool_call_buf = UnparseableToolCall()
     async for chunk in stream:
+        if not chunk.choices:
+            log.warning("vLLM failed to generation any completions - check the vLLM server logs for an error.")
+            continue
         choice = chunk.choices[0]
         if choice.finish_reason:
             args_str = tool_call_buf.arguments
```