Commit graph

8 commits

Author SHA1 Message Date
Ilya Kolchinsky
43d4447ff0
fix: remote vLLM tool execution now works when the last chunk contains the call arguments (#2112)
# What does this PR do?
Closes #2111.
Fixes an error causing Llama Stack to just return `<tool_call>` and
complete the turn without actually executing the tool. See the issue
description for more detail.

## Test Plan
1) Ran existing unit tests
2) Added a dedicated test verifying correct behavior in this edge case
3) Ran the code snapshot from #2111
2025-05-14 11:38:00 +02:00
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Derek Higgins
c8797f1125
fix: Including tool call in chat (#1931)
Include the tool call details with the chat when doing Rag with Remote
vllm

Fixes: #1929

With this PR the tool call is included in the chat returned to vllm, the
model (meta-llama/Llama-3.1-8B-Instruct) the returns the answer as
expected.

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-04-24 16:59:10 -07:00
Daniel Alvarez Sanchez
b5a9ef4c6d
fix: Do not send an empty 'tools' list to remote vllm (#1957)
Fixes: #1955

Since 0.2.0, the vLLM gets an empty list (vs ``None``in 0.1.9 and
before) when there are no tools configured which causes the issue
described in #1955 p. This patch avoids sending the 'tools' param to the
vLLM altogether instead of an empty list.

It also adds a small unit test to avoid regressions.

The OpenAI
[specification](https://platform.openai.com/docs/api-reference/chat/create)
does not explicitly state that the list cannot be empty but I found this
out through experimentation and it might depend on the actual remote
vllm. In any case, as this parameter is Optional, is best to skip it
altogether if there's no tools configured.

Signed-off-by: Daniel Alvarez <dalvarez@redhat.com>
2025-04-15 20:31:12 -04:00
yyymeta
d117bfe597
feat: [new open benchmark] DocVQA (#1647)
# What does this PR do?
DocVQA asks model to look a a picture, then answer a question given in
text, with a text answer by text information in the picture. these
questions often require understanding of relative positions of texts
within the picture.

original dataset is defined in the "Task1" of
https://www.docvqa.org/datasets


## Test Plan
setup llama server with 

```
llama stack run ./llama_stack/templates/open-benchmark/run.yaml
```


then send traffic:

```
 llama-stack-client eval run-benchmark "meta-reference-docvqa"  --model-id   meta-llama/Llama-3.3-70B-Instruct     --output-dir /tmp/gpqa    --num-examples   200
```
2025-03-19 14:56:14 -07:00
Yuan Tang
0bdfc71f8d
test: Bump slow_callback_duration to 200ms to avoid flaky remote vLLM unit tests (#1675)
# What does this PR do?

This avoids flaky timeout issue observed in CI builds, e.g.
3891286596

## Test Plan

Ran multiple times and pass consistently.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-17 21:33:04 -07:00
Ben Browning
d86a893ead
fix: Swap to AsyncOpenAI client in remote vllm provider (#1459)
# What does this PR do?

This switches from an OpenAI client to the AsyncOpenAI client in the
remote vllm provider. The main benefit of this is that instead of each
client call being a blocking operation that was blocking our server
event loop, the client calls are now async operations that do not block
the event loop.

The actual fix is quite simple and straightforward. Creating a reliable
reproducer of this with a unit test that verifies we were blocking the
event loop before and are not blocking it any longer was a bit harder.
Some other inference providers have this same issue, so we may want to
make that simple delayed http server a bit more generic and pull it into
a common place as other inference providers get fixed.

(Closes #1457)

## Test Plan

I verified the unit tests and test_text_inference tests pass with this
change like below:

```
python -m pytest -v tests/unit
```

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v -s \
tests/integration/inference/test_text_inference.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-03-07 14:48:00 -05:00
Ben Browning
9c4074ed49
fix: Gracefully handle no choices in remote vLLM response (#1424)
# What does this PR do?

This gracefully handles the case where the vLLM server responded to a
completion request with no choices, which can happen in certain vLLM
error situations. Previously, we'd error out with a stack trace about a
list index out of range. Now, we just log a warning to the user and move
past any chunks with an empty choices list.

A specific example of the type of stack trace this fixes:

```
  File "/app/llama-stack-source/llama_stack/providers/remote/inference/vllm/vllm.py", line 170, in _process_vllm_chat_completion_stream_response
    choice = chunk.choices[0]
             ~~~~~~~~~~~~~^^^
IndexError: list index out of range
```

Now, instead of erroring out with that stack trace, we log a warning
that vLLM failed to generate any completions and alert the user to check
the vLLM server logs for details.

This is related to #1277 and addresses the stack trace shown in that
issue, although does not in and of itself change the functional behavior
of vLLM tool calling.

## Test Plan

As part of this fix, I added new unit tests to trigger this same error
and verify it no longer happens. That is
`test_process_vllm_chat_completion_stream_response_no_choices` in the
new `tests/unit/providers/inference/test_remote_vllm.py`. I also added a
couple of more tests to trigger and verify the last couple of remote
vllm provider bug fixes - specifically a test for #1236 (builtin tool
calling) and #1325 (vLLM <= v0.6.3).

This required fixing the signature of
`_process_vllm_chat_completion_stream_response` to accept the actual
type of chunks it was getting passed - specifically changing from our
openai_compat `OpenAICompatCompletionResponse` to
`openai.types.chat.chat_completion_chunk.ChatCompletionChunk`. It was
not actually getting passed `OpenAICompatCompletionResponse` objects
before, and was using attributes that didn't exist on those objects. So,
the signature now matches the type of object it's actually passed.

Run these new unit tests like this:

```
pytest tests/unit/providers/inference/test_remote_vllm.py
```

Additionally, I ensured the existing `test_text_inference.py` tests
passed via:

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v tests/integration/inference/test_text_inference.py \
--inference-model "meta-llama/Llama-3.2-3B-Instruct" \
--vision-inference-model ""
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-03-05 15:07:54 -05:00