This adds the vLLM-specific extra_body parameters of prompt_logprobs
and guided_choice to our openai_completion inference endpoint. The
plan here would be to expand this to support all common optional
parameters of any of the OpenAI providers, allowing each provider to
use or ignore these parameters based on whether their server supports them.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
The OpenAI completion API supports strings, array of strings, array of
tokens, or array of token arrays. So, expand our type hinting to
support all of these types.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
The OpenAI completion prompt field can be a string or an array, so
update things to use and pass that properly.
This also stubs in a basic conversion of OpenAI non-streaming
completion requests to Llama Stack completion calls, for those
providers that don't actually have an OpenAI backend to allow them to
still accept requests via the OpenAI APIs.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
current passthrough impl returns chatcompletion_message.content as a
TextItem() , not a straight string. so it's not compatible with other
providers, and causes parsing error downstream.
change away from the generic pydantic conversion, and explicitly parse
out content.text
## Test Plan
setup llama server with passthrough
```
llama-stack-client eval run-benchmark "MMMU_Pro_standard" --model-id meta-llama/Llama-3-8B --output-dir /tmp/ --num-examples 20
```
works without parsing error
## What does this PR do?
We noticed that the passthrough inference provider doesn't work agent
due to the type mis-match between client and server. We manually cast
the llama stack client type to llama stack server type to fix the issue.
## test
run `python -m examples.agents.hello localhost 8321` within
llama-stack-apps
<img width="1073" alt="Screenshot 2025-03-11 at 8 43 44 PM"
src="https://github.com/user-attachments/assets/bd1bdd31-606a-420c-a249-95f6184cc0b1"
/>
fix https://github.com/meta-llama/llama-stack/issues/1560
# What does this PR do?
The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling SamplingParams() directly in function argument defaults.
Instead, it either uses Field(default_factory=SamplingParams) for
Pydantic models or sets the default to None and instantiates
SamplingParams inside the function body when the argument is None.
Signed-off-by: Sébastien Han <seb@redhat.com>
## What does this PR do?
In this PR, we implement a passthrough inference provider that works for
any endpoints that respect llama stack inference API definition.
## Test Plan
config some endpoint that respect llama stack inference API definition
and got the inference results successfully
<img width="1268" alt="Screenshot 2025-02-19 at 8 52 51 PM"
src="https://github.com/user-attachments/assets/447816e4-ea7a-4365-b90c-386dc7dcf4a1"
/>