This adds initial streaming support to the Responses API. This PR makes sure that the _first_ inference call made to chat completions streams out. There's more to be done:

- tool call output tokens need to stream out when possible
- we need to loop through multiple rounds of inference, and they all need to stream out

## Test Plan

Added a test. Executed as:

```
FIREWORKS_API_KEY=... \
  pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
  --provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```

Then, started a llama stack fireworks distro and tested against it like this:

```
OPENAI_API_KEY=blah \
  pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
  --base-url http://localhost:8321/v1/openai/v1 \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
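For reference, here is a minimal sketch of how a client might consume the streamed events, assuming a stack distro is running at the base URL from the test plan above and the `openai` Python client is installed (the model name, API key, and prompt are placeholders, not part of this PR):

```python
from openai import OpenAI

# Point the OpenAI client at a locally running llama stack distro
# (base URL and dummy API key taken from the test plan above).
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="blah")

# Request a streamed response; events arrive incrementally instead of
# waiting for the full completion.
stream = client.responses.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    input="Write a haiku about streaming.",
    stream=True,
)

# Print output text deltas as they arrive.
for event in stream:
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()
```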