# What does this PR do?
Adds a new endpoint that is compatible with the OpenAI embeddings API:
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer.
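For illustration, a hedged usage sketch that exercises the new endpoint through the standard OpenAI client pointed at a running Llama Stack server (base URL matches the test plan below; the API key is a dummy to satisfy the client):
```python
# Hedged sketch: call the new OpenAI-compatible embeddings endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")  # dummy key
resp = client.embeddings.create(
    model="all-MiniLM-L6-v2",  # any embedding model registered with the stack
    input=["Llama Stack is a framework for building generative AI applications"],
)
print(len(resp.data[0].embedding))
```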
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
# What does this PR do?
Follow-up to https://github.com/meta-llama/llama-stack/pull/2287. We must
use `--group` when running commands with uv.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The providers list is missing post_training. Add that column and list
`HuggingFace`, `TorchTune`, and `NVIDIA NEMO` as supported providers.
Also point to these providers in docs/source/providers/index.md and
describe their basic functionality.
There are other missing provider types here as well, but this starts with
post_training.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
This adds initial streaming support to the Responses API.
This PR makes sure that the _first_ inference call made to chat
completions streams out.
There's more to be done:
- tool call output tokens need to stream out when possible
- we need to loop through multiple rounds of inference and they all need
to stream out.
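For reference, a hedged sketch of exercising the new streaming path via a recent OpenAI client pointed at a running stack (base URL as in the second test plan below); event names are the standard Responses streaming event types:
```python
# Hedged sketch: stream a Response from a Llama Stack server via the OpenAI client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="blah")
stream = client.responses.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    input="Tell me a short story",
    stream=True,
)
for event in stream:
    # e.g. response.created, response.output_text.delta, ..., response.completed
    print(event.type)
```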
## Test Plan
Added a test. Executed as:
```
FIREWORKS_API_KEY=... \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Then, started a llama stack fireworks distro and tested against it like
this:
```
OPENAI_API_KEY=blah \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
When registering an MCP endpoint, we cannot list tools (like we used to)
since the MCP endpoint may be behind an auth wall. Registration can
happen much sooner (via run.yaml).
Instead, we list tools only when the _user_ actually asks for a listing.
Furthermore, we cache the list in-memory in the server. Currently, the
cache is not invalidated -- we may want to periodically re-list for MCP
servers. Note that callers must call `list_tools` before calling
`invoke_tool` -- we rely on this.
This will enable us to list MCP servers in run.yaml.
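To illustrate the lazy list-and-cache flow, a hypothetical sketch (class and method shapes here are illustrative, not the actual implementation):
```python
# Hypothetical sketch of the lazy listing + in-memory cache described above.
class MCPToolGroup:
    def __init__(self, endpoint: str):
        self.endpoint = endpoint
        self._tool_cache: dict[str, dict] | None = None  # filled on first list_tools()

    async def _fetch_tools_from_mcp(self) -> list[dict]:
        # placeholder for the actual MCP call (may require auth headers)
        raise NotImplementedError

    async def list_tools(self) -> list[dict]:
        # listing is deferred until a user asks for it, so registration via run.yaml
        # can happen even when the MCP endpoint sits behind an auth wall
        if self._tool_cache is None:
            tools = await self._fetch_tools_from_mcp()
            self._tool_cache = {t["name"]: t for t in tools}  # never invalidated today
        return list(self._tool_cache.values())

    async def invoke_tool(self, name: str, args: dict) -> dict:
        # callers must have called list_tools() first -- the cached definition is relied on here
        if self._tool_cache is None:
            raise RuntimeError("list_tools() must be called before invoke_tool()")
        tool = self._tool_cache[name]
        return {"tool": tool["name"], "args": args}  # placeholder for the actual invocation
```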
## Test Plan
Existing tests, updated tests accordingly.
Getting this error from PyPI of late:
```
'python-requests/2.32.3 User-Agents are currently blocked from accessing JSON release resources. A cluster is apparently crawling all project/release resources resulting in excess cache misses. Please contact admin@pypi.org if you have information regarding what this software may be.'
```
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced SQL store in place of the KV store.
## Test Plan
Added integration/unit tests.
# What does this PR do?
Includes a SambaNova safety adapter that uses the SambaNova cloud-served
Meta-Llama-Guard-3-8B.
Also makes minor updates to the SambaNova docs.
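A hedged usage sketch with the llama-stack client (exact parameter shapes may differ slightly from the released client):
```python
# Hedged sketch: run the SambaNova-served Llama Guard shield against a user message.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
result = client.safety.run_shield(
    shield_id="sambanova/Meta-Llama-Guard-3-8B",
    messages=[{"role": "user", "content": "Tell me how to hotwire a car"}],
    params={},
)
print(result.violation)
```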
## Test Plan
```
pytest -s -v tests/integration/safety/test_safety.py \
  --stack-config=sambanova --safety-shield=sambanova/Meta-Llama-Guard-3-8B
```
# What does this PR do?
This PR introduces support for keyword-based FTS5 search with BM25
relevance scoring. It makes changes to the existing `EmbeddingIndex` base
class in order to support `search_mode` and `query_str` parameters, which
can be used for keyword-based search implementations.
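As a self-contained illustration of the keyword path (plain `sqlite3`, not the provider's actual code):
```python
# Hedged sketch: FTS5 keyword search with BM25 ranking, as used by the keyword search mode.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks_fts USING fts5(chunk_id, content)")
conn.executemany(
    "INSERT INTO chunks_fts VALUES (?, ?)",
    [
        ("9fbc39ce", "Sentence 0 from document 0"),
        ("9faf986a", "Sentence 5 from document 0"),
        ("b678a463", "Sentence 5 from document 1"),
    ],
)
# bm25() returns a relevance score where lower (more negative) means a better match
rows = conn.execute(
    "SELECT chunk_id, content, bm25(chunks_fts) AS score "
    "FROM chunks_fts WHERE chunks_fts MATCH ? ORDER BY score LIMIT 3",
    ('"Sentence 5"',),
).fetchall()
for chunk_id, content, score in rows:
    print(chunk_id, content, score)
```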
## Test Plan
run
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
Output:
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
====================================================== test session starts =======================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.4-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0
asyncio: mode=auto, asyncio_default_fixture_loop_scope=None
collected 7 items
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_fts PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_register_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_unregister_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
```
For reference, with this implementation the FTS table looks like the following:
```
Chunk ID: 9fbc39ce-c729-64a2-260f-c5ec9bb2a33e, Content: Sentence 0 from document 0
Chunk ID: 94062914-3e23-44cf-1e50-9e25821ba882, Content: Sentence 1 from document 0
Chunk ID: e6cfd559-4641-33ba-6ce1-7038226495eb, Content: Sentence 2 from document 0
Chunk ID: 1383af9b-f1f0-f417-4de5-65fe9456cc20, Content: Sentence 3 from document 0
Chunk ID: 2db19b1a-de14-353b-f4e1-085e8463361c, Content: Sentence 4 from document 0
Chunk ID: 9faf986a-f028-7714-068a-1c795e8f2598, Content: Sentence 5 from document 0
Chunk ID: ef593ead-5a4a-392f-7ad8-471a50f033e8, Content: Sentence 6 from document 0
Chunk ID: e161950f-021f-7300-4d05-3166738b94cf, Content: Sentence 7 from document 0
Chunk ID: 90610fc4-67c1-e740-f043-709c5978867a, Content: Sentence 8 from document 0
Chunk ID: 97712879-6fff-98ad-0558-e9f42e6b81d3, Content: Sentence 9 from document 0
Chunk ID: aea70411-51df-61ba-d2f0-cb2b5972c210, Content: Sentence 0 from document 1
Chunk ID: b678a463-7b84-92b8-abb2-27e9a1977e3c, Content: Sentence 1 from document 1
Chunk ID: 27bd63da-909c-1606-a109-75bdb9479882, Content: Sentence 2 from document 1
Chunk ID: a2ad49ad-f9be-5372-e0c7-7b0221d0b53e, Content: Sentence 3 from document 1
Chunk ID: cac53bcd-1965-082a-c0f4-ceee7323fc70, Content: Sentence 4 from document 1
```
Query results:
Result 1: Sentence 5 from document 0
Result 2: Sentence 5 from document 1
Result 3: Sentence 5 from document 2
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
* remove requirements.txt to use pyproject.toml as the source of truth
* update relevant docs
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The `cache_ttl` config value is not in fact tied to the lifetime of any of
the keys; it represents the interval at which our key cache refresher runs.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Kubernetes since 1.20 exposes a JWKS endpoint that we can use with our
recent OAuth2 implementation.
The CI test has been kept intact for validation.
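A hedged sketch of the validation flow, assuming PyJWT is used (the actual provider code may differ); Kubernetes serves the JWKS document at `/openid/v1/jwks`:
```python
# Hedged sketch: validate a service-account token against the cluster's JWKS endpoint.
import jwt  # PyJWT

jwks_client = jwt.PyJWKClient("https://kubernetes.default.svc/openid/v1/jwks")

def validate(token: str) -> dict:
    signing_key = jwks_client.get_signing_key_from_jwt(token)
    return jwt.decode(
        token,
        signing_key.key,
        algorithms=["RS256"],
        audience="https://kubernetes.default.svc",  # assumed audience; set per cluster config
    )
```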
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
feat(quota): add server-side per-client request quotas (requires auth)
Unrestricted usage can lead to runaway costs and fragmented client-side
workarounds. This commit introduces a native quota mechanism to the
server, giving operators a unified, centrally managed throttle for
per-client requests—without needing extra proxies or custom client
logic. This helps contain cloud-compute expenses, enables fine-grained
usage control, and simplifies deployment and monitoring of Llama Stack
services. Quotas are fully opt-in and have no effect unless explicitly
configured.
Note that quotas are fully opt-in and require authentication to be
enabled. `sqlite` is the only supported quota `type` at this time; any
other `type` will be rejected. And the only supported `period` is `day`.
Highlights:
- Adds `QuotaMiddleware` to enforce per-client request quotas:
  - Uses `Authorization: Bearer <client_id>` (from `AuthenticationMiddleware`)
  - Tracks usage via a SQLite-based KV store
  - Returns 429 when the quota is exceeded
- Extends `ServerConfig` with a `quota` section (type + config)
- Enforces strict coupling: quotas require authentication or the server will fail to start
Behavior changes:
- Quotas are disabled by default unless explicitly configured
- SQLite defaults to `./quotas.db` if no DB path is set
- The server requires authentication when quotas are enabled
To enable per-client request quotas in `run.yaml`, add:
```
server:
  port: 8321
  auth:
    provider_type: "custom"
    config:
      endpoint: "https://auth.example.com/validate"
  quota:
    type: sqlite
    config:
      db_path: ./quotas.db
      limit:
        max_requests: 1000
        period: day
```
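For illustration only, a minimal ASGI sketch of the behavior described in the highlights above (an in-memory counter stands in for the SQLite-backed KV store, and Starlette is assumed for the 429 response; this is not the actual implementation):
```python
# Hedged sketch of a per-client, per-day request quota middleware.
from datetime import date

from starlette.responses import JSONResponse


class QuotaMiddleware:
    def __init__(self, app, max_requests: int = 1000):
        self.app = app
        self.max_requests = max_requests
        self.counts: dict[tuple[str, str], int] = {}  # stand-in for the SQLite KV store

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)
        headers = dict(scope.get("headers") or [])
        auth = headers.get(b"authorization", b"").decode()
        client_id = auth.removeprefix("Bearer ").strip() or "anonymous"
        key = (client_id, date.today().isoformat())  # only the 'day' period is supported
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] > self.max_requests:
            response = JSONResponse({"error": "quota exceeded"}, status_code=429)
            return await response(scope, receive, send)
        return await self.app(scope, receive, send)
```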
Closes #2093
## Test Plan
Signed-off-by: Wen Liang <wenliang@redhat.com>
Co-authored-by: Wen Liang <wenliang@redhat.com>
# What does this PR do?
Adds a `llama stack rm` command, e.g.:
```
llama stack rm llamastack-test
```
#225
## Test Plan
# What does this PR do?
Add support for "instructions" to the responses API. Instructions
provide a way to swap out system (or developer) messages in new
responses.
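A hedged usage sketch via the OpenAI client against a running stack (model and URL are placeholders):
```python
# Hedged sketch: "instructions" swaps in the system/developer message for this response.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
resp = client.responses.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    instructions="You are a terse assistant. Answer in one sentence.",
    input="What is Llama Stack?",
)
print(resp.output_text)
```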
## Test Plan
unit tests added
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This PR introduces APIs to retrieve past chat completion requests, which
will be used in the LS UI.
Our current `Telemetry` is ill-suited for this purpose as it's untyped
so we'd need to filter by obscure attribute names, making it brittle.
Since these APIs are 'provided by stack' and don't need to be
implemented by inference providers, we introduce a new InferenceProvider
class, containing the existing inference protocol, which is implemented
by inference providers.
The APIs are OpenAI-compliant, with an additional `input_messages`
field.
## Test Plan
This PR just adds the APIs and marks them as provided_by_stack.
Start stack server -> doesn't crash
# What does this PR do?
Adds an inline HF SFTTrainer provider. Alongside torchtune, this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc.
The provider comes with one recipe, `finetune_single_device`, which works
both with and without LoRA.
Any model that is a valid HF identifier can be given and the model will
be pulled.
This has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box.
The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training.
If `checkpoint_dir` is None, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.
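A rough sketch of what the `finetune_single_device` recipe does under the hood with HF TRL; the dataset, arguments, and values here are illustrative, not the provider's exact code:
```python
# Hedged sketch: single-device SFT with TRL, roughly mirroring what the recipe sets up.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# tiny in-memory dataset; the provider instead formats the configured dataset properly
train_dataset = Dataset.from_dict(
    {"text": ["Q: What is Llama Stack?\nA: A framework for building generative AI apps."]}
)
config = SFTConfig(
    output_dir="./checkpoints",     # only written when a checkpoint_dir is configured
    num_train_epochs=1,
    per_device_train_batch_size=1,
    save_steps=100,
    use_cpu=True,                   # CPU/MPS tested; CUDA expected to work out of the box
)
trainer = SFTTrainer(
    model="ibm-granite/granite-3.3-2b-instruct",  # any valid HF identifier gets pulled
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```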
## Test Plan
Re-enabled the post_training integration test suite with a single test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API,
and runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
start_stack.sh was using `--yaml-config`, which is deprecated.
A bunch of distro docs also mentioned `--yaml-config`. Replaces all
instances of, and logic for, `--yaml-config` with `--config`.
Resolves #2189
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
It may not always be desirable to listen on all interfaces, which is the
default. As an example, by listening instead only on a loopback
interface, the server cannot be reached except from within the host it
is run on. This PR makes this configurable, through a CLI option, an env
var, or an entry in the config file.
## Test Plan
I ran a server with and without the added CLI argument to verify that
the argument is used if provided, but the default is as it was before if
not.
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
We added:
* a check that docstrings are present with 'params' and 'returns'
* a failure if someone sets 'returns: None'
* fixes for the failing APIs
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Currently the "default" dir for external providers is
`/etc/llama-stack/providers.d`.
This dir is not used anywhere, nor is it created.
Switch to a more friendly `~/.llama/providers.d/`.
This allows external providers to actually create this dir and/or
populate it upon installation; `pip` cannot create directories under `/etc`.
If a user does not specify a dir, default to this one.
see https://github.com/containers/ramalama-stack/issues/36
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Currently the website only displays the "latest" version. This is
because our config and workflow do not include version information. This
PR adds missing version info.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
This PR allows users to customize the template used for chunks when they
are inserted into the context. Additionally, this enables metadata
injection into the context of an LLM for RAG. This makes a naive and
crude assumption that each chunk should include the metadata; this is
obviously redundant when multiple chunks are returned from the same
document. In order to remove any sort of duplication of chunks, we'd
have to make much more significant changes, so this is a reasonable first
step that unblocks users requesting this enhancement in
https://github.com/meta-llama/llama-stack/issues/1767.
In the future, this can be extended to support citations.
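For illustration, a hedged example of the new knob (import path per the change list below; `{index}` and `{chunk.content}` are required by the validator, while the `{metadata}` placeholder is assumed here):
```python
# Hedged sketch: customize how each retrieved chunk is rendered into the LLM context.
from llama_stack.apis.tools.rag_tool import RAGQueryConfig

query_config = RAGQueryConfig(
    chunk_template=(
        "Result {index}\n"
        "Content: {chunk.content}\n"
        "Metadata: {metadata}\n"
    ),
)
```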
List of Changes:
- `llama_stack/apis/tools/rag_tool.py`
  - Added `chunk_template` field in `RAGQueryConfig`.
  - Added `field_validator` to validate the `chunk_template` field in `RAGQueryConfig`.
  - Ensured the `chunk_template` field includes placeholders `{index}` and `{chunk.content}`.
  - Updated the `query` method to use the `chunk_template` for formatting chunk text content.
- `llama_stack/providers/inline/tool_runtime/rag/memory.py`
  - Modified the `insert` method to pass `doc.metadata` for chunk creation.
  - Enhanced the `query` method to format results using `chunk_template` and exclude unnecessary metadata fields like `token_count`.
- `llama_stack/providers/utils/memory/vector_store.py`
  - Updated `make_overlapped_chunks` to include metadata serialization and token count for both content and metadata.
  - Added error handling for metadata serialization issues.
- `pyproject.toml`
  - Added `pydantic.field_validator` as a recognized `classmethod` decorator in the linting configuration.
- `tests/integration/tool_runtime/test_rag_tool.py`
  - Refactored test assertions to separate `assert_valid_chunk_response` and `assert_valid_text_response`.
  - Added integration tests to validate `chunk_template` functionality with and without metadata inclusion.
  - Included a test case to ensure `chunk_template` validation errors are raised appropriately.
- `tests/unit/rag/test_vector_store.py`
  - Added unit tests for `make_overlapped_chunks`, verifying chunk creation with overlapping tokens and metadata integrity.
  - Added tests to handle metadata serialization errors, ensuring proper exception handling.
- `docs/_static/llama-stack-spec.html`
  - Added a new `chunk_template` field of type `string` with a default template for formatting retrieved chunks in `RAGQueryConfig`.
  - Updated the `required` fields to include `chunk_template`.
- `docs/_static/llama-stack-spec.yaml`
  - Introduced `chunk_template` field with a default value for `RAGQueryConfig`.
  - Updated the required configuration list to include `chunk_template`.
- `docs/source/building_applications/rag.md`
  - Documented the `chunk_template` configuration, explaining how to customize metadata formatting in RAG queries.
  - Added examples demonstrating the usage of the `chunk_template` field in RAG tool queries.
  - Highlighted default values for `RAG` agent configurations.
Resolves https://github.com/meta-llama/llama-stack/issues/1767
## Test Plan
Updated both `test_vector_store.py` and `test_rag_tool.py` and tested
end-to-end with a script.
I also tested this against the quickstart and specified this metadata:
```python
document = RAGDocument(
    document_id="document_1",
    content=source,
    mime_type="text/html",
    metadata={"author": "Paul Graham", "title": "How to do great work"},
)
```
Which produced the output below:

This highlights the usefulness of the additional metadata. Notice how
the metadata is redundant for different chunks of the same document. I
think we can update that in a subsequent PR.
## Documentation
I've added a brief comment about this in the documentation to outline
this to users and updated the API documentation.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
The provider selling point *is* using the same provider for both.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
This is a combination of what was previously 3 separate PRs - #2069,
#2075, and #2083. It turns out all 3 of those are needed to land a
working function calling Responses implementation. The web search
builtin tool was already working, but this wires in support for custom
function calling.
I ended up combining all three into one PR because they all had lots of
merge conflicts, both with each other and with #1806, which just landed.
Also, landing any of them individually would have left only a partially
working implementation merged.
The new things added here are:
* Storing of input items from previous responses and restoring of those
input items when adding previous responses to the conversation state
* Handling of multiple input item message roles, not just "user"
messages.
* Support for custom tools passed into the Responses API to enable
function calling outside of just the builtin websearch tool (see the
sketch below).
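A hedged sketch of the custom-tool path, using the Responses-style function tool schema (the model and base URL below are placeholders):
```python
# Hedged sketch: pass a custom function tool to the Responses API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
resp = client.responses.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    input="What is the weather like in Paris today?",
    tools=[
        {
            "type": "function",
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
)
# expect a function_call output item for get_weather rather than a final answer
print([item.type for item in resp.output])
```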
Closes #2074, closes #2080
## Test Plan
### Unit Tests
Several new unit tests were added, and they all pass. Ran via:
```
python -m pytest -s -v tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
### Responses API Verification Tests
I ran our verification run.yaml against multiple providers to ensure we
were getting a decent pass rate. Specifically, I ensured the new custom
tool verification test passed across multiple providers and that the
multi-turn examples passed across at least some of the providers (some
providers struggle with the multi-turn workflows still).
Running the stack setup for verification testing:
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Together, passing 100% as an example:
```
pytest -s -v 'tests/verifications/openai_api/test_responses.py' --provider=together-llama-stack
```
## Documentation
We will need to start documenting the OpenAI APIs, but for now the
Responses stuff is still rapidly evolving, so we're delaying that.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Reduces duplication and centralizes information so it is easier for
contributors to find.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
This new check will fail if some webmethods are missing the ellipsis:
```
API Method Return Type Validation Errors:
Method Api.eval.job_result does not contain ellipsis (...) in its implementation
Method Api.agents.create_agent_turn does not contain ellipsis (...) in its implementation
Method Api.agents.create_openai_response does not contain ellipsis (...) in its implementation
Method Api.eval.evaluate_rows does not contain ellipsis (...) in its implementation
Method Api.eval.run_eval does not contain ellipsis (...) in its implementation
```
Methods that are not implemented are exempt.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Adds the API to query metrics from telemetry.
## Test Plan
llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Switch the SambaNova inference adapter to LiteLLM to simplify
integration and solve issues with the current adapter around streaming
and tool calling; models and templates updated.
## Test Plan
```
pytest -s -v tests/integration/inference/test_text_inference.py \
  --stack-config=sambanova \
  --text-model=sambanova/Meta-Llama-3.3-70B-Instruct

pytest -s -v tests/integration/inference/test_vision_inference.py \
  --stack-config=sambanova \
  --vision-model=sambanova/Llama-3.2-11B-Vision-Instruct
```
# What does this PR do?
Mainly tried to cover the entire llama_stack/apis directory; we only
have one left. Some excludes were just no-ops.
Signed-off-by: Sébastien Han <seb@redhat.com>
# Issue
Closes #2073
# What does this PR do?
- Removes the `datasets.rst` from the list of document urls as it no
longer exists in torchtune. Referenced PR:
https://github.com/pytorch/torchtune/pull/1781
- Added a step to run `uv sync`. Previously, I would get the following
error:
```
➜ llama-stack git:(remove-deprecated-rst) uv venv --python 3.10
source .venv/bin/activate
Using CPython 3.10.13 interpreter at: /usr/bin/python3.10
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
(llama-stack) ➜ llama-stack git:(remove-deprecated-rst) INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
zsh: llama: command not found...
```
## Test Plan
To test: Run through `rag_agent` example in the `detailed_tutorial.md`
file.
# What does this PR do?
For issue [#2010](https://github.com/meta-llama/llama-stack/issues/2010).
Currently, if we try to connect the Llama Stack server to a remote
Milvus instance that has TLS enabled, the connection fails because TLS
support is not implemented in the Llama Stack codebase. As a result,
users are unable to use secured Milvus deployments out of the box.
After adding this, the user will be able to connect to a remote::milvus
instance that has TLS enabled.
If TLS is enabled:
```
vector_io:
  - provider_id: milvus
    provider_type: remote::milvus
    config:
      uri: "http://<host>:<port>"
      token: "<user>:<password>"
      secure: True
      server_pem_path: "path/to/server.pem"
```
## Test Plan
I have already tested it by connecting to a TLS-enabled Milvus instance,
and I was able to start the Llama Stack server.
# What does this PR do?
The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.
This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.
## Test Plan
Existing tests.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worthy of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Add support for the temperature parameter to the Responses API.
## Test Plan
Manually tested the simple case.
Unit tests added for the simple case and tool calls.
Signed-off-by: Derek Higgins <derekh@redhat.com>