# What does this PR do?
* Given that our API packages use `import *` in `__init__.py`, we don't
need to import from `llama_stack.apis.models.models` but simply from
`llama_stack.apis.models` (see the sketch below). The decision to use `import *` is debatable and
should probably be revisited at some point.
* Remove the unneeded Ruff F401 rule.
* Consolidate the Ruff F403 rule in pyproject.toml.
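For illustration, a minimal sketch of the resulting import style (assuming `Model` is one of the symbols re-exported by the package `__init__.py`):
```python
# Because llama_stack/apis/models/__init__.py does "from .models import *",
# the shorter package-level import resolves to the same class.
from llama_stack.apis.models import Model  # preferred
# from llama_stack.apis.models.models import Model  # no longer necessary
```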
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
These are a couple of fixes to get an example LangChain app working with
our OpenAI Responses API implementation.
The Responses API spec requires an annotations array in
`output[*].content[*].annotations` and we were not providing one. So,
this adds it as an empty list, even though we don't populate it yet. This
prevents errors from client libraries like LangChain that expect the
field to always exist, even if it is an empty list.
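For illustration, a hedged sketch of the output shape after this change (values abbreviated, not an exact transcript of our schema):
```python
# Abbreviated Responses output item; the annotations list is now always
# present, even when we have nothing to put in it yet.
output_item = {
    "type": "message",
    "role": "assistant",
    "content": [
        {
            "type": "output_text",
            "text": "Here is a positive news story from today...",
            "annotations": [],  # previously missing; now an empty list
        }
    ],
}
```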
The other fix: `web_search_preview` is a valid name for the web search
tool in the Responses API, but we previously only responded to `web_search`
or `web_search_preview_2025_03_11`.
## Test Plan
The existing Responses unit tests were expanded to test these cases,
via:
```
pytest -sv tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
The existing test_openai_responses.py integration tests still pass with
this change, tested as below with Fireworks:
```
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv tests/integration/agents/test_openai_responses.py \
--text-model accounts/fireworks/models/llama4-scout-instruct-basic
```
Lastly, this example LangChain app now works with Llama Stack (tested
with Ollama in the starter template in this case). This LangChain code
uses the example snippets for the Responses API at
https://python.langchain.com/docs/integrations/chat/openai/#responses-api
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8321/v1/openai/v1",
    api_key="fake",
    model="ollama/meta-llama/Llama-3.2-3B-Instruct",
)

tool = {"type": "web_search_preview"}
llm_with_tools = llm.bind_tools([tool])

response = llm_with_tools.invoke("What was a positive news story from today?")
print(response.content)
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Dropped Python 3.10 support, updated pyproject and dependencies, and removed
some blocks of code that special-cased `enum.StrEnum`.
Closes #2458
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Move the file search tool in Responses to use `vector_stores.search`,
which supports filters.
Closes #2435
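As a hedged sketch of what this enables (the vector store id and filter values are illustrative only), a Responses request can now pass filters to the file search tool:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

# File search tool with a filter, following the OpenAI Responses tool shape.
response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    input="What does the handbook say about vacation policy?",
    tools=[
        {
            "type": "file_search",
            "vector_store_ids": ["vs_123"],  # illustrative vector store id
            "filters": {"type": "eq", "key": "region", "value": "us"},
        }
    ],
)
print(response.output_text)
```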
## Test Plan
Added an e2e test with filters.
```
llama stack run llama_stack/templates/fireworks/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search and filters' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.3-70B-Instruct
```
# What does this PR do?
This is an initial working prototype of wiring up the `file_search`
builtin tool for the Responses API to our existing rag knowledge search
tool.
This is me seeing what I could pull together on top of the bits we
already have merged. This may not be the ideal way to implement this,
and things like how I shuffle the vector store ids from the original
response API tool request to the actual tool execution feel a bit hacky
(grep for `tool_kwargs["vector_db_ids"]` in `_execute_tool_call` to see
what I mean).
## Test Plan
I stubbed in some new tests to exercise this using text and pdf
documents.
Note that this currently lives under tests/verifications only because it
sometimes flakes with tool calling on the small Llama-3.2-3B model we
run in CI (and that I use as an example below). We'd want to make the
test a bit more robust in some way if we moved this over to
tests/integration and ran it in CI.
### OpenAI SaaS (to verify test correctness)
```
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=https://api.openai.com/v1 \
--model=gpt-4o
```
### Fireworks with faiss vector store
```
llama stack run llama_stack/templates/fireworks/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.3-70B-Instruct
```
### Ollama with faiss vector store
This sometimes flakes on Ollama because the quantized small model
doesn't always choose to call the tool to answer the user's question.
But, it often works.
```
ollama run llama3.2:3b
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run ./llama_stack/templates/ollama/run.yaml \
--image-type venv \
--env OLLAMA_URL="http://0.0.0.0:11434"
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.2-3B-Instruct
```
### OpenAI provider with sqlite-vec vector store
```
llama stack run ./llama_stack/templates/starter/run.yaml --image-type venv
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=openai/gpt-4o-mini
```
### Ensure existing vector store integration tests still pass
```
ollama run llama3.2:3b
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run ./llama_stack/templates/ollama/run.yaml \
--image-type venv \
--env OLLAMA_URL="http://0.0.0.0:11434"
LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io \
--text-model "meta-llama/Llama-3.2-3B-Instruct" \
--embedding-model=all-MiniLM-L6-v2
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This adds the missing `text` parameter to the Responses API, which is how
users control structured outputs. All we do with that parameter is map
it to the corresponding chat completion `response_format`.
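For example, a hedged sketch of how a caller might use the new parameter (the schema and names are illustrative); internally we just translate it into the chat completion `response_format`:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.responses.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    input="Extract the city and country from: 'Paris is the capital of France.'",
    # text.format controls structured outputs, mirroring response_format.
    text={
        "format": {
            "type": "json_schema",
            "name": "location",
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
                "required": ["city", "country"],
            },
        }
    },
)
print(response.output_text)
```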
## Test Plan
The new unit tests exercise the various permutations allowed for this
property, while a couple of new verification tests actually use it for
real to verify the model outputs are following the format as expected.
Unit tests:
`python -m pytest -s -v
tests/unit/providers/agents/meta_reference/test_openai_responses.py`
Verification tests:
```
llama stack run llama_stack/templates/together/run.yaml
pytest -s -vv 'tests/verifications/openai_api/test_responses.py' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Note that the verification tests can only be run with a real Llama Stack
server (as opposed to using the library client via
`--provider=stack:together`) because the Llama Stack python client is
not yet updated to accept this text field.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
I think the implementation needs more simplification. Spent way too much
time trying to get the tests to pass with models not co-operating :(
Finally had to switch to claude-sonnet to get things to pass reliably.
### Test Plan
```
export TAVILY_SEARCH_API_KEY=...
export OPENAI_API_KEY=...
uv run pytest -p no:warnings \
-s -v tests/verifications/openai_api/test_responses.py \
--provider=stack:starter \
--model openai/gpt-4o
```
This adds initial streaming support to the Responses API.
This PR makes sure that the _first_ inference call made to chat
completions streams out.
There's more to be done:
- tool call output tokens need to stream out when possible
- we need to loop through multiple rounds of inference and they all need
to stream out.
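A minimal usage sketch of the streaming behavior added here, assuming an OpenAI client pointed at the stack's OpenAI-compatible endpoint; only text delta events are handled:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

stream = client.responses.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    input="Tell me a short joke.",
    stream=True,
)
for event in stream:
    # Print text deltas as the first inference call streams out.
    if event.type == "response.output_text.delta":
        print(event.delta, end="", flush=True)
print()
```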
## Test Plan
Added a test. Executed as:
```
FIREWORKS_API_KEY=... \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Then, started a llama stack fireworks distro and tested against it like
this:
```
OPENAI_API_KEY=blah \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in place of the kv store.
## Test Plan
Added integration/unit tests.
# What does this PR do?
Add support for "instructions" to the responses API. Instructions
provide a way to swap out system (or developer) messages in new
responses.
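A minimal sketch of the intended usage, assuming an OpenAI client pointed at the stack (the model name is illustrative):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

# instructions replaces the system (or developer) message for this response.
response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    instructions="You are a pirate. Answer in pirate speak.",
    input="What is the capital of France?",
)
print(response.output_text)
```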
## Test Plan
Unit tests added.
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
We added:
* a check that docstrings are present with 'params' and 'returns' fields
* a check that fails if someone sets 'returns: None'
* fixes for the failing APIs
An illustrative docstring is sketched below.
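The function and types here are hypothetical; only the param/returns fields matter for the check:
```python
async def get_model(model_id: str) -> dict:
    """Retrieve a model by its identifier.

    :param model_id: The identifier of the model to look up.
    :returns: The matching model object.
    """
    return {"identifier": model_id}
```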
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This is a combination of what was previously 3 separate PRs - #2069,
#2075, and #2083. It turns out all 3 of those are needed to land a
working function calling Responses implementation. The web search
builtin tool was already working, but this wires in support for custom
function calling.
I ended up combining all three into one PR because they all had lots of
merge conflicts, both with each other and with #1806, which just landed.
And, landing any of them individually would have left only a partially
working implementation merged.
The new things added here are:
* Storing of input items from previous responses and restoring of those
input items when adding previous responses to the conversation state
* Handling of multiple input item message roles, not just "user"
messages.
* Support for custom tools passed into the Responses API to enable
function calling outside of just the builtin websearch tool.
Closes #2074, closes #2080
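As a hedged sketch of the custom function calling this enables (the tool definition and model are illustrative; the shape follows the OpenAI Responses spec):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.responses.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    input="What's the weather in Tokyo?",
    tools=[
        {
            "type": "function",
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
)
# If the model calls the tool, response.output contains a function_call item
# whose arguments the caller executes before continuing the conversation.
print(response.output)
```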
## Test Plan
### Unit Tests
Several new unit tests were added, and they all pass. Ran via:
```
python -m pytest -s -v tests/unit/providers/agents/meta_reference/test_openai_responses.py
```
### Responses API Verification Tests
I ran our verification run.yaml against multiple providers to ensure we
were getting a decent pass rate. Specifically, I ensured the new custom
tool verification test passed across multiple providers and that the
multi-turn examples passed across at least some of the providers (some
providers struggle with the multi-turn workflows still).
Running the stack setup for verification testing:
```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```
Together, passing 100% as an example:
```
pytest -s -v 'tests/verifications/openai_api/test_responses.py' --provider=together-llama-stack
```
## Documentation
We will need to start documenting the OpenAI APIs, but for now the
Responses work is still rapidly evolving, so we're delaying that.
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
This PR adds stubs to the end of the functions `create_agent_turn`,
`create_openai_response`, and `job_result`.
## Test Plan
Ran provided unit tests
# What does this PR do?
Mainly tried to cover the entire llama_stack/apis directory; we only
have one left. Some excludes were just no-ops.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth noting can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Add support for the temperature parameter in the Responses API.
## Test Plan
Manually tested the simple case.
Unit tests added for the simple case and for tool calls.
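A minimal sketch of the new parameter being forwarded to the underlying chat completion (client setup and model are illustrative):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.responses.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    input="Name three colors.",
    temperature=0.2,  # now passed through instead of being dropped
)
print(response.output_text)
```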
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.
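For illustration, a hedged sketch of the `previous_response_id` chaining this storage enables (client setup and model are illustrative):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

first = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="My name is Ada. Please remember it.",
)
second = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="What is my name?",
    previous_response_id=first.id,  # prior turn is loaded from the key-value store
)
print(second.output_text)
```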
## Test Plan
I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```
```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
tests/integration/openai_responses/test_openai_responses.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Allow users to name an agent and use the name in telemetry instead of
relying on randomly generated agent_ids. This improves the developer
experience by making it easier to find specific agents in telemetry
logs.
Closes #1832
## Test Plan
- Added tests to verify the agent name is properly stored and retrieved
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering`
from the root of the project and made sure the tests pass
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_query_spans`
to verify existing code without agent names still works correctly
## Use Example
```python
agent = Agent(
    llama_stack_client,
    model=text_model_id,
    name="CustomerSupportAgent",  # New parameter
    instructions="You are a helpful customer support assistant",
)
session_id = agent.create_session(f"test-session-{uuid4()}")
```
## Implementation Notes
- Agent names are optional string parameters with no additional
validation
- Names are not required to be unique - multiple agents can have the
same name
- The agent_id remains the unique identifier for an agent
---------
Co-authored-by: raghotham <raghotham@gmail.com>
# What does this PR do?
Don't set type variables from register_schema().
`mypy` is not happy about it since type variables are calculated at
runtime and hence the typing hints are not available during static
analysis.
The good news is that there is no good reason to set the variables from
the return type.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Add support for listing agents, describing an agent, and retrieving
session IDs for a given agent. This is only the API definition, the
implementations will come separately.
Closes: https://github.com/meta-llama/llama-stack/issues/1294
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling `SamplingParams()` directly in function argument defaults.
Instead, it either uses `Field(default_factory=SamplingParams)` for
Pydantic models or sets the default to `None` and instantiates
`SamplingParams` inside the function body when the argument is `None`.
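A minimal sketch of the two patterns described above (class and function names are illustrative, not the actual call sites):
```python
from pydantic import BaseModel, Field


class SamplingParams(BaseModel):
    temperature: float = 0.0


# Pydantic model: use a default_factory instead of a shared default instance.
class AgentConfigExample(BaseModel):
    sampling_params: SamplingParams = Field(default_factory=SamplingParams)


# Plain function: default to None and instantiate inside the body (avoids B008).
def generate(prompt: str, sampling_params: SamplingParams | None = None) -> str:
    if sampling_params is None:
        sampling_params = SamplingParams()
    return f"{prompt} (temperature={sampling_params.temperature})"
```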
Signed-off-by: Sébastien Han <seb@redhat.com>
# Summary:
Client side change in
https://github.com/meta-llama/llama-stack-client-python/pull/180
Changes the resume_turn API to accept `ToolResponse` instead of
`ToolResponseMessage`:
1. `ToolResponse` contains `metadata`
2. `ToolResponseMessage` is a concept for model inputs. Here we are just
submitting the outputs of tool execution.
# Test Plan:
Ran integration tests with a newly added test using a client tool with
metadata:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/agents/test_agents.py \
--safety-shield meta-llama/Llama-Guard-3-8B --record-responses
```
# What does this PR do?
- add some docs to OpenAPI for agents/eval/scoring/datasetio
## Test Plan
- read
# What does this PR do?
- Deprecate the `allow_turn_resume` flag as it was only used to stay
backward compatible.
- Closes https://github.com/meta-llama/llama-stack/issues/1363
## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/api/agents/test_agents.py --inference-model "meta-llama/Llama-3.3-70B-Instruct" --record-responses
```
<img width="1054" alt="image"
src="https://github.com/user-attachments/assets/d31de2d4-0953-41e1-a71a-7e1579fa351a"
/>
# Problem
Our current Agent framework has discrepancies in definition on how we
handle server side and client side tools.
1. Server tools: a single Turn is returned, including the `ToolExecutionStep`,
in agents.
2. Client tools: `create_agent_turn` is called in a loop, with the client
agent lib yielding the agent chunks:
ad6ffc63df/src/llama_stack_client/lib/agents/agent.py (L186-L211)
This makes working with server & client tools inconsistent. It also
complicates telemetry logging when trying to get information about an
agent's turns / history for observability.
#### Principle
The same `turn_id` should be used to represent the steps required to
complete a user message including client tools.
## Solution
1. Add an `AgentTurnResponseEventType.turn_awaiting_input` status to indicate
that the current turn is not complete and is awaiting tool input.
2. Add a `continue_agent_turn` endpoint to update the agent turn with the
client's tool response.
# What does this PR do?
- Skeleton API as example
## Test Plan
- Just API update, no functionality change
```
llama stack run + client-sdk test
```
<img width="842" alt="image"
src="https://github.com/user-attachments/assets/7ac56b5f-f424-4632-9476-7e0f57555bc3"
/>
# What does this PR do?
This issue was discovered in
https://github.com/meta-llama/llama-stack/pull/1009#discussion_r1947036518.
## Test Plan
This field is no longer required after the change.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
llama-models should have extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the llama
models and document the prompt formats, etc.
This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279
## Test Plan
Ensure all `llama` CLI `model` sub-commands work:
```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```
Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```
Create a fresh venv `uv venv && source .venv/bin/activate` and run
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv` <-- the server runs
Also checked that the OpenAPI generator can run and there is no change
in the generated files as a result.
```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```
# What does this PR do?
- Configured ruff linter to automatically fix import sorting issues.
- Set --exit-non-zero-on-fix to ensure non-zero exit code when fixes are
applied.
- Enabled the 'I' selection to focus on import-related linting rules.
- Ran the linter, and formatted all codebase imports accordingly.
- Removed the black dep from the "dev" group since we use ruff
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Remove `:path` from agent endpoints; we cannot have `:path` in params
inside endpoints except for the last one.
## Test Plan
```
llama stack run
```
# What does this PR do?
The current default system prompt for llama3.2 tends to overindex on
tool calling and doesn't work well when the prompt does not require tool
calling.
This PR adds an option to override the default system prompt, and
organizes tool-related configs into a new config object.
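A hedged sketch of the new option (field and value names follow this description and may not match the final API exactly):
```python
# Illustrative agent config dict; "replace" uses the custom instructions as-is,
# while the default behavior keeps the tool-calling-heavy system prompt.
agent_config = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",
    "instructions": "You are a concise assistant. Only call tools when strictly necessary.",
    "toolgroups": ["builtin::websearch"],
    "tool_config": {"system_message_behavior": "replace"},
}
```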
## Test Plan
```
LLAMA_STACK_CONFIG=together pytest --inference-model=meta-llama/Llama-3.3-70B-Instruct -s -v \
tests/client-sdk/agents/test_agents.py::test_override_system_message_behavior
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
The lint check in the main branch is failing. This fixes the lint check after we
moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We
need to move to a `ruff.toml` file as well as fix and ignore some
additional checks.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
We desperately need to document our APIs. This is the basic requirement
of having a Spec :)
This PR updates the OpenAPI generator so documentation for request
parameters and object fields can be properly added to the OpenAPI specs.
From there, this should get picked up by Stainless, etc.
## Test Plan:
Updated client-sdk (See
https://github.com/meta-llama/llama-stack-client-python/pull/104) and
then ran:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=../../llama_stack/templates/fireworks/run.yaml pytest -s -v inference/test_inference.py agents/test_agents.py
```
# What does this PR do?
Add response format for agents' structured output.
- Using structured output for agents (interior_design app as an example):
https://github.com/meta-llama/llama-stack-apps/issues/122
## Test Plan
E2E test plan with llama-stack-apps interior_design.
Start your distro:
```
llama stack run llama_stack/templates/fireworks/run.yaml --env FIREWORKS_API_KEY=<API_KEY>
```
Run the api test:
```
PYTHONPATH=. python examples/interior_design_assistant/api.py localhost 5000 examples/interior_design_assistant/resources/documents/ examples/interior_design_assistant/resources/images/fireplaces
```
## Sources
Results:
https://github.com/meta-llama/llama-stack-client-python/pull/72
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
Previously the tests hard-coded the tool prompt format to json, which
caused failures when using the 3.2/3.3 family of models. This change
makes the default None in the agent config and removes the specification
from the tests.
## Test Plan
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py
```
Some small updates to the inference types to make them more standard.
Specifically:
- image data is now located in an "image" subkey
- similarly, tool call data is located in a "tool_call" subkey
The pattern followed is `dict(type="foo", foo=<...>)`.
See https://github.com/meta-llama/llama-stack/issues/827 for the broader
design.
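For illustration, hedged examples of the `dict(type="foo", foo=<...>)` pattern (field contents are simplified, not the exact schema):
```python
# Image content: data now lives under an "image" subkey.
image_item = {"type": "image", "image": {"url": "https://example.com/cat.png"}}

# Tool call content: data now lives under a "tool_call" subkey.
tool_call_item = {
    "type": "tool_call",
    "tool_call": {"name": "get_weather", "arguments": {"city": "Paris"}},
}
```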
This is the first part:
- delete other kinds of memory banks (keyvalue, keyword, graph) for now;
we will introduce a keyvalue store API as part of this design but not
use it in the RAG tool yet.
- rename the APIs