# What does this PR do?
Reduces duplication and centralizes information so it is easier for
contributors to find.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
While adding other tests, I came across this and wasn’t sure how useful
it is. It doesn’t seem to be exercised anywhere in CI, but I figured I’d
fix it anyway. Happy to remove it if preferred. :)
## Test Plan
Run:
```
uv run pytest tests/integration/inference --stack-config=ollama --report=test_report.md -v --text-model="llama3.2:3b" --embedding-model=all-MiniLM-L6-v2
```
Look at the produced `test_report.md`.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Switch the SambaNova inference adapter to LiteLLM to simplify integration and
resolve issues with the current adapter around streaming and tool calling.
Models and templates have been updated.
## Test Plan
pytest -s -v tests/integration/inference/test_text_inference.py
--stack-config=sambanova
--text-model=sambanova/Meta-Llama-3.3-70B-Instruct
pytest -s -v tests/integration/inference/test_vision_inference.py
--stack-config=sambanova
--vision-model=sambanova/Llama-3.2-11B-Vision-Instruct
# What does this PR do?
**Fixes** #1959
HuggingFace provides several loading paths that the datasets library can
use. My theory on why the test previously failed intermittently is that
`load_dataset(...)` may try several options, such as the local cache, the
Hugging Face Hub, or a dataset script, and one of these options works
inconsistently in the CI.
The HuggingFace datasets library relies on the `transformers` package to
load certain datasets such as `llamastack/simpleqa`, and by adding the
package, we can see the dataset is loaded consistently via the Hugging
Face Hub.
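For reference, the load in question is just the standard datasets call (the
`split` argument below is only an example):
```python
from datasets import load_dataset

# With `transformers` installed, this resolves consistently via the Hugging Face Hub
# instead of intermittently falling back to other loading paths.
ds = load_dataset("llamastack/simpleqa", split="train")
print(ds[0])
```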
Please see PR in my fork demonstrating over 7 consecutive passes:
https://github.com/ChristianZaccaria/llama-stack/pull/1
**Some References:**
- https://github.com/huggingface/transformers/issues/8690
- https://huggingface.co/docs/datasets/en/loading
## Test Plan
# What does this PR do?
We've had this disabled for a while because it hasn't worked as well as
expected, given the frequent changes of llama_stack_client and the need to
keep both repos in sync.
## Test Plan
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). With better MCP
code interpreter servers becoming available, we should use them instead of
baking an implementation into the Stack and expanding the vulnerability
surface of the rest of the Stack.
This PR only does the removal. We will add examples of how to integrate with
MCP servers in subsequent PRs.
## Test Plan
Existing tests.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred in recent Python
releases.)
Note to reviewers: almost all changes here were generated automatically by
pyupgrade. Some additional unused imports were cleaned up. The only change
worth noting can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py`, where reflection code was updated to
deal with "newer" types.
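For context, the kind of adjustment needed looks roughly like the sketch
below (a simplified illustration, not the actual code in `schema.py`;
`unwrap_annotation` is a made-up name):
```python
import collections.abc
import types
import typing

def unwrap_annotation(hint):
    """Normalize newer-style type hints so reflection treats them like their older forms."""
    origin = typing.get_origin(hint)
    # `int | None` produces a types.UnionType instance, distinct from typing.Union.
    if isinstance(hint, types.UnionType) or origin is typing.Union:
        return typing.get_args(hint)
    # Async generators are now annotated as collections.abc.AsyncIterator[...].
    if origin is collections.abc.AsyncIterator:
        return typing.get_args(hint)
    return (hint,)

print(unwrap_annotation(int | None))                          # (int, NoneType)
print(unwrap_annotation(collections.abc.AsyncIterator[str]))  # (str,)
```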
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.
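As a rough sketch of the concept, chaining responses with an OpenAI client
could look like this (the base_url, model, and exact endpoint path are
assumptions for illustration; see the Test Plan for how this was actually
exercised):
```python
from openai import OpenAI

# base_url is an assumption for illustration; adjust to your Llama Stack server.
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

first = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="Name a city in France.",
)

# previous_response_id lets the server load the stored response and continue from it.
followup = client.responses.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    input="What is that city's population?",
    previous_response_id=first.id,
)
print(followup.output_text)
```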
## Test Plan
I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```
```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
tests/integration/openai_responses/test_openai_responses.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Implementation of the NeMo Datastore register and unregister APIs.
Open issues:
- `provider_id` gets set to `localfs` in `client.datasets.register()`, as it
is specified in `routing_tables.py: DatasetsRoutingTable`; see #1860.
Currently I have passed `"provider_id": "nvidia"` in metadata and parsed that
in `DatasetsRoutingTable`.
(Not the best approach, but just a quick workaround to make it work for now.)
## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items
tests/unit/providers/nvidia/test_datastore.py .. [100%]
============================================================ warnings summary ============================================================
====================================================== 2 passed, 1 warning in 0.84s ======================================================
```
cc: @dglogo, @mattf, @yanxi0830
# What does this PR do?
Allow users to name an agent and use the name in telemetry instead of
relying on randomly generated agent_ids. This improves the developer
experience by making it easier to find specific agents in telemetry
logs.
Closes #1832
## Test Plan
- Added tests to verify the agent name is properly stored and retrieved
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_name_filtering`
from the root of the project and made sure the tests pass
- Ran `uv run -- pytest -v
tests/integration/telemetry/test_telemetry.py::test_agent_query_spans`
to verify existing code without agent names still works correctly
## Use Example
```
agent = Agent(
    llama_stack_client,
    model=text_model_id,
    name="CustomerSupportAgent",  # New parameter
    instructions="You are a helpful customer support assistant",
)
session_id = agent.create_session(f"test-session-{uuid4()}")
```
## Implementation Notes
- Agent names are optional string parameters with no additional
validation
- Names are not required to be unique - multiple agents can have the
same name
- The agent_id remains the unique identifier for an agent
---------
Co-authored-by: raghotham <raghotham@gmail.com>
# What does this PR do?
TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.
This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.
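For reference, these are the kinds of requests that should now be handled
correctly (a generic OpenAI-client sketch; the base_url and model names are
placeholders, not the exact verification setup):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

# image_url content parts in a chat completion request
client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/dog.jpg"}},
        ],
    }],
)

# response_format requesting a JSON object
client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Give me a JSON object with a 'city' key."}],
    response_format={"type": "json_object"},
)
```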
As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.
The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.
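Conceptually, the cleanup works roughly like the sketch below (a simplified
illustration of the idea, not the exact code in `openai_compat.py`;
`_clean_param` is a made-up helper name):
```python
from typing import Any
from pydantic import BaseModel

def _clean_param(value: Any) -> Any:
    """Recursively drop None values and dump Pydantic models to plain dicts."""
    if isinstance(value, BaseModel):
        return _clean_param(value.model_dump(exclude_none=True))
    if isinstance(value, dict):
        return {k: _clean_param(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [_clean_param(v) for v in value if v is not None]
    return value

def prepare_openai_completion_params(**params: Any) -> dict[str, Any]:
    # Keep only parameters that were actually provided, cleaning nested structures.
    return {k: _clean_param(v) for k, v in params.items() if v is not None}
```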
With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.
And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.
## Test Plan
### OpenAI API Verification Tests
I ran the OpenAI API verification tests as below and 100% of the tests
passed.
First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.
To do that, first ensure you have the necessary API key environment variables set:
```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```
Then, run a Llama Stack server that serves up all these providers:
```
llama stack run \
--image-type venv \
tests/verifications/openai-api-verification-run.yaml
```
Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.
```
python tests/verifications/generate_report.py \
--run-tests \
--provider \
together \
fireworks \
groq \
openai \
together-llama-stack \
fireworks-llama-stack \
groq-llama-stack \
openai-llama-stack
```
You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.
### OpenAI Completion Integration Tests with vLLM:
I also ran the smaller `test_openai_completion.py` test suite (not yet merged
with the verification tests) on several of the providers, since I had to
adjust the method signature of openai_chat_completion a bit and thus had to
touch many of these providers to match. Here are the tests I ran there, all
passing:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### OpenAI Completion Integration Tests with ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
### OpenAI Completion Integration Tests with together.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```
### OpenAI Completion Integration Tests with fireworks.ai
```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```
in another terminal
```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`
The motivation is for evaluations targeting a local inference engine
(like meta-reference or vllm) where batch APIs provide for a substantial
amount of acceleration.
Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, setup routing based on that, etc. This does not
sound ideal.
So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.
## Test Plan
Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144
LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```
Then run the batch inference test case.
# What does this PR do?
This stubs in some OpenAI server-side compatibility with three new
endpoints:
/v1/openai/v1/models
/v1/openai/v1/completions
/v1/openai/v1/chat/completions
This gives common inference apps using OpenAI clients the ability to
talk to Llama Stack using an endpoint like
http://localhost:8321/v1/openai/v1 .
Having two "v1"s in the path isn't awesome, but the thinking is that Llama
Stack's API is v1 and our OpenAI compatibility layer is compatible with
OpenAI v1. Also, some OpenAI clients implicitly assume the URL ends with
"v1", so this gives maximum compatibility.
The openai models endpoint is implemented in the routing layer, and just
returns all the models Llama Stack knows about.
The following providers should be working with the new OpenAI
completions and chat/completions API:
* remote::anthropic (untested)
* remote::cerebras-openai-compat (untested)
* remote::fireworks (tested)
* remote::fireworks-openai-compat (untested)
* remote::gemini (untested)
* remote::groq-openai-compat (untested)
* remote::nvidia (tested)
* remote::ollama (tested)
* remote::openai (untested)
* remote::passthrough (untested)
* remote::sambanova-openai-compat (untested)
* remote::together (tested)
* remote::together-openai-compat (untested)
* remote::vllm (tested)
The goal is to support this for every inference provider, proxying directly
to the provider's OpenAI endpoint for OpenAI-compatible providers. For
providers that don't have an OpenAI-compatible API, we'll add a mixin to
translate incoming OpenAI requests to Llama Stack inference requests and
translate the Llama Stack inference responses to OpenAI responses.
This is related to #1817 but is a bit larger in scope than just chat
completions, as I have real use-cases that need the older completions
API as well.
## Test Plan
### vLLM
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```
### ollama
```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```
## Documentation
Run a Llama Stack distribution that uses one of the providers mentioned
in the list above. Then, use your favorite OpenAI client to send
completion or chat completion requests with the base_url set to
http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the
host and port of your Llama Stack server, if different.
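For example, with the official OpenAI Python client (the model name below is
a placeholder; use whatever your distribution serves):
```python
from openai import OpenAI

# Point the client at the Llama Stack OpenAI-compatible base URL.
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

# /v1/openai/v1/models: list everything Llama Stack knows about.
for model in client.models.list():
    print(model.id)

# /v1/openai/v1/completions: the older completions API.
completion = client.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",  # placeholder
    prompt="The capital of France is",
    max_tokens=10,
)
print(completion.choices[0].text)
```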
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Fixes issue #1537 that causes "500 Internal Server Error" when
unregistering a toolgroup
## Test Plan
```console
$ pytest -s -v tests/integration/tool_runtime/test_registration.py --stack-config=ollama --env INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
INFO 2025-03-14 21:15:03,999 tests.integration.conftest:41 tests: Setting DISABLE_CODE_SANDBOX=1 for macOS
/opt/homebrew/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
===================================================== test session starts =====================================================
platform darwin -- Python 3.10.16, pytest-8.3.5, pluggy-1.5.0 -- /opt/homebrew/opt/python@3.10/bin/python3.10
cachedir: .pytest_cache
rootdir: /Users/paolo/Projects/aiplatform/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.25.3, anyio-4.8.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 1 item
tests/integration/tool_runtime/test_registration.py::test_register_and_unregister_toolgroup[None-None-None-None-None] INFO 2025-03-14 21:15:04,478 llama_stack.providers.remote.inference.ollama.ollama:75 inference: checking
connectivity to Ollama at `http://localhost:11434`...
INFO 2025-03-14 21:15:05,350 llama_stack.providers.remote.inference.ollama.ollama:294 inference: Pulling embedding
model `all-minilm:latest` if necessary...
INFO: Started server process [78391]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:57424 - "GET /sse HTTP/1.1" 200 OK
INFO: 127.0.0.1:57434 - "GET /sse HTTP/1.1" 200 OK
INFO 2025-03-14 21:15:16,129 mcp.client.sse:51 uncategorized: Connecting to SSE endpoint: http://localhost:8000/sse
INFO: 127.0.0.1:57445 - "GET /sse HTTP/1.1" 200 OK
INFO 2025-03-14 21:15:16,146 mcp.client.sse:71 uncategorized: Received endpoint URL:
http://localhost:8000/messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b
INFO 2025-03-14 21:15:16,147 mcp.client.sse:140 uncategorized: Starting post writer with endpoint URL:
http://localhost:8000/messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO 2025-03-14 21:15:16,155 mcp.server.lowlevel.server:535 uncategorized: Processing request of type
ListToolsRequest
PASSED
=============================================== 1 passed, 4 warnings in 12.17s ================================================
```
---------
Signed-off-by: Paolo Dettori <dettori@us.ibm.com>
# What does this PR do?
## Test Plan
export MODEL=accounts/fireworks/models/llama4-scout-instruct-basic;
LLAMA_STACK_CONFIG=verification pytest -s -v tests/integration/inference
--vision-model $MODEL --text-model $MODEL
# What does this PR do?
Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.
Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.
## Test Plan
```
LLAMA_MODELS_DEBUG=1 \
with-proxy llama stack run meta-reference-gpu \
--env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
--env INFERENCE_CHECKPOINT_DIR=<DIR> \
--env MODEL_PARALLEL_SIZE=4 \
--env QUANTIZATION_TYPE=fp8_mixed
```
Start a server with and without quantization. Point integration tests to
it using:
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Running full Tool Calling required some updates to work e2e.
- Remove `python_start` and `python_end` tags
- Tool Call messages and Tool Response messages should end with
`<|eom|>`
- System prompt needed updates
```
You are a helpful assistant who can answer general questions or invoke tools when necessary.
In addition to tool calls, you should also augment your responses by using the tool outputs.
```
### Test Plan
- Start server with meta-reference
```
LLAMA_STACK_DISABLE_VERSION_CHECK=1 LLAMA_MODELS_DEBUG=1 INFERENCE_MODEL=meta-llama/$MODEL llama stack run meta-reference-gpu
```
- Added **NEW** tests with 5 test cases for multi-turn tool calls
```
pytest -s -v --stack-config http://localhost:8321 tests/integration/inference/test_text_inference.py --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
- Also verified all vision and agent tests pass
Previously, the integration tests started the server, but never really
used it because `--stack-config=ollama` uses the ollama template and the
inline "llama stack as library" client, not the HTTP client.
This PR makes sure we test it both ways.
We also add agents tests to the mix.
## Test Plan
GitHub
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* Added `--text-model` in example command.
* Added link to integration tests instruction and a note on specifying
models.
This is to avoid confusion when all tests are skipped because no model
is provided.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
https://github.com/meta-llama/llama-stack/pull/1828 removed
__root_span__ attribute which is still needed
## Test Plan
Added a telemetry integration test:
LLAMA_STACK_CONFIG=http://localhost:5001 pytest -s -v
tests/integration/telemetry --safety-shield meta-llama/Llama-Guard-3-8B
--text-model accounts/fireworks/models/llama-v3p3-70b-instruct
# What does this PR do?
- this is a flaky test dependent on model output
## Test Plan
<img width="853" alt="image"
src="https://github.com/user-attachments/assets/e7607877-22a9-48e3-adac-e991d1070ec0"
/>
# What does this PR do?
- we no longer query vector db when uploading documents as attachments
## Test Plan
```
pytest --stack-config="http://localhost:8321" -v tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.3-70B-Instruct
```
```
pytest --stack-config=fireworks -v tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.3-70B-Instruct --record-responses
```
<img width="1160" alt="image"
src="https://github.com/user-attachments/assets/90700f79-c002-4474-bb41-7bc0a39dc91c"
/>
# What does this PR do?
- We cannot directly return a literal type.
> Note: this is not the final jobs API change.
## Test Plan
<img width="837" alt="image"
src="https://github.com/user-attachments/assets/18a17561-35f9-443d-987d-54afdd6ff40c"
/>
# What does this PR do?
With the new /v1/providers API, /v1/inspect/providers is duplicative.
Deprecate it by removing the route, and add a test for the full
/v1/providers API.
Resolves #1623
## Test Plan
`uv run pytest -v tests/integration/providers --stack-config=ollama
--text-model="meta-llama/Llama-3.2-3B-Instruct"
--embedding-model=all-MiniLM-L6-v2`
<img width="1512" alt="Screenshot 2025-03-18 at 9 18 38 AM"
src="https://github.com/user-attachments/assets/2db30f25-3ff6-4374-b39d-0047f093fe36"
/>
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Fixes this test failure:
FAILED
tests/integration/tools/test_tools.py::test_toolsgroups_unregister[None]
- AttributeError: 'coroutine' object has no attribute 'data'
## Test Plan
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/tools/test_tools.py
---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1704).
* #1705
* __->__ #1704
# Summary:
Includes fixes to get test_agents working with an OpenAI model, e.g. tool
parsing and message conversion.
# Test Plan:
```
LLAMA_STACK_CONFIG=dev pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model openai/gpt-4o-mini
```
---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1550).
* #1556
* __->__ #1550
# What does this PR do?
Quick fix, as the dog.jpg path used by the vision_inference test has changed.
## Test Plan
# What does this PR do?
Currently the `inspect` API for providers is really a `list` API. This PR
creates a new `providers` API with a GET `providers/{provider_id}` inspect
endpoint, which returns "user friendly" configuration to the end user, and
adds a GET `/providers` endpoint which returns the list of providers, as
`inspect/providers` does today.
This API follows CRUD conventions and is more intuitive/RESTful.
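As a rough sketch, a client could hit the new endpoints like this (the host,
port, and `/v1` prefix are assumptions based on the defaults; the provider id
is just an example):
```python
import requests

BASE = "http://localhost:8321/v1"  # assumed default Llama Stack host/port

# List all providers (what /v1/inspect/providers returns today).
providers = requests.get(f"{BASE}/providers").json()
print(providers)

# Inspect a single provider's user-friendly (redacted) configuration.
ollama = requests.get(f"{BASE}/providers/ollama").json()
print(ollama)
```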
This work is part of the RFC at
https://github.com/meta-llama/llama-stack/pull/1359
Sensitive fields are redacted using `redact_sensetive_fields` on the server
side before returning a response:
<img width="456" alt="Screenshot 2025-03-13 at 4 40 21 PM"
src="https://github.com/user-attachments/assets/9465c221-2a26-42f8-a08a-6ac4a9fecce8"
/>
## Test Plan
Using https://github.com/meta-llama/llama-stack-client-python/pull/181, a
user is able to run the following:
`llama stack build --template ollama --image-type venv`
`llama stack run --image-type venv
~/.llama/distributions/ollama/ollama-run.yaml`
`llama-stack-client providers inspect ollama`
<img width="378" alt="Screenshot 2025-03-13 at 4 39 35 PM"
src="https://github.com/user-attachments/assets/8273d05d-8bc3-44c6-9e4b-ef95e48d5466"
/>
Also, I was able to run the new test_list integration test locally with
ollama:
<img width="1509" alt="Screenshot 2025-03-13 at 11 03 40 AM"
src="https://github.com/user-attachments/assets/9b9db166-f02f-45b0-86a4-306d85149bc8"
/>
Signed-off-by: Charlie Doern <cdoern@redhat.com>