Summary:
Currently we don't set the best tool_prompt_format according to model as
promisd.
Test Plan:
Added print around raw model input and inspected manually
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1214).
* #1234
* __->__ #1214
# What does this PR do?
When building providers in a virtual environment or containers, special
pip dependencies may not always be provided (e.g., for Ollama). The
check should only fail if the required number of arguments is missing.
Currently, two arguments are mandatory:
1. Environment name
2. Pip dependencies
Additionally, return statements were replaced with sys.exit(1) in error
conditions to ensure immediate termination on critical failures. Error
handling in the stack build process was also improved to guarantee the
program exits with status 1 when facing configuration issues or build
failures.
Signed-off-by: Sébastien Han <seb@redhat.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
This command shouldn't fail:
```
llama stack build --template ollama --image-type venv
```
[//]: # (## Documentation)
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
--run runs the stack that was just build using the same arguments during
the build process (image-name, type, etc)
This simplifies the workflow a lot and makes the UX better for most
local users trying to get started rather than having to match the flags
of the two commands (build and then run)
Also, moved `ImageType` to distribution.utils since there were circular
import errors with its old location
## Test Plan
tested locally using the following command:
`llama stack build --run --template ollama --image-type venv`
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
`llama model list` or `llama model list --show-all` will list more or
all for the models, so add the `search` option to simplify the output.
```
$ llama model list --help
usage: llama model list [-h] [--show-all] [-s SEARCH]
Show available llama models
options:
-h, --help show this help message and exit
--show-all Show all models (not just defaults)
-s SEARCH, --search SEARCH
Search for the input string as a substring in the model descriptor(ID)
$ llama model list -s 70b
+-----------------------+-----------------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo | Context Length |
+-----------------------+-----------------------------------+----------------+
| Llama3.1-70B | meta-llama/Llama-3.1-70B | 128K |
+-----------------------+-----------------------------------+----------------+
| Llama3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | 128K |
+-----------------------+-----------------------------------+----------------+
| Llama3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct | 128K |
+-----------------------+-----------------------------------+----------------+
$ llama model list -s 3.1-8b
+----------------------+----------------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo | Context Length |
+----------------------+----------------------------------+----------------+
| Llama3.1-8B | meta-llama/Llama-3.1-8B | 128K |
+----------------------+----------------------------------+----------------+
| Llama3.1-8B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 128K |
+----------------------+----------------------------------+----------------+
$ llama model list --show-all -s pro
+----------------------+-----------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo | Context Length |
+----------------------+-----------------------------+----------------+
| Prompt-Guard-86M | meta-llama/Prompt-Guard-86M | 2K |
+----------------------+-----------------------------+----------------+
$ llama model list -s k
Not found for search.
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
This PR begins the process of supporting non-llama models within Llama
Stack. We start simple by adding support for this functionality within a
few existing providers: fireworks, together and ollama.
## Test Plan
```bash
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model accounts/fireworks/models/phi-3-vision-128k-instruct
```
^ this passes most of the tests but as expected fails the tool calling
related tests since they are very specific to Llama models
```
inference/test_text_inference.py::test_text_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct-completion-01] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet do humans live on?-Earth] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet has rings around it with a name starting w
ith letter S?-Saturn] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What's the name of the Sun in latin?-Sol] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What is the name of the US captial?-Washington] PASSED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_chat_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct] ERROR
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-True] PASSED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-False] PASSED
```
# What does this PR do?
- Enable mypy to run in the CI on a subset of the repository
- Fix a few mypy errors
- Run mypy from pre-commit
Signed-off-by: Sébastien Han <seb@redhat.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Sébastien Han <seb@redhat.com>
Summary:
Allows tools to output metadata. This is useful for evaluating tool
outputs, e.g. RAG tool will output document IDs, which can be used to
score recall.
Will need to make a similar change on the client side to support
ClientTool outputting metadata.
Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/client-sdk/agents/test_agents.py
# Problem
Our current Agent framework has discrepancies in definition on how we
handle server side and client side tools.
1. Server Tools: a single Turn is returned including `ToolExecutionStep`
in agenst
2. Client Tools: `create_agent_turn` is called in loop with client agent
lib yielding the agent chunk
ad6ffc63df/src/llama_stack_client/lib/agents/agent.py (L186-L211)
This makes it inconsistent to work with server & client tools. It also
complicates the logs to telemetry to get information about agents turn /
history for observability.
#### Principle
The same `turn_id` should be used to represent the steps required to
complete a user message including client tools.
## Solution
1. `AgentTurnResponseEventType.turn_awaiting_input` status to indicate
that the current turn is not completed, and awaiting tool input
2. `continue_agent_turn` endpoint to update agent turn with client's
tool response.
# What does this PR do?
- Skeleton API as example
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
- Just API update, no functionality change
```
llama stack run + client-sdk test
```
<img width="842" alt="image"
src="https://github.com/user-attachments/assets/7ac56b5f-f424-4632-9476-7e0f57555bc3"
/>
[//]: # (## Documentation)
Embedding models are tiny and can be pulled on-demand. Let's do that so
the user doesn't have to do "yet another thing" to get themselves set
up.
Thanks @hardikjshah for the suggestion.
Also fixed a build dependency miss (TODO: distro_codegen needs to
actually check that the build template contains all providers mentioned
for the run.yaml file)
## Test Plan
First run `ollama rm all-minilm:latest`.
Run `llama stack build --template ollama && llama stack run ollama --env
INFERENCE_MODEL=llama3.2:3b-instruct-fp16`. See that it outputs a
"Pulling embedding model `all-minilm:latest`" output and the stack
starts up correctly. Verify that `ollama list` shows the model is
correctly downloaded.
# What does this PR do?
Addresses issues where the container is unable to run as root. Gives
write access to required folders.
[//]: # (If resolving an issue, uncomment and update the line below)
(Closes #[1207])
## Test Plan
I built locally and ran `llama stack build --template remote-vllm
--image-type container` and validated I could see my changes in the
output:
```
#11 1.186 Installed 11 packages in 61ms
#11 1.186 + llama-models==0.1.3
#11 1.186 + llama-stack==0.1.3
#11 1.186 + llama-stack-client==0.1.3
#11 1.186 + markdown-it-py==3.0.0
#11 1.186 + mdurl==0.1.2
#11 1.186 + prompt-toolkit==3.0.50
#11 1.186 + pyaml==25.1.0
#11 1.186 + pygments==2.19.1
#11 1.186 + rich==13.9.4
#11 1.186 + tiktoken==0.9.0
#11 1.186 + wcwidth==0.2.13
#11 DONE 1.6s
#12 [ 9/10] RUN mkdir -p /.llama /.cache
#12 DONE 0.3s
#13 [10/10] RUN chmod -R g+rw /app /.llama /.cache
#13 DONE 0.3s
#14 exporting to image
#14 exporting layers
#14 exporting layers 3.7s done
#14 writing image sha256:11cc8bd954db6d036037bcaf471b173ddd5261ac4b1e72074cccf85d18aefb96 done
#14 naming to docker.io/library/distribution-remote-vllm:0.1.3 done
#14 DONE 3.7s
+ set +x
Success!
```
This is what the resulting image looks like:

Also tagged the image as `0.1.3-test` and [pushed to
quay](https://quay.io/repository/jland/distribution-remote-vllm?tab=tags)
(note there are a bunch of critical vulnerabilities we may want to look
into)
And for good measure I deployed the resulting image on my Openshift
environment using the default Security Context and validated that there
were no issue with it coming up.
My validation was all done with the `vllm-remote` distribution, but if I
am understanding everything correctly the other distributions are just
different run.yaml configs.
[//]: # (## Documentation)
Please let me know if there is anything else I need to do.
Co-authored-by: Jamie Land <hokie10@gmail.com>
# What does this PR do?
This fixes the following error:
```
$ make html
/bin/sh: line 1: sphinx-build: command not found
make: *** [Makefile:20: html] Error 127
```
Also clarifies that this command only rebuilds the website without
watching/refreshes.
## Test Plan
New command works.
---------
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
Currently , `model` in `--downloaded` just use the directory(already
replace `:`), so covert back to descriptor keep the same with ` llama
model list`, and remove command also use `descriptor`.
```
before:
$ llama model list --downloaded
+-------------------------------------+----------+---------------------+
| Model | Size | Modified Time |
+-------------------------------------+----------+---------------------+
| Llama3.2-1B-Instruct-int4-qlora-eo8 | 1.53 GB | 2025-02-20 16:32:49 |
+-------------------------------------+----------+---------------------+
after:
$ llama model list --downloaded
+-------------------------------------+----------+---------------------+
| Model | Size | Modified Time |
+-------------------------------------+----------+---------------------+
| Llama3.2-1B-Instruct:int4-qlora-eo8 | 1.53 GB | 2025-02-20 16:32:49 |
+-------------------------------------+----------+---------------------+
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
- Updates ImageContentItemImageURL import
- fixes `embedding_dimensions` metadata param
## Test Plan
- Ran pytest locally, verified embedding tests pass with new types

cc: @dglogo @sumitb
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
When tried to use `configure`, and found it `DEPRECATED`, and found pr
https://github.com/meta-llama/llama-stack/pull/371 to remove it, not
sure why not remove the `configure.py`?
```
$ llama stack configure /tmp/test.yaml
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config
llama stack configure: error:
DEPRECATED! llama stack configure has been deprecated.
Please use llama stack run <path/to/run.yaml> instead.
Please see example run.yaml in /distributions folder.
```
It would better better to tell when user check it how to use with
`--help` first:
```
before:
$ llama stack configure --help
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config
Configure a llama stack distribution
positional arguments:
after:
$ llama stack configure --help
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config
Configure a llama stack distribution
DEPRECATED! llama stack configure has been deprecated.
Please use llama stack run <path/to/run.yaml> instead.
Please see example run.yaml in /distributions folder.
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
add a subcommand, help to clean the unneeded model:
```
$ llama model --help
usage: llama model [-h] {download,list,prompt-format,describe,verify-download,remove} ...
Work with llama models
options:
-h, --help show this help message and exit
$ llama model remove --help
usage: llama model remove [-h] -m MODEL [-f]
Remove the downloaded llama model
options:
-h, --help show this help message and exit
-m MODEL, --model MODEL
Specify the llama downloaded model name
-f, --force Used to forcefully remove the llama model from the storage without further confirmation
$ llama model remove -m Llama3.2-1B-Instruct:int4-qlora-eo8
Are you sure you want to remove Llama3.2-1B-Instruct:int4-qlora-eo8? (y/n): n
Removal aborted.
$ llama model remove -mLlama3.2-1B-Instruct:int4-qlora-eo8-f
Llama3.2-1B-Instruct:int4-qlora-eo8 removed.
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
fixes test to use new name for URL import
## Test Plan
`LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v
tests/client-sdk/inference/test_embedding.py --embedding-model
baai/bge-m3`
See Issue #922
The change is slightly backwards incompatible but no callsite (in our
client codebases or stack-apps) every passes a depth-2
`List[List[InterleavedContentItem]]` (which is now disallowed.)
## Test Plan
```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
--inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k together test_embeddings.py \
--inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
--inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```
Also ran `tests/client-sdk/inference/test_embeddings.py`
Summary:
Need this to format the completion message with tool_calls correctly.
See added unittest.
Test Plan:
python -m unittest
llama_stack.providers.tests.inference.test_prompt_adapter
# What does this PR do?
Refine the existing update-readthedocs.yml workflow to enhance
automation and reliability. Updates include:
- Expanding path triggers to cover all documentation files (docs/**) and
build artifacts.
- Adding steps to set up Python (3.11), install uv, sync dependencies,
and build HTML using make html.
- Ensuring the ReadTheDocs build trigger only runs on
workflow_dispatch events.
These improvements help validate website builds in PRs, preventing
issues before merging.
Signed-off-by: Sébastien Han <seb@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The `tool_name` attribute of `ToolDefinition` instances can either be a
str or a BuiltinTool enum type. This fixes the remote vLLM provider to
use the value of those BuiltinTool enums when serializing to JSON
instead of attempting to serialize the actual enum to JSON.
Reference of how this is handled in some other areas, since I followed
that same pattern for the remote vLLM provider here:
- [remote nvidia
provider](https://github.com/meta-llama/llama-stack/blob/v0.1.3/llama_stack/providers/remote/inference/nvidia/openai_utils.py#L137-L140)
- [meta reference
provider](https://github.com/meta-llama/llama-stack/blob/v0.1.3/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L635-L636)
There is opportunity to potentially reconcile the remove nvidia and
remote vllm bits where they are both translating Llama Stack Inference
APIs to OpenAI client requests, but that's a can of worms I didn't want
to open for this bug fix.
This explicitly fixes this error when using the remote vLLM provider and
the agent tests:
```
TypeError: Object of type BuiltinTool is not JSON serializable
```
So, this is related to #1144 and addresses the immediate issue raised
there. With this fix,
`tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search`
now gets past the JSON serialization error when using the remote vLLM
provider and actually attempts to call the web search tool. I don't have
any API keys setup for the actual web search providers yet, so I cannot
verify everything works after that point.
## Test Plan
I ran the `test_builtin_tool_web_search` locally with the remote vLLM
provider like:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search --inference-model "meta-llama/Llama-3.2-3B-Instruct"
```
Before my change, that reproduced the `TypeError: Object of type
BuiltinTool is not JSON serializable` error. After my change, that error
is gone and the test actually attempts the web search. That failed for
me locally, due to lack of API key, but it gets past the JSON
serialization error.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
```
make html
/bin/sh: line 1: sphinx-build: command not found
make: *** [Makefile:20: html] Error 127
```
## Test Plan
Tested the command `uv run
./docs/openapi_generator/run_openapi_generator.sh` successfully.
# What does this PR do?
add /v1/inference/embeddings implementation to NVIDIA provider
**open topics** -
- *asymmetric models*. NeMo Retriever includes asymmetric models, which
are models that embed differently depending on if the input is destined
for storage or lookup against storage. the /v1/inference/embeddings api
does not allow the user to indicate the type of embedding to perform.
see https://github.com/meta-llama/llama-stack/issues/934
- *truncation*. embedding models typically have a limited context
window, e.g. 1024 tokens is common though newer models have 8k windows.
when the input is larger than this window the endpoint cannot perform
its designed function. two options: 0. return an error so the user can
reduce the input size and retry; 1. perform truncation for the user and
proceed (common strategies are left or right truncation). many users
encounter context window size limits and will struggle to write reliable
programs. this struggle is especially acute without access to the
model's tokenizer. the /v1/inference/embeddings api does not allow the
user to delegate truncation policy. see
https://github.com/meta-llama/llama-stack/issues/933
- *dimensions*. "Matryoshka" embedding models are available. they allow
users to control the number of embedding dimensions the model produces.
this is a critical feature for managing storage constraints. embeddings
of 1024 dimensions what achieve 95% recall for an application may not be
worth the storage cost if a 512 dimensions can achieve 93% recall.
controlling embedding dimensions allows applications to determine their
recall and storage tradeoffs. the /v1/inference/embeddings api does not
allow the user to control the output dimensions. see
https://github.com/meta-llama/llama-stack/issues/932
## Test Plan
- `llama stack run llama_stack/templates/nvidia/run.yaml`
- `LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v
tests/client-sdk/inference/test_embedding.py --embedding-model
baai/bge-m3`
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>