Commit graph

748 commits

Author SHA1 Message Date
Reid
56c1a50b86
fix: fix the describe table display issue (#1221)
# What does this PR do?

If `headers` is not passed, the first row displays empty and the second row may break; this change passes the `Model` row as the `headers`.
```
Before:
$ llama model describe -m Llama3.1-70B
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                             ┃                                ┃ <<<---------
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Model             │ Llama3.1-70B         │   <<<---------
├─────────────────────────────┼────────────────────────────────┤
│ Hugging Face ID             │ meta-llama/Llama-3.1-70B       │
├─────────────────────────────┼────────────────────────────────┤
│ Description                 │ Llama 3.1 70b model            │
├─────────────────────────────┼────────────────────────────────┤
......

After:
$ llama model describe -m Llama3.1-70B
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Model                       ┃ Llama3.1-70B                   ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Hugging Face ID             │ meta-llama/Llama-3.1-70B       │
├─────────────────────────────┼────────────────────────────────┤
│ Description                 │ Llama 3.1 70b model            │
├─────────────────────────────┼────────────────────────────────┤
......
```
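
A minimal sketch of the header fix, using `rich` directly (the CLI's own table helper may differ); the point is that the first row becomes the table headers instead of leaving the header row blank:

```python
from rich.console import Console
from rich.table import Table

rows = [
    ("Model", "Llama3.1-70B"),            # first row doubles as the header
    ("Hugging Face ID", "meta-llama/Llama-3.1-70B"),
    ("Description", "Llama 3.1 70b model"),
]

table = Table(*rows[0])                    # headers come from the first row
for row in rows[1:]:
    table.add_row(*row)
Console().print(table)
```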

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-25 21:34:53 -08:00
Sébastien Han
929c5f0842
refactor(server): replace print statements with logger (#1250)
# What does this PR do?

- Introduced logging in `StackRun` to replace print-based messages
- Improved error handling for config file loading and parsing
- Replaced `cprint` with `logger.error` for consistent error messaging
- Ensured logging is used in `server.py` for startup, shutdown, and
runtime messages
- Added missing exception handling for invalid providers
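
A minimal sketch of the print-to-logger swap described above (illustrative names, not the exact `server.py` code):

```python
import logging

logger = logging.getLogger(__name__)


def load_run_config(path: str) -> dict:
    """Load and parse the run config, logging errors instead of printing them."""
    import yaml

    try:
        with open(path) as f:
            return yaml.safe_load(f)
    except FileNotFoundError:
        # Previously a print()/cprint() call; logger keeps error messaging consistent.
        logger.error("Config file not found: %s", path)
        raise
    except yaml.YAMLError as exc:
        logger.error("Could not parse config file %s: %s", path, exc)
        raise
```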

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-25 21:31:37 -08:00
Hardik Shah
c0c7622295
fix: dont assume SentenceTransformer is imported
as titled
2025-02-25 16:53:01 -08:00
Vladislav Bronzov
967cff4533
feat: Add Groq distribution template (#1173)
# What does this PR do?

Create a distribution template using Groq as inference provider.
Link to issue: https://github.com/meta-llama/llama-stack/issues/958


## Test Plan
Run `python llama_stack/scripts/distro_codegen.py` to generate run.yaml
and build.yaml
Test the newly created template by running
`llama stack build --template <template-name>`
`llama stack run <template-name>`
2025-02-25 14:16:56 -08:00
LESSuseLESS
3a31611486
feat: completing text /chat-completion and /completion tests (#1223)
# What does this PR do?

The goal is to have a fairly complete set of provider and e2e tests for
/chat-completion and /completion. This is the current list, generated with:
```
grep -oE "def test_[a-zA-Z_+]*" llama_stack/providers/tests/inference/test_text_inference.py | cut -d' ' -f2
```
- test_model_list
- test_text_completion_non_streaming
- test_text_completion_streaming
- test_text_completion_logprobs_non_streaming
- test_text_completion_logprobs_streaming
- test_text_completion_structured_output
- test_text_chat_completion_non_streaming
- test_text_chat_completion_structured_output
- test_text_chat_completion_streaming
- test_text_chat_completion_with_tool_calling
- test_text_chat_completion_with_tool_calling_streaming

```
grep -oE "def test_[a-zA-Z_+]*" tests/client-sdk/inference/test_text_inference.py | cut -d' ' -f2
```
- test_text_completion_non_streaming
- test_text_completion_streaming
- test_text_completion_log_probs_non_streaming
- test_text_completion_log_probs_streaming
- test_text_completion_structured_output
- test_text_chat_completion_non_streaming
- test_text_chat_completion_streaming
- test_text_chat_completion_with_tool_calling_and_non_streaming
- test_text_chat_completion_with_tool_calling_and_streaming
- test_text_chat_completion_with_tool_choice_required
- test_text_chat_completion_with_tool_choice_none
- test_text_chat_completion_structured_output
- test_text_chat_completion_tool_calling_tools_not_in_request

## Test Plan

== Set up Ollama local server
```
OLLAMA_HOST=127.0.0.1:8321 with-proxy ollama serve
OLLAMA_HOST=127.0.0.1:8321 ollama run llama3.2:3b-instruct-fp16 --keepalive 60m
```

==  Run a provider test
```
conda activate stack
OLLAMA_URL="http://localhost:8321" \
pytest -v -s -k "ollama" --inference-model="llama3.2:3b-instruct-fp16" \
llama_stack/providers/tests/inference/test_text_inference.py::TestInference
```

== Run an e2e test
```
conda activate sherpa
with-proxy pip install llama-stack
export INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=8322
with-proxy llama stack build --template ollama
with-proxy llama stack run --env OLLAMA_URL=http://localhost:8321 ollama
```
```
conda activate stack
LLAMA_STACK_PORT=8322 LLAMA_STACK_BASE_URL="http://localhost:8322" \
pytest -v -s --inference-model="llama3.2:3b-instruct-fp16" \
tests/client-sdk/inference/test_text_inference.py
```
2025-02-25 11:37:04 -08:00
Charlie Doern
9b130f96a7
fix: build_venv expects an extra argument (#1233)
# What does this PR do?


Currently, `build_venv.sh` expects a `distribution_type` as its first argument, but the only things ever passed are:

1. image name
2. pip dependencies

So `distribution_type` is never passed in, meaning the script errors when calling something like:

`llama stack build --image-type venv --template ollama --image-name
test`

Before:

```
llama stack build --image-type venv --template ollama --image-name venv-test
Usage: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/distribution/build_venv.sh <distribution_type> <env_name> <pip_dependencies> [<special_pip_deps>]
Example: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/distribution/build_venv.sh <distribution_type> mybuild ./my-stack-build.yaml 'numpy pandas scipy'
Failed to build target venv-test with return code 1
Run config path is empty
```
After:

```
llama stack build --image-type venv --template ollama --image-name venv-test
Environment 'venv-test' already exists, re-using it.
Using virtual environment venv-test
Using CPython 3.13.0 interpreter at: /opt/homebrew/opt/python@3.13/bin/python3.13
Creating virtual environment at: venv-test
Activate with: source venv-test/bin/activate
Using Python 3.13.0 environment at: venv-test
Resolved 55 packages in 640ms
      Built fire==0.7.0
Prepared 54 packages in 1.14s
Installed 55 packages in 82ms
 + annotated-types==0.7.0
 ```

## Test Plan

ran locally with output above

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-02-25 11:08:50 -08:00
Sébastien Han
c223b1862b
fix: resolve type hint issues and import dependencies (#1176)
# What does this PR do?

- Fixed type hinting and missing imports across multiple modules.
- Improved compatibility by using `TYPE_CHECKING` for conditional
imports.
- Updated `pyproject.toml` to enforce stricter linting.
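
A small sketch of the `TYPE_CHECKING` pattern referenced above (the imported module here is illustrative, not one actually touched by this PR):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by type checkers (mypy, pyright); skipped at runtime.
    from argparse import ArgumentParser


def describe(parser: "ArgumentParser") -> str:
    # The quoted annotation avoids a runtime NameError since the import above never runs.
    return parser.prog
```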

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-02-25 11:06:47 -08:00
Yuan Tang
1a044ef894
fix: Raise exception when tool call result is None (#1253)
# What does this PR do?

When there are issues with the tool call function, an exception is
raised but the error message is not informative. This adds a clearer
message to tell users to check their functions.
```
Traceback (most recent call last):
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/distribution/server/server.py", line 208, in sse_generator
    async for item in event_gen:
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agents.py", line 165, in _create_agent_turn_streaming
    async for event in agent.create_and_execute_turn(request):
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 197, in create_and_execute_turn
    async for chunk in self.run(
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 389, in run
    async for res in self._run(
  File "/Users/phayes/projects/llama-stack/llama-stack/llama_stack/providers/inline/agents/meta_reference/agent_instance.py", line 811, in _run
    content=tool_result.content,
AttributeError: 'NoneType' object has no attribute 'content'
```
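
A minimal sketch of the kind of guard this adds (simplified; the real check lives in the agent execution code):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolResult:
    content: str


def unwrap_tool_result(result: Optional[ToolResult]) -> str:
    if result is None:
        # Previously surfaced as: AttributeError: 'NoneType' object has no attribute 'content'
        raise ValueError("Tool call returned None; check that your tool function returns a result.")
    return result.content
```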

## Test Plan

Ran the same script and exception is raised with clearer error message.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-25 13:10:50 -05:00
Jeff Tang
73a0c7a0e7
LocalInferenceImpl update for LS013 (#1242)
2025-02-25 09:58:34 -08:00
ehhuang
dc3c881ffe
fix: include timezone in Agent steps' timestamps (#1247)
Summary:

The Kotlin SDK expects this format.

Test Plan:

Python prints the expected format:
>>> str(datetime.now().astimezone())
'2025-02-24 22:02:58.729763-08:00'
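
A self-contained illustration of the change (standard library only): attaching the local timezone makes the serialized timestamp carry a UTC offset, which is what the Kotlin SDK parses.

```python
from datetime import datetime, timezone

naive = datetime.now()                 # e.g. 2025-02-24 22:02:58.729763 (no offset)
aware = datetime.now().astimezone()    # e.g. 2025-02-24 22:02:58.729763-08:00
explicit_utc = datetime.now(timezone.utc)

print(naive.isoformat())               # no timezone information
print(aware.isoformat())               # includes the local offset, e.g. -08:00
print(explicit_utc.isoformat())        # includes +00:00
```
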
2025-02-25 09:49:25 -08:00
Charlie Doern
4684fd3f8d
refactor: combine start scripts for each env (#1139)
# What does this PR do?

Now that llama stack supports running in venv, conda, and container modes and the three start scripts overlap a lot, combine them into one `start_stack.sh` script.

## Test Plan

Tested this locally with venv, conda, and container modes.

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-24 16:53:31 -08:00
Ashwin Bharambe
9b0f783e54
test: add a ci-tests distro template for running e2e tests (#1237) 2025-02-24 14:43:21 -08:00
ehhuang
14c38acf97
fix: set default tool_prompt_format in inference api (#1214)
Summary:
Currently we don't set the best `tool_prompt_format` according to the model, as promised.
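
A hedged sketch of what a per-model default could look like (the mapping below is illustrative, not the exact one this PR implements):

```python
def default_tool_prompt_format(model_id: str) -> str:
    # Assumption for illustration: some model families prefer a python-list style
    # tool prompt while others work best with JSON.
    if "Llama3.2" in model_id:
        return "python_list"
    return "json"
```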

Test Plan:
Added print around raw model input and inspected manually
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1214).
* #1234
* __->__ #1214
2025-02-24 12:38:37 -08:00
Sébastien Han
c4987bc349
fix: avoid failure when no special pip deps and better exit (#1228)
# What does this PR do?

When building providers in a virtual environment or containers, special
pip dependencies may not always be provided (e.g., for Ollama). The
check should only fail if the required number of arguments is missing.
Currently, two arguments are mandatory:

1. Environment name
2. Pip dependencies

Additionally, return statements were replaced with sys.exit(1) in error
conditions to ensure immediate termination on critical failures. Error
handling in the stack build process was also improved to guarantee the
program exits with status 1 when facing configuration issues or build
failures.
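
A simplified sketch of the exit-handling change (not the actual build code; names are illustrative): fail fast with status 1 instead of returning and letting the caller continue.

```python
import sys
from typing import Optional


def check_build_args(env_name: str, pip_deps: str, special_pip_deps: Optional[str] = None) -> None:
    # Only the first two arguments are mandatory; special pip deps may be absent (e.g. Ollama).
    if not env_name or not pip_deps:
        print("Usage: build <env_name> <pip_dependencies> [<special_pip_deps>]", file=sys.stderr)
        sys.exit(1)  # previously a bare return, which did not stop the build
```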

Signed-off-by: Sébastien Han <seb@redhat.com>


## Test Plan

This command shouldn't fail:

```
llama stack build --template ollama --image-type venv
```

2025-02-24 13:18:52 -05:00
Ashwin Bharambe
e8e8fe7c93 fix: add LLAMA_STACK_CLIENT_DIR mount when installing in docker from source 2025-02-24 10:00:57 -08:00
Ashwin Bharambe
641549c631 Add llama stack client overrides also; necessary for correct docker building 2025-02-24 07:51:11 -08:00
Ashwin Bharambe
0973d386e6 fix: update build_container.sh to ensure llama-models is installed first 2025-02-23 21:47:26 -08:00
Charlie Doern
34e3faa4e8
feat: add --run to llama stack build (#1156)
# What does this PR do?

`--run` runs the stack that was just built, using the same arguments from the build process (image name, type, etc.).

This simplifies the workflow a lot and makes the UX better for most local users trying to get started, since they no longer have to match the flags of the two commands (build and then run).

Also, moved `ImageType` to distribution.utils since there were circular
import errors with its old location

## Test Plan

tested locally using the following command: 

`llama stack build --run --template ollama --image-type venv`

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-02-23 22:06:09 -05:00
Ashwin Bharambe
6227e1e3b9
fix: update virtualenv building so llamastack- prefix is not added, make notebook experience easier (#1225)
Make sure venv behaves like conda (no prefix is added to image_name) and
`--image-type venv` inside a notebook "just works" without any fiddling
2025-02-23 16:57:11 -08:00
Reid
187524d4ae
feat: add substring search for model list (#1099)
# What does this PR do?

`llama model list` and `llama model list --show-all` can list many or all models, so add a `--search` option to narrow the output.
```
$ llama model list --help
usage: llama model list [-h] [--show-all] [-s SEARCH]

Show available llama models

options:
  -h, --help            show this help message and exit
  --show-all            Show all models (not just defaults)
  -s SEARCH, --search SEARCH
                        Search for the input string as a substring in the model descriptor(ID)

$ llama model list -s 70b
+-----------------------+-----------------------------------+----------------+
| Model Descriptor(ID)  | Hugging Face Repo                 | Context Length |
+-----------------------+-----------------------------------+----------------+
| Llama3.1-70B          | meta-llama/Llama-3.1-70B          | 128K           |
+-----------------------+-----------------------------------+----------------+
| Llama3.1-70B-Instruct | meta-llama/Llama-3.1-70B-Instruct | 128K           |
+-----------------------+-----------------------------------+----------------+
| Llama3.3-70B-Instruct | meta-llama/Llama-3.3-70B-Instruct | 128K           |
+-----------------------+-----------------------------------+----------------+

$ llama model list -s 3.1-8b
+----------------------+----------------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo                | Context Length |
+----------------------+----------------------------------+----------------+
| Llama3.1-8B          | meta-llama/Llama-3.1-8B          | 128K           |
+----------------------+----------------------------------+----------------+
| Llama3.1-8B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 128K           |
+----------------------+----------------------------------+----------------+

$ llama model list --show-all -s pro
+----------------------+-----------------------------+----------------+
| Model Descriptor(ID) | Hugging Face Repo           | Context Length |
+----------------------+-----------------------------+----------------+
| Prompt-Guard-86M     | meta-llama/Prompt-Guard-86M | 2K             |
+----------------------+-----------------------------+----------------+

$ llama model list -s k
Not found for search.
```
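
A hedged sketch of the filtering logic (the actual CLI wiring differs): keep rows whose model descriptor contains the search string, case-insensitively.

```python
rows = [
    ("Llama3.1-70B", "meta-llama/Llama-3.1-70B", "128K"),
    ("Llama3.1-8B-Instruct", "meta-llama/Llama-3.1-8B-Instruct", "128K"),
    ("Prompt-Guard-86M", "meta-llama/Prompt-Guard-86M", "2K"),
]


def search_models(rows, needle):
    matches = [row for row in rows if needle.lower() in row[0].lower()]
    if not matches:
        print("Not found for search.")
    return matches


print(search_models(rows, "70b"))   # -> the Llama3.1-70B row
```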


Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 16:38:10 -08:00
Ashwin Bharambe
45ffe87d7c Kill noise from test output 2025-02-21 15:37:23 -08:00
Ashwin Bharambe
e7d261ef4a Fix test infra, sentence embeddings mixin 2025-02-21 15:11:46 -08:00
Ashwin Bharambe
ab54b8cd58
feat(providers): support non-llama models for inference providers (#1200)
This PR begins the process of supporting non-llama models within Llama
Stack. We start simple by adding support for this functionality within a
few existing providers: fireworks, together and ollama.

## Test Plan

```bash
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/inference/test_text_inference.py \
  --inference-model accounts/fireworks/models/phi-3-vision-128k-instruct
```

^ This passes most of the tests but, as expected, fails the tool-calling related tests since they are very specific to Llama models

```
inference/test_text_inference.py::test_text_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_completion_log_probs_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct-completion-01] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet do humans live on?-Earth] PASSED
inference/test_text_inference.py::test_text_chat_completion_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-Which planet has rings around it with a name starting w
ith letter S?-Saturn] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What's the name of the Sun in latin?-Sol] PASSED
inference/test_text_inference.py::test_text_chat_completion_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct-What is the name of the US captial?-Washington] PASSED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[accounts/fireworks/models/phi-3-vision-128k-instruct] FAILED
inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[accounts/fireworks/models/phi-3-vision-128k-instruct] PASSED
inference/test_text_inference.py::test_text_chat_completion_structured_output[accounts/fireworks/models/phi-3-vision-128k-instruct] ERROR
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-True] PASSED
inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[accounts/fireworks/models/phi-3-vision-128k-instruct-False] PASSED
```
2025-02-21 13:21:28 -08:00
Sébastien Han
9bbe34694d
ci: add mypy for static type checking (#1101)
# What does this PR do?

- Enable mypy to run in the CI on a subset of the repository
- Fix a few mypy errors
- Run mypy from pre-commit

Signed-off-by: Sébastien Han <seb@redhat.com>
 
2025-02-21 13:15:40 -08:00
ehhuang
25fddccfd8
feat: tool outputs metadata (#1155)
Summary:

Allows tools to output metadata. This is useful for evaluating tool
outputs, e.g. RAG tool will output document IDs, which can be used to
score recall.

Will need to make a similar change on the client side to support
ClientTool outputting metadata.
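
An illustrative sketch only (field names are assumptions, not the API's exact schema): a RAG-style tool returning metadata such as document IDs alongside its content.

```python
def rag_tool(query: str) -> dict:
    # Pretend retrieval results: (document_id, text) pairs.
    retrieved = [("doc-17", "First relevant chunk."), ("doc-42", "Second relevant chunk.")]
    return {
        "content": "\n".join(text for _, text in retrieved),
        "metadata": {"document_ids": [doc_id for doc_id, _ in retrieved]},
    }


result = rag_tool("what is llama stack?")
print(result["metadata"]["document_ids"])   # usable later to score recall
```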

Test Plan:

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/client-sdk/agents/test_agents.py
2025-02-21 13:15:31 -08:00
Ashwin Bharambe
36162c8c82 fix(ollama): register model with the helper first so it gets normalized 2025-02-21 12:51:38 -08:00
Xi Yan
0fe071764f
feat(1/n): api: unify agents for handling server & client tools (#1178)
# Problem

Our current Agent framework has discrepancies in definition on how we
handle server side and client side tools.

1. Server tools: a single Turn, including `ToolExecutionStep`, is returned by the agents API
2. Client tools: `create_agent_turn` is called in a loop, with the client agent lib yielding the agent chunks

ad6ffc63df/src/llama_stack_client/lib/agents/agent.py (L186-L211)

This makes working with server and client tools inconsistent. It also complicates telemetry logging, making it harder to get information about an agent's turns and history for observability.

#### Principle
The same `turn_id` should be used to represent the steps required to
complete a user message including client tools.

## Solution

1. `AgentTurnResponseEventType.turn_awaiting_input` status to indicate
that the current turn is not completed, and awaiting tool input
2. `continue_agent_turn` endpoint to update agent turn with client's
tool response.


# What does this PR do?
- Skeleton API as example

## Test Plan

- Just API update, no functionality change
```
llama stack run + client-sdk test
```

<img width="842" alt="image"
src="https://github.com/user-attachments/assets/7ac56b5f-f424-4632-9476-7e0f57555bc3"
/>


2025-02-21 11:48:27 -08:00
Ashwin Bharambe
992f865b2e
chore: move embedding deps to RAG tool where they are needed (#1210)
`EMBEDDING_DEPS` were wrongly associated with `vector_io` providers. They are needed by
https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/utils/memory/vector_store.py#L142
and related code, which is used by the RAG tool, so they should only be needed by the `inline::rag-runtime` provider.
2025-02-21 11:33:41 -08:00
Ashwin Bharambe
11697f85c5
fix: pull ollama embedding model if necessary (#1209)
Embedding models are tiny and can be pulled on-demand. Let's do that so
the user doesn't have to do "yet another thing" to get themselves set
up.

Thanks @hardikjshah for the suggestion.
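
A hedged sketch of the on-demand pull (via the `ollama` CLI rather than the provider's internal client, so this is illustrative, not the provider's actual code):

```python
import subprocess

EMBEDDING_MODEL = "all-minilm:latest"


def ensure_embedding_model() -> None:
    """Pull the embedding model only if `ollama list` doesn't already show it."""
    listed = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True)
    if EMBEDDING_MODEL.split(":")[0] not in listed.stdout:
        print(f"Pulling embedding model `{EMBEDDING_MODEL}`")
        subprocess.run(["ollama", "pull", EMBEDDING_MODEL], check=True)
```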

Also fixed a build dependency miss (TODO: distro_codegen needs to
actually check that the build template contains all providers mentioned
for the run.yaml file)

## Test Plan 

First run `ollama rm all-minilm:latest`. 

Run `llama stack build --template ollama && llama stack run ollama --env
INFERENCE_MODEL=llama3.2:3b-instruct-fp16`. See that it outputs a
"Pulling embedding model `all-minilm:latest`" output and the stack
starts up correctly. Verify that `ollama list` shows the model is
correctly downloaded.
2025-02-21 10:35:56 -08:00
Jamie Land
840fae2259
fix: Updating images so that they are able to run without root access (#1208)
# What does this PR do?
Addresses issues where the container is unable to run without root access. Gives write access to the required folders.

Closes #1207

## Test Plan
I built locally and ran `llama stack build --template remote-vllm
--image-type container` and validated I could see my changes in the
output:

```
#11 1.186 Installed 11 packages in 61ms
#11 1.186  + llama-models==0.1.3
#11 1.186  + llama-stack==0.1.3
#11 1.186  + llama-stack-client==0.1.3
#11 1.186  + markdown-it-py==3.0.0
#11 1.186  + mdurl==0.1.2
#11 1.186  + prompt-toolkit==3.0.50
#11 1.186  + pyaml==25.1.0
#11 1.186  + pygments==2.19.1
#11 1.186  + rich==13.9.4
#11 1.186  + tiktoken==0.9.0
#11 1.186  + wcwidth==0.2.13
#11 DONE 1.6s

#12 [ 9/10] RUN mkdir -p /.llama /.cache
#12 DONE 0.3s

#13 [10/10] RUN chmod -R g+rw /app /.llama /.cache
#13 DONE 0.3s

#14 exporting to image
#14 exporting layers
#14 exporting layers 3.7s done
#14 writing image sha256:11cc8bd954db6d036037bcaf471b173ddd5261ac4b1e72074cccf85d18aefb96 done
#14 naming to docker.io/library/distribution-remote-vllm:0.1.3 done
#14 DONE 3.7s
+ set +x
Success!
```
This is what the resulting image looks like:


![image](https://github.com/user-attachments/assets/070b9c05-b40f-4e7e-aa24-fef260c395e3)

Also tagged the image as `0.1.3-test` and [pushed to
quay](https://quay.io/repository/jland/distribution-remote-vllm?tab=tags)
(note there are a bunch of critical vulnerabilities we may want to look
into)

And for good measure I deployed the resulting image on my Openshift
environment using the default Security Context and validated that there
were no issue with it coming up.

My validation was all done with the `remote-vllm` distribution, but if I am understanding everything correctly, the other distributions are just different run.yaml configs.


Please let me know if there is anything else I need to do.

Co-authored-by: Jamie Land <hokie10@gmail.com>
2025-02-21 11:32:56 -05:00
Reid
9898589f12
fix: convert back to model descriptor for model in list --downloaded (#1201)
# What does this PR do?

Currently, `model` in `--downloaded` just shows the directory name (with `:` already replaced), so convert it back to the model descriptor to match `llama model list`; the remove command also uses the descriptor.
```
before:
$ llama model list --downloaded
+-------------------------------------+----------+---------------------+
| Model                               | Size     | Modified Time       |
+-------------------------------------+----------+---------------------+
| Llama3.2-1B-Instruct-int4-qlora-eo8 | 1.53 GB  | 2025-02-20 16:32:49 |
+-------------------------------------+----------+---------------------+

after:
$ llama model list --downloaded
+-------------------------------------+----------+---------------------+
| Model                               | Size     | Modified Time       |
+-------------------------------------+----------+---------------------+
| Llama3.2-1B-Instruct:int4-qlora-eo8 | 1.53 GB  | 2025-02-20 16:32:49 |
+-------------------------------------+----------+---------------------+
```


Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 08:10:34 -08:00
Rashmi Pawar
da9f0b7869
test(client-sdk): Update embedding test types to use latest imports (#1203)
# What does this PR do?
- Updates ImageContentItemImageURL import
- fixes `embedding_dimensions` metadata param

## Test Plan
- Ran pytest locally, verified embedding tests pass with new types

![Screenshot 2025-02-21 at 6 54
27 PM](https://github.com/user-attachments/assets/f80e3785-04c3-415e-9276-88aa8136bf00)

cc: @dglogo @sumitb
2025-02-21 08:09:17 -08:00
Reid
d2701b0d6a
chore: remove configure subcommand (#1202)
# What does this PR do?

When I tried to use `configure`, I found it is `DEPRECATED`, and found PR https://github.com/meta-llama/llama-stack/pull/371 which deprecated it, so I'm not sure why `configure.py` wasn't removed.
```
$ llama stack configure /tmp/test.yaml
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config
llama stack configure: error:
    DEPRECATED! llama stack configure has been deprecated.
    Please use llama stack run <path/to/run.yaml> instead.
    Please see example run.yaml in /distributions folder.
```

It would be better to tell the user this when they first check how to use it with `--help`:

```
before:
$ llama stack configure --help
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config

Configure a llama stack distribution

positional arguments:

after:
$ llama stack configure --help
usage: llama stack configure [-h] [--output-dir OUTPUT_DIR] config

Configure a llama stack distribution

    DEPRECATED! llama stack configure has been deprecated.
    Please use llama stack run <path/to/run.yaml> instead.
    Please see example run.yaml in /distributions folder.
```


---------

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 08:06:25 -08:00
Reid
c9c4a3c921
feat: model remove cmd (#1128)
# What does this PR do?

Add a subcommand to help clean up unneeded models:
```
$ llama model --help
usage: llama model [-h] {download,list,prompt-format,describe,verify-download,remove} ...

Work with llama models

options:
  -h, --help            show this help message and exit

$ llama model remove --help
usage: llama model remove [-h] -m MODEL [-f]

Remove the downloaded llama model

options:
  -h, --help            show this help message and exit
  -m MODEL, --model MODEL
                        Specify the llama downloaded model name
  -f, --force           Used to forcefully remove the llama model from the storage without further confirmation

$ llama model remove -m Llama3.2-1B-Instruct:int4-qlora-eo8
Are you sure you want to remove Llama3.2-1B-Instruct:int4-qlora-eo8? (y/n): n
Removal aborted.

$ llama model remove -m Llama3.2-1B-Instruct:int4-qlora-eo8 -f
Llama3.2-1B-Instruct:int4-qlora-eo8 removed.
```


---------

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-21 08:05:12 -08:00
Ashwin Bharambe
81ce39a607
feat(api): Add options for supporting various embedding models (#1192)
We need to support:
- asymmetric embedding models (#934)
- truncation policies (#933)
- varying dimensional output (#932) 

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```
2025-02-20 22:27:12 -08:00
Ashwin Bharambe
6f9d622340
fix(api): update embeddings signature so inputs and outputs list align (#1161)
See Issue #922 

The change is slightly backwards incompatible, but no callsite (in our client codebases or stack-apps) ever passes a depth-2 `List[List[InterleavedContentItem]]` (which is now disallowed).
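
A hedged illustration of the aligned signature using the Python client SDK (the model id and base URL are placeholders): `contents` is a flat list and `response.embeddings[i]` corresponds to `contents[i]`.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

contents = ["first sentence", "second sentence"]   # depth-1 list of items
response = client.inference.embeddings(
    model_id="all-minilm:latest",                   # placeholder embedding model
    contents=contents,
)
assert len(response.embeddings) == len(contents)    # outputs align with inputs
```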

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

Also ran `tests/client-sdk/inference/test_embeddings.py`
2025-02-20 21:43:13 -08:00
ehhuang
cfa752fc92
fix: pass tool_prompt_format to chat_formatter (#1198)
Summary:

Need this to format the completion message with tool_calls correctly.
See added unittest.

Test Plan:

python -m unittest
llama_stack.providers.tests.inference.test_prompt_adapter
2025-02-20 21:38:35 -08:00
Ashwin Bharambe
dd43494847 Fix inference test fixture 2025-02-20 21:24:49 -08:00
Ben Browning
6820718b71
fix: BuiltinTool JSON serialization in remote vLLM provider (#1183)
# What does this PR do?

The `tool_name` attribute of `ToolDefinition` instances can either be a
str or a BuiltinTool enum type. This fixes the remote vLLM provider to
use the value of those BuiltinTool enums when serializing to JSON
instead of attempting to serialize the actual enum to JSON.

Reference of how this is handled in some other areas, since I followed
that same pattern for the remote vLLM provider here:
- [remote nvidia
provider](https://github.com/meta-llama/llama-stack/blob/v0.1.3/llama_stack/providers/remote/inference/nvidia/openai_utils.py#L137-L140)
- [meta reference
provider](https://github.com/meta-llama/llama-stack/blob/v0.1.3/llama_stack/providers/inline/agents/meta_reference/agent_instance.py#L635-L636)
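
A hedged sketch of that pattern (local stand-ins for the real types, so the JSON handling is visible in isolation):

```python
import json
from enum import Enum


class BuiltinTool(Enum):
    # Stand-in for the real BuiltinTool enum used by tool definitions.
    brave_search = "brave_search"
    code_interpreter = "code_interpreter"


def tool_name_for_json(tool_name) -> str:
    # Use the enum's value when the tool name is a BuiltinTool; plain strings pass through.
    return tool_name.value if isinstance(tool_name, BuiltinTool) else tool_name


print(json.dumps({"name": tool_name_for_json(BuiltinTool.brave_search)}))  # {"name": "brave_search"}
```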

There is an opportunity to potentially reconcile the remote nvidia and remote vllm bits, since they both translate Llama Stack Inference APIs to OpenAI client requests, but that's a can of worms I didn't want to open for this bug fix.

This explicitly fixes this error when using the remote vLLM provider and
the agent tests:

```
TypeError: Object of type BuiltinTool is not JSON serializable
```

So, this is related to #1144 and addresses the immediate issue raised
there. With this fix,
`tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search`
now gets past the JSON serialization error when using the remote vLLM
provider and actually attempts to call the web search tool. I don't have
any API keys setup for the actual web search providers yet, so I cannot
verify everything works after that point.

## Test Plan

I ran the `test_builtin_tool_web_search` locally with the remote vLLM
provider like:
```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" LLAMA_STACK_CONFIG=remote-vllm python -m pytest -v tests/client-sdk/agents/test_agents.py::test_builtin_tool_web_search --inference-model "meta-llama/Llama-3.2-3B-Instruct"
```

Before my change, that reproduced the `TypeError: Object of type
BuiltinTool is not JSON serializable` error. After my change, that error
is gone and the test actually attempts the web search. That failed for
me locally, due to lack of API key, but it gets past the JSON
serialization error.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-02-20 21:18:37 -08:00
Ashwin Bharambe
35ae0e16a1 Fix sqlite_vec config defaults 2025-02-20 17:50:33 -08:00
Matthew Farrellee
832c535aaf
feat(providers): add NVIDIA Inference embedding provider and tests (#935)
# What does this PR do?

add /v1/inference/embeddings implementation to NVIDIA provider

**open topics** -
- *asymmetric models*. NeMo Retriever includes asymmetric models, which
are models that embed differently depending on if the input is destined
for storage or lookup against storage. the /v1/inference/embeddings api
does not allow the user to indicate the type of embedding to perform.
see https://github.com/meta-llama/llama-stack/issues/934
- *truncation*. embedding models typically have a limited context
window, e.g. 1024 tokens is common though newer models have 8k windows.
when the input is larger than this window the endpoint cannot perform
its designed function. two options: 0. return an error so the user can
reduce the input size and retry; 1. perform truncation for the user and
proceed (common strategies are left or right truncation). many users
encounter context window size limits and will struggle to write reliable
programs. this struggle is especially acute without access to the
model's tokenizer. the /v1/inference/embeddings api does not allow the
user to delegate truncation policy. see
https://github.com/meta-llama/llama-stack/issues/933
- *dimensions*. "Matryoshka" embedding models are available. they allow
users to control the number of embedding dimensions the model produces.
this is a critical feature for managing storage constraints. embeddings
of 1024 dimensions that achieve 95% recall for an application may not be worth the storage cost if 512 dimensions can achieve 93% recall.
controlling embedding dimensions allows applications to determine their
recall and storage tradeoffs. the /v1/inference/embeddings api does not
allow the user to control the output dimensions. see
https://github.com/meta-llama/llama-stack/issues/932

## Test Plan

- `llama stack run llama_stack/templates/nvidia/run.yaml`
- `LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v
tests/client-sdk/inference/test_embedding.py --embedding-model
baai/bge-m3`


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-02-20 16:59:48 -08:00
Ashwin Bharambe
2608b6074f Update embedding dimension singular 2025-02-20 16:14:46 -08:00
Ashwin Bharambe
9436dd570d
feat: register embedding models for ollama, together, fireworks (#1190)
# What does this PR do?

We have support for embeddings in our Inference providers, but so far we
haven't done the final step of actually registering the known embedding
models and making sure they are extremely easy to use. This is one step
towards that.

## Test Plan

Run existing inference tests.

```bash

$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

The value of EMBEDDING_DIMENSION isn't actually used in these tests; it is merely used by the test fixtures to check whether the model is an LLM or an embedding model.
2025-02-20 15:39:08 -08:00
Ashwin Bharambe
736560ceba Remove os.getenv() from ollama config 2025-02-20 14:30:32 -08:00
LESSuseLESS
2cbe9395b0
feat: D69478008 [llama-stack] turning tests into data-driven (#1180)
# What does this PR do?

We have several places running tests for different purposes.
- oss llama stack
  - provider tests
  - e2e tests
- provider llama stack
  - unit tests
  - e2e tests

It would be nice if they could *share the same set of test data*, so we maintain consistency between spec and implementation. That is what this diff is about: isolating test data from test code, so the same data can be reused in different places with different test code.
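
A hedged sketch of the idea (file layout and helper names are assumptions, not the repo's actual structure): test cases live in a shared JSON file and both the provider tests and the e2e tests load them.

```python
import json
import pathlib


def load_test_cases(name: str) -> list:
    """Load shared, code-free test data that any test suite can parametrize over."""
    path = pathlib.Path("tests") / "test_cases" / f"{name}.json"
    return json.loads(path.read_text())


# In either suite, for example:
# @pytest.mark.parametrize("case", load_test_cases("chat_completion"))
# def test_chat_completion(case): ...
```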

## Test Plan

== Set up Ollama local server  
==  Run a provider test
conda activate stack

OLLAMA_URL="http://localhost:8321" \
pytest -v -s -k "ollama" --inference-model="llama3.2:3b-instruct-fp16" \

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output
// test_structured_output should also work

== Run an e2e test
conda activate sherpa
with-proxy pip install llama-stack
export INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=8322
with-proxy llama stack build --template ollama
with-proxy llama stack run --env OLLAMA_URL=http://localhost:8321 ollama
  - Run test client,
LLAMA_STACK_PORT=8322 LLAMA_STACK_BASE_URL="http://localhost:8322" \
pytest -v -s --inference-model="llama3.2:3b-instruct-fp16" \

tests/client-sdk/inference/test_text_inference.py::test_text_completion_structured_output
// test_text_chat_completion_structured_output should also work

## Notes

- This PR was automatically generated by oss_sync
- Please refer to D69478008 for more details.
2025-02-20 14:13:06 -08:00
ehhuang
1166afdf76
fix: some telemetry APIs don't currently work (#1188)
Summary:

This bug is surfaced by using the HTTP LS client. The issue is that non-scalar values on 'GET' methods become `body` params in FastAPI, but our spec generation script doesn't respect that. We fix this by making those endpoints POST methods instead.
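
A toy FastAPI illustration of the underlying issue (not llama-stack code): a Pydantic-model parameter is read from the request body, which most HTTP clients will not send on a GET, so the route is switched to POST.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class QueryCondition(BaseModel):
    key: str
    op: str
    value: str


@app.post("/v1/telemetry/query")   # conceptually a "get", but the non-scalar param forces a body
def query_traces(condition: QueryCondition):
    return {"condition": condition.model_dump()}
```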

Test Plan:
Test API call with newly sync'd client
(https://github.com/meta-llama/llama-stack-client-python/pull/149)

<img width="1114" alt="image"
src="https://github.com/user-attachments/assets/7710aca5-d163-4e00-a465-14e6fcaac2b2"
/>
2025-02-20 14:09:25 -08:00
Xi Yan
ea1faae50e
chore!: deprecate eval/tasks (#1186)
# What does this PR do?
- Fully deprecate eval/tasks

Closes #1088 

NOTE: this will be a breaking change. We introduced the new API in 0.1.3.

Notebook has been updated to use the new endpoints.

## Test Plan
```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb 
```
<img width="611" alt="image"
src="https://github.com/user-attachments/assets/79f6efe1-81ba-494e-bf36-1fc0c2b9bc6f"
/>



cc @SLR722  for awareness

2025-02-20 14:06:21 -08:00
Ashwin Bharambe
07ccf908f7 ModelAlias -> ProviderModelEntry 2025-02-20 14:02:36 -08:00
Vladimir Ivić
f7161611c6
feat: adding endpoints for files and uploads (#1070)
Summary:
Adds spec definitions for file uploads operations.

This API focuses around two high level operations:
* Initiating and managing upload session
* Accessing uploaded file information

Usage examples:

To start a file upload session:
```
curl -X POST https://localhost:8321/v1/files \
-d '{
   "key": "image123.jpg',
   "bucket": "images",
   "mime_type": "image/jpg",
   "size": 12345
}'

# Returns
{
  "id": <session_id>,
  "url": "https://localhost:8321/v1/files/session:<session_id>",
  "offset": 0,
  "size": 12345
}

```

To upload file content to an existing session
```
curl -i -X POST "https://localhost:8321/v1/files/session:<session_id> \
  --data-binary @<path_to_local_file>

# Returns
{
  "key": "image123.jpg",
  "bucket": "images",
  "mime_type": "image/jpg",
  "bytes": 12345,
  "created_at": 1737492240
}

# Implementing on server side (Flask example for simplicity;
# assumes `from flask import request` and an `app = Flask(__name__)` instance):
@app.route('/uploads/<upload_id>', methods=['POST'])
def upload_content_to_session(upload_id):
    try:
        # Get the binary file data from the request body
        file_data = request.data

        # Save the file to disk
        save_path = f"./uploads/{upload_id}"
        with open(save_path, 'wb') as f:
            f.write(file_data)
        return {__uploaded_file_json__}, 200
    except Exception:
        return "", 500

```

To read information about an existing upload session
```
curl -i -X GET "https://localhost:8321/v1/files/session:<session_id>

# Returns
{
  "id": <session_id>,
  "url": "https://localhost:8321/v1/files/session:<session_id>",
  "offset": 1024,
  "size": 12345
}
```

To list buckets
```
GET /files

# Returns
{
  "data": [
     {"name": "bucket1"},
     {"name": "bucket2"},
   ]
}
```

To list all files in a bucket
```
GET /files/{bucket}

# Returns
{
  "data": [
    {
      "key": "shiba.jpg",
      "bucket": "dogs",
      "mime_type": "image/jpg",
      "bytes": 82334,
      "created_at": 1737492240,
    },
    {
      "key": "persian_cat.jpg",
      "mime_type": "image/jpg",
      "bucket": "cats",
      "bytes": 39924,
      "created_at": 1727493440,
    },
  ]
}
```

To get specific file info
```
GET /files/{bucket}/{key}

{
  "key": "shiba.jpg",
  "bucket": "dogs",
  "mime_type": "image/jpg",
  "bytes": 82334,
  "created_at": 1737492240,
}

```

To delete specific file
```
DELETE /files/{bucket}/{key}

{
  "key": "shiba.jpg",
  "bucket": "dogs",
  "mime_type": "image/jpg",
  "bytes": 82334,
  "created_at": 1737492240,
}

```
2025-02-20 13:09:00 -08:00
Ashwin Bharambe
eddef0b2ae
chore: slight renaming of model alias stuff (#1181)
Quick test by running:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk
```
2025-02-20 11:48:46 -08:00