Commit graph

1500 commits

Author SHA1 Message Date
Fred Reiss
a8d0cdaf37
feat: updated inline vllm inference provider (#880)
# What does this PR do?

This PR updates the inline vLLM inference provider in several
significant ways:
* Models are now attached at run time to instances of the provider via
the `.../models` API instead of hard-coding the model's full name into
the provider's YAML configuration.
* The provider supports models that are not Meta Llama models. Any model
that vLLM supports can be loaded by passing Hugging Face coordinates in
the "provider_model_id" field. Custom fine-tuned versions of Meta Llama
models can be loaded by specifying a path on local disk in the
"provider_model_id" field (see the registration sketch after this list).
* To implement full chat completions support, including tool calling and
constrained decoding, the provider now routes the `chat_completions` API
to a captive (i.e., called directly in-process, not via HTTPS) instance
of vLLM's OpenAI-compatible server.
* The `logprobs` parameter and completions API are also working.
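
A rough sketch of the run-time registration flow described above, using the
llama-stack client; the model name, provider id, and path below are
illustrative placeholders, and the exact keyword arguments may differ across
client versions.

```python
# Hedged sketch: attach a model to the inline vLLM provider at run time via the
# models API instead of hard-coding it in the provider's YAML configuration.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

client.models.register(
    model_id="my-custom-llama",                  # name the stack serves it under
    provider_id="vllm",                          # inline vLLM provider instance (placeholder)
    provider_model_id="/models/my-custom-llama", # Hugging Face coordinates or a local path
)
```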

## Test Plan

Existing tests in
`llama_stack/providers/tests/inference/test_text_inference.py` have good
coverage of the new functionality. These tests can be invoked as
follows:

```
cd llama-stack && pytest \
    -vvv \
    llama_stack/providers/tests/inference/test_text_inference.py \
    --providers inference=vllm \
    --inference-model meta-llama/Llama-3.2-3B-Instruct
====================================== test session starts ======================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items                                                                               

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%]

=========================== 9 passed, 13 warnings in 97.18s (0:01:37) ===========================

```

## Sources


## Before submitting

- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.

---------

Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-07 13:38:23 -08:00
ehhuang
acbae66b9d
chore: escape tool output for logging (#1490)
Summary:

error:


llama_stack/providers/inline/agents/meta_reference/agent_instance.py:1032: in execute_tool_call_maybe
    logger.info(f"tool call {name} completed with result: {result}")

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1841: in info
    self.log(INFO, msg, *args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1879: in log
    self.logger.log(level, msg, *args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1547: in log
    self._log(level, msg, args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1624: in _log
    self.handle(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1634: in handle
    self.callHandlers(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1696: in callHandlers
    hdlr.handle(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:968: in handle
    self.emit(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:167: in emit
    message_renderable = self.render_message(record, message)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:193: in render_message
    message_text = Text.from_markup(message) if use_markup else Text(message)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py:287: in from_markup
    rendered_text = render(text, style, emoji=emoji, emoji_variant=emoji_variant)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py:167: in render
    raise MarkupError(
E rich.errors.MarkupError: closing tag '[/INST]' at position 3274 doesn't match any open tag
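
One way to address this is to escape the tool output before it reaches the
Rich log handler; the sketch below uses `rich.markup.escape`, which is a
standard helper for this, though the PR's exact call site and approach may
differ.

```python
# Hedged sketch: escape markup-like tool output so Rich does not try to parse
# tags such as "[/INST]" when rendering the log message.
import logging

from rich.logging import RichHandler
from rich.markup import escape

logging.basicConfig(level=logging.INFO, handlers=[RichHandler(markup=True)])
logger = logging.getLogger(__name__)

name = "web_search"                      # illustrative tool name
result = "<s>[INST] query [/INST] ..."   # tool output containing markup-like tags
logger.info(f"tool call {name} completed with result: {escape(result)}")
```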

Test Plan:
2025-03-07 13:33:45 -08:00
Xi Yan
a55aab5958
fix: fix scoring tests (#1487)
# What does this PR do?
- fix scoring test

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring/test_scoring.py --text-model meta-llama/Llama-3.3-70B-Instruct --judge-model meta-llama/Llama-3.3-70B-Instruct
```

<img width="1061" alt="image"
src="https://github.com/user-attachments/assets/740f9e6e-a654-4265-9db1-61481515a852"
/>


[//]: # (## Documentation)
2025-03-07 13:13:41 -08:00
Sébastien Han
e6355bfc3b
ci: enable Dependabot for GitHub Actions (#1470)
# What does this PR do?

Add a Dependabot configuration file (.github/dependabot.yml) to enable
automated dependency updates for GitHub Actions. This ensures workflows
stay up to date with the latest versions, improving security and
reliability.

Dependabot is configured to:
- Monitor GitHub Actions dependencies.
- Check for updates in the workflow directory.
- Run updates on a daily schedule.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-07 12:54:56 -08:00
Xi Yan
5a2b9e121c
fix: return result for together's get_params (#1484)
# What does this PR do?

- return results for together's get_params
- fix issue
<img width="1538" alt="image"
src="https://github.com/user-attachments/assets/c4cd3802-85ef-4ff3-b2fd-76737be2e4ff"
/>

- the `return params` was accidentally deleted in
https://github.com/meta-llama/llama-stack/pull/1362/files#diff-d9345410ea64589cee96487b22eab0d45f7497a80c25dca295cecd254decb204

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
npm test examples
```


[//]: # (## Documentation)
2025-03-07 12:52:26 -08:00
ehhuang
1257288361
build: add 'tiktoken' to deps (#1483)
Summary:

Test Plan:
2025-03-07 12:36:02 -08:00
ehhuang
124e8d7cfe
build: include .md (#1482)
Summary:

Test Plan:
2025-03-07 12:10:52 -08:00
Ben Browning
d86a893ead
fix: Swap to AsyncOpenAI client in remote vllm provider (#1459)
# What does this PR do?

This switches from the OpenAI client to the AsyncOpenAI client in the
remote vLLM provider. The main benefit is that instead of each client
call being a blocking operation that stalled our server's event loop,
the client calls are now async operations that do not block it.

The actual fix is simple and straightforward. Creating a reliable
reproducer with a unit test that verifies we were blocking the event
loop before and no longer are was a bit harder. Some other inference
providers have this same issue, so we may want to make that simple
delayed HTTP server a bit more generic and pull it into a common place
as other inference providers get fixed.
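
For illustration, the non-blocking call pattern looks roughly like this (a
sketch, not the provider's actual code; the base URL, API key, and model are
placeholders):

```python
# Hedged sketch: awaiting AsyncOpenAI keeps the server's event loop free, while
# the synchronous OpenAI client would block it for the duration of the HTTP call.
import asyncio

from openai import AsyncOpenAI

async def main() -> None:
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="fake")
    response = await client.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())
```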

(Closes #1457)

## Test Plan

I verified the unit tests and test_text_inference tests pass with this
change like below:

```
python -m pytest -v tests/unit
```

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v -s \
tests/integration/inference/test_text_inference.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-03-07 14:48:00 -05:00
ehhuang
256448c14e
fix(cli): llama model prompt-format (#1481)
Summary:

+ llama model prompt-format -m Llama3.2-11B-Vision-Instruct
Traceback (most recent call last):
  File "/tmp/tmp.gCwyyCcjoA/.venv/bin/llama", line 10, in <module>
    sys.exit(main())
  File "/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/llama.py", line 50, in main
    parser.run(args)
  File "/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/llama.py", line 44, in run
    args.func(args)
  File "/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/model/prompt_format.py", line 59, in _run_model_template_cmd
    if args.list:
AttributeError: 'Namespace' object has no attribute 'list'
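
One defensive way to avoid this class of failure is sketched below; this
illustrates guarding an optional argparse attribute in general and is not
necessarily the exact change made in this PR.

```python
# Hedged sketch: a subcommand handler that does not assume every attribute was
# registered on the Namespace. Argument names here are illustrative.
import argparse

def run_prompt_format(args: argparse.Namespace) -> None:
    # getattr with a default avoids AttributeError when --list was never added
    # to this subparser.
    if getattr(args, "list", False):
        print("listing available models...")
        return
    print(f"showing prompt format for {args.model_name}")

parser = argparse.ArgumentParser(prog="llama model prompt-format")
parser.add_argument("-m", "--model-name", required=True)
parser.set_defaults(func=run_prompt_format)

args = parser.parse_args(["-m", "Llama3.2-11B-Vision-Instruct"])
args.func(args)
```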

Test Plan:
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
2025-03-07 11:45:54 -08:00
Sébastien Han
ffa32af930
build: bump llama-stack-client version (#1469)
## What does this PR do?

Use 0.1.5.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-07 11:42:38 -08:00
Sébastien Han
7cf1e24c4e
feat(logging): implement category-based logging (#1362)
# What does this PR do?

This commit introduces a new logging system that allows loggers to be
assigned a category while retaining the logger name based on the file
name. The log format includes both the logger name and the category,
producing output like:

```
INFO     2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by
         tavily-search
```

Key features include:

- Category-based logging: loggers can be assigned a category (e.g.,
  "core", "server") when they are created, and can be obtained like this:
  `logger = get_logger(name=__name__, category="server")`
- Environment variable control: log levels can be configured per category
  using the `LLAMA_STACK_LOGGING` environment variable. For example,
  `LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for
  the "server" and "core" categories.
- `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all
  categories and third-party libraries.

This provides fine-grained control over logging levels while maintaining
a clean and informative log format.
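
A small usage sketch based on the description above; the import path of
`get_logger` is an assumption and may differ in the tree.

```python
# Hedged sketch of the category-based logger described above. The module path
# below is assumed, not confirmed by this commit message.
from llama_stack.log import get_logger

logger = get_logger(name=__name__, category="server")
logger.info("Tool_groups: builtin::websearch served by tavily-search")

# Raise verbosity per category at run time, e.g.:
#   LLAMA_STACK_LOGGING="server=DEBUG;core=debug" llama stack run ...
```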

The formatter uses the rich library, which provides nice colors and
better stack traces, like so:

```
ERROR    2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown
         task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at
         /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146>
         exception=UnboundLocalError("local variable 'loop' referenced before assignment")>
         ╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮
         │ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown                │
         │                                                                                                                │
         │   175 │   │   except asyncio.CancelledError:                                                                   │
         │   176 │   │   │   pass                                                                                         │
         │   177 │   │   finally:                                                                                         │
         │ ❱ 178 │   │   │   loop.stop()                                                                                  │
         │   179 │                                                                                                        │
         │   180 │   loop = asyncio.get_running_loop()                                                                    │
         │   181 │   loop.create_task(shutdown())                                                                         │
         ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
         UnboundLocalError: local variable 'loop' referenced before assignment
```

Co-authored-by: Ashwin Bharambe <@ashwinb>
Signed-off-by: Sébastien Han <seb@redhat.com>

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```
python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml
INFO     2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml           
INFO     2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration:                                                 
INFO     2025-03-03 21:55:35,928 __main__:380 [server]: apis:                                                              
         - agents                                                     
``` 
[//]: # (## Documentation)

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-07 11:34:30 -08:00
Sébastien Han
bad12ee21f
fix: remove ruff N999 (#1388)
# What does this PR do?

Since we moved tests/client-sdk to tests/api in
https://github.com/meta-llama/llama-stack/pull/1376, the N999 rule is no
longer needed. See also

abfbaf3c1b

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-07 11:14:04 -08:00
ehhuang
fbd47bb4b6
feat(agent): plain function as client tool (#1479)
Summary:
support added in
https://github.com/meta-llama/llama-stack-client-python/pull/187

Test Plan:

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-07 11:10:07 -08:00
Charlie Doern
1097912054
refactor: display defaults in help text (#1480)
# What does this PR do?

Using `formatter_class=argparse.ArgumentDefaultsHelpFormatter` displays
`(default: DEFAULT_VALUE)` for each flag. Add this formatter class to
`build` and `run` to show users the default values such as `conda`, `8321`,
etc. (see the sketch below).
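
For reference, wiring up the formatter looks like this (a minimal sketch; the
real parsers live in the `llama stack` CLI code and define many more flags):

```python
# Hedged sketch: ArgumentDefaultsHelpFormatter appends "(default: ...)" to each
# flag's help text automatically.
import argparse

parser = argparse.ArgumentParser(
    prog="llama stack run",
    description="Start the server for a Llama Stack Distribution.",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,
)
parser.add_argument(
    "--port",
    type=int,
    default=8321,
    help="Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT.",
)
parser.add_argument(
    "--image-type",
    choices=["conda", "container", "venv"],
    default="conda",
    help="Image Type used during the build.",
)
parser.print_help()  # shows "(default: 8321)" and "(default: conda)"
```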

## Test Plan

ran locally with following output: 

before: 
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
                       [--image-type {conda,container,venv}]
                       config

Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.

positional arguments:
  config                Path to config file to use for the run

options:
  -h, --help            show this help message and exit
  --port PORT           Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. Defaults to 8321
  --image-name IMAGE_NAME
                        Name of the image to run. Defaults to the current conda environment
  --disable-ipv6        Disable IPv6 support
  --env KEY=VALUE       Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times.
  --tls-keyfile TLS_KEYFILE
                        Path to TLS key file for HTTPS
  --tls-certfile TLS_CERTFILE
                        Path to TLS certificate file for HTTPS
  --image-type {conda,container,venv}
                        Image Type used during the build. This can be either conda or container or venv.
```

after:
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
                       [--image-type {conda,container,venv}]
                       config

Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.

positional arguments:
  config                Path to config file to use for the run

options:
  -h, --help            show this help message and exit
  --port PORT           Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. (default: 8321)
  --image-name IMAGE_NAME
                        Name of the image to run. Defaults to the current conda environment (default: None)
  --disable-ipv6        Disable IPv6 support (default: False)
  --env KEY=VALUE       Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: [])
  --tls-keyfile TLS_KEYFILE
                        Path to TLS key file for HTTPS (default: None)
  --tls-certfile TLS_CERTFILE
                        Path to TLS certificate file for HTTPS (default: None)
  --image-type {conda,container,venv}
                        Image Type used during the build. This can be either conda or container or venv. (default: conda)
```

[//]: # (## Documentation)

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-07 11:05:58 -08:00
Xi Yan
b8c519ba11
feat: rag eval lifecycle notebook (#1458)
# What does this PR do?

- Add RAG eval lifecycle notebook
- Closes https://github.com/meta-llama/llama-stack/issues/1113

- Best reviewed in
https://github.com/meta-llama/llama-stack/blob/rag_eval_notebook/docs/notebooks/Llama_Stack_RAG_Lifecycle.ipynb

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
Run notebook

[//]: # (## Documentation)
2025-03-07 10:41:50 -08:00
Ihar Hrachyshka
511afe1381
chore: add pytest-report.xml to gitignore (#1473)
# What does this PR do?

Ignores `pytest-report.xml`. The file is produced by the unit tests
github workflow.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Not needed.

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-07 10:41:22 -08:00
Reid
40cd48fa09
chore: remove the incorrect output (#1472)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

The client output has changed, so the output shown here is incorrect:

458e20702b/src/llama_stack_client/lib/cli/models/models.py (L52)

Per the previous discussion in
https://github.com/meta-llama/llama-stack/pull/1348#pullrequestreview-2654971315,
there is no need to maintain this output, so remove it.

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-07 10:39:33 -08:00
Yuan Tang
c4b229f2c9
chore: Delete unused .gitmodules (#1460)
This is no longer needed after
https://github.com/meta-llama/llama-stack/pull/1265.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-07 10:38:55 -08:00
Yuan Tang
649d9bc26d
fix(security): Bump jinja2 to >=3.1.6 (#1461)
This addresses the new vulnerability
https://github.com/advisories/GHSA-cpwx-vrp4-4pq7.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-07 10:38:39 -08:00
Botao Chen
4dccf916d1
feat: open benchmark template and doc (#1465)
## What does this PR do?
- Provide a distro template to let developers easily run the open
benchmarks llama stack supports on llama and non-llama models.
- Provide docs on how to run open benchmark evals via the CLI, plus an
open benchmark contributing guide.

[//]: # (If resolving an issue, uncomment and update the line below)
(Closes #1375 )

## Test Plan
Open benchmark eval results on llama, gpt, gemini, and claude:
<img width="771" alt="Screenshot 2025-03-06 at 7 33 05 PM"
src="https://github.com/user-attachments/assets/1bd85456-b9b9-4b37-af76-4ce1d2bac00e"
/>

doc preview
<img width="944" alt="Screenshot 2025-03-06 at 7 33 58 PM"
src="https://github.com/user-attachments/assets/f4e5866d-b395-4c40-aa8b-080edeb5cdb6"
/>
<img width="955" alt="Screenshot 2025-03-06 at 7 34 04 PM"
src="https://github.com/user-attachments/assets/629defb6-d5e4-473c-aa03-308bce386fb4"
/>

<img width="965" alt="Screenshot 2025-03-06 at 7 35 29 PM"
src="https://github.com/user-attachments/assets/c21ff96c-9e8c-4c54-b6b8-25883125f4cf"
/>

<img width="957" alt="Screenshot 2025-03-06 at 7 35 37 PM"
src="https://github.com/user-attachments/assets/47571c90-1381-4e2c-bbed-c4f3a60578d0"
/>
2025-03-07 10:37:55 -08:00
Ashwin Bharambe
290cc843fc
test: first unit test for resolver (#1475)
Starting to create unit tests to cover critical (and mostly
undocumented) provider resolution and routing logic.

## Test Plan

Unit tests
2025-03-07 10:20:51 -08:00
Dinesh Yeduguru
60e7f3d705
fix: Revert "feat: record token usage for inference API (#1300)" (#1476)
This reverts commit b8535417e0.

Test plan:
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run
~/.llama/distributions/together/together-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
2025-03-07 10:16:47 -08:00
Yuan Tang
df4fbae35c
ci: Add script to generate changelog (#1463) 2025-03-07 12:45:08 -05:00
Ashwin Bharambe
4d9fe25bbf fix: fetched latest pypi version when building documentation 2025-03-06 21:15:15 -08:00
Ashwin Bharambe
330cc9d09d
feat: add Milvus vectorDB (#1467)
# What does this PR do?
See https://github.com/meta-llama/llama-stack/pull/1171 which is the
original PR. Author: @zc277584121

feat: add [Milvus](https://milvus.io/) vectorDB

Note: I use MilvusClient to implement it instead of AsyncMilvusClient,
because when I tested AsyncMilvusClient it raised event-loop issues; I
think the AsyncMilvusClient SDK is not yet robust enough to be compatible
with the llama_stack framework.
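
For context, the synchronous client the provider builds on looks roughly like
this (a sketch of the pymilvus `MilvusClient` API; the collection name,
dimension, and vectors are placeholders):

```python
# Hedged sketch of the synchronous pymilvus MilvusClient that the provider wraps.
from pymilvus import MilvusClient

client = MilvusClient(uri="./milvus_demo.db")  # Milvus Lite file, or a server URI

client.create_collection(collection_name="docs", dimension=4)
client.insert(
    collection_name="docs",
    data=[{"id": 0, "vector": [0.1, 0.2, 0.3, 0.4], "text": "hello"}],
)
hits = client.search(collection_name="docs", data=[[0.1, 0.2, 0.3, 0.4]], limit=1)
print(hits)
```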

## Test Plan
Passed the unit tests and end-to-end tests.
Here are my end-to-end test logs, including the client code, client log,
and server logs from both inline and remote settings:

[test_end2end_logs.zip](https://github.com/user-attachments/files/18964391/test_end2end_logs.zip)

---------

Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Cheney Zhang <chen.zhang@zilliz.com>
2025-03-06 20:59:31 -08:00
Xi Yan
1e3be1e4d7
fix: fix agent test recorded responses (#1462)
# What does this PR do?

- re-gen to fix agents test
- update test_custom_tool

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.3-70B-Instruct
```

<img width="1294" alt="image"
src="https://github.com/user-attachments/assets/63521532-b989-4cf2-8fe5-c7f057f1c4dc"
/>


[//]: # (## Documentation)
2025-03-06 19:37:52 -08:00
Ihar Hrachyshka
8234cdf1a5
fix(deps): move chardet and pypdf imports inline where used (#1434)
# What does this PR do?

Fix import errors caused by `chardet` and `pypdf` not being installed
while being imported at module level from `url_utils.py`.

Closes #1432
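
The pattern is sketched below; the function name is illustrative, not the
exact helper in `url_utils.py`.

```python
# Hedged sketch of moving an optional dependency import inline, so importing
# the module succeeds even when pypdf is not installed.
from io import BytesIO

def extract_pdf_text(data: bytes) -> str:
    # Imported here rather than at module level, so only this code path
    # requires pypdf to be installed.
    from pypdf import PdfReader

    reader = PdfReader(BytesIO(data))
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```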

## Test Plan

Now able to run the server with the config.

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-06 17:09:14 -08:00
Sébastien Han
803bf0e029
fix: solve ruff B008 warnings (#1444)
# What does this PR do?

The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling `SamplingParams()` directly in function argument defaults.
Instead, it either uses `Field(default_factory=SamplingParams)` for
Pydantic models or sets the default to `None` and instantiates
`SamplingParams` inside the function body when the argument is `None`.
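
Both variants are sketched below with illustrative signatures, not the exact
functions changed in this PR.

```python
# Hedged sketch of the two B008-safe patterns described above. SamplingParams
# here is a stand-in model, not the real llama_stack class.
from pydantic import BaseModel, Field

class SamplingParams(BaseModel):
    temperature: float = 0.7

# Pydantic model field: use default_factory instead of SamplingParams() as the default.
class ChatRequest(BaseModel):
    sampling_params: SamplingParams = Field(default_factory=SamplingParams)

# Plain function: default to None and instantiate inside the body.
def chat_completion(prompt: str, sampling_params: SamplingParams | None = None) -> str:
    if sampling_params is None:
        sampling_params = SamplingParams()
    return f"{prompt} (temperature={sampling_params.temperature})"
```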

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-06 16:48:35 -08:00
Xi Yan
3a454be9b2
docs: add back eval concept doc (#1456)
# What does this PR do?

- add eval concept doc in Core Concept tab

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

<img width="1266" alt="image"
src="https://github.com/user-attachments/assets/8eb06a49-3c04-4899-805c-1b5349471f1f"
/>

cc @SLR722 

[//]: # (## Documentation)
2025-03-06 15:47:20 -08:00
ehhuang
ca2910d27a
docs: update test_agents to use new Agent SDK API (#1402)
# Summary:
new Agent SDK API is added in
https://github.com/meta-llama/llama-stack-client-python/pull/178

Update docs and test to reflect this.

Closes https://github.com/meta-llama/llama-stack/issues/1365

# Test Plan:
```bash
py.test -v -s --nbval-lax ./docs/getting_started.ipynb

LLAMA_STACK_CONFIG=fireworks \
   pytest -s -v tests/integration/agents/test_agents.py \
  --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct
```
2025-03-06 15:21:12 -08:00
ehhuang
3d71e5a036
test: recordable mocks use json only (#1443)
# Summary:
removes the use of pickle

# Test Plan:
Run the following with `--record-responses` first, then another time
without.

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-06 14:46:29 -08:00
Xi Yan
564977c646
docs: update eval doc (#1453)
# What does this PR do?
- Update eval doc to reflect latest changes
- Closes https://github.com/meta-llama/llama-stack/issues/1441

## Test Plan
read

[//]: # (## Documentation)
2025-03-06 14:14:10 -08:00
Reid
db4ee7a9ff
docs: improve rag doc (#1411)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-06 14:03:52 -08:00
Xi Yan
1a95271fab
fix: notebook vision inference (#1423)
# What does this PR do?

- update to use library client throughout 

cc @jeffxtang 

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
pytest -v -s --nbval-lax ./docs/getting_started.ipynb
```

[//]: # (## Documentation)
2025-03-06 13:40:21 -08:00
ehhuang
46bc5f4a7a
chore: log exception (#1452)
Summary:

Test Plan:
<img width="1236" alt="image"
src="https://github.com/user-attachments/assets/facc43ba-85ff-42e4-8e04-b7970c630c4d"
/>
2025-03-06 11:42:51 -08:00
Sébastien Han
4bbb4ddeae
fix: resolve pydantic warning on .dict() usage (#1445)
# What does this PR do?

The method "dict" in class "BaseModel" is deprecated we should use
model_dump instead.
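
The replacement is one-for-one, as in this sketch with a placeholder model:

```python
# Hedged sketch: Pydantic v2 deprecates BaseModel.dict() in favor of model_dump().
from pydantic import BaseModel

class Item(BaseModel):
    name: str
    qty: int = 1

item = Item(name="widget")
data = item.model_dump()   # preferred
# data = item.dict()       # deprecated, emits a warning under Pydantic v2
print(data)
```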

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-06 11:27:47 -08:00
Ashwin Bharambe
e8071b54dc fix: no skip_logger_removal for non-library client 2025-03-06 11:04:56 -08:00
Yuan Tang
14c9ebbae5
docs: Add CHANGELOG.md (#1440)
# What does this PR do?

@raghotham @ashwinb @yanxi0830 This adds a single changelog doc for
easier browsing based on our previous discussions.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-06 13:57:24 -05:00
Charlie Doern
8d86137ab2
docs: add information on how to set log level before running (#1430)
# What does this PR do?

Currently, how to set the log level is not documented for `build` && `run`.
Add documentation in building_distro.md.

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-06 10:54:14 -08:00
Xi Yan
bcb13c492f
test: revamp eval related integration tests (#1433)
# What does this PR do?
- revamp and clean up datasets/scoring/eval integration tests
- closes https://github.com/meta-llama/llama-stack/issues/1396

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
**dataset**
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/integration/datasetio/
```
<img width="842" alt="image"
src="https://github.com/user-attachments/assets/88fc2b6a-b496-47bf-bc0c-8fea48ba36ff"
/>

**scoring**
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring --text-model meta-llama/Llama-3.1-8B-Instruct --judge-model meta-llama/Llama-3.1-8B-Instruct
```
<img width="851" alt="image"
src="https://github.com/user-attachments/assets/50f46415-b44c-4c37-a6c3-076f2767adb3"
/>


**eval**
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/eval --text-model meta-llama/Llama-3.1-8B-Instruct --judge-model meta-llama/Llama-3.1-8B-Instruct
```
<img width="841" alt="image"
src="https://github.com/user-attachments/assets/8eb1c65c-3b39-4d66-8ff4-f471ca783e49"
/>


[//]: # (## Documentation)
2025-03-06 10:51:35 -08:00
Ashwin Bharambe
82e94fe22f
ci: add Github workflow which runs unittests in PR (#1442) 2025-03-05 21:23:28 -05:00
Ashwin Bharambe
e6ae557661 fix: update testing documentation 2025-03-05 17:41:13 -08:00
Ashwin Bharambe
2fe976ed0a
refactor(test): introduce --stack-config and simplify options (#1404)
You now run the integration tests with these options:

```bash
Custom options:
  --stack-config=STACK_CONFIG
                        a 'pointer' to the stack. this can be either be:
                        (a) a template name like `fireworks`, or
                        (b) a path to a run.yaml file, or
                        (c) an adhoc config spec, e.g.
                        `inference=fireworks,safety=llama-guard,agents=meta-
                        reference`
  --env=ENV             Set environment variables, e.g. --env KEY=value
  --text-model=TEXT_MODEL
                        comma-separated list of text models. Fixture name:
                        text_model_id
  --vision-model=VISION_MODEL
                        comma-separated list of vision models. Fixture name:
                        vision_model_id
  --embedding-model=EMBEDDING_MODEL
                        comma-separated list of embedding models. Fixture name:
                        embedding_model_id
  --safety-shield=SAFETY_SHIELD
                        comma-separated list of safety shields. Fixture name:
                        shield_id
  --judge-model=JUDGE_MODEL
                        comma-separated list of judge models. Fixture name:
                        judge_model_id
  --embedding-dimension=EMBEDDING_DIMENSION
                        Output dimensionality of the embedding model to use for
                        testing. Default: 384
  --record-responses    Record new API responses instead of using cached ones.
  --report=REPORT       Path where the test report should be written, e.g.
                        --report=/path/to/report.md

```

Importantly, if you don't specify any of the models (text-model,
vision-model, etc.) the relevant tests will get **skipped!**

This will make running tests somewhat more annoying since all options
will need to be specified. We will make this easier by adding some easy
wrapper yaml configs.

## Test Plan

Example:

```bash
ashwin@ashwin-mbp ~/local/llama-stack/tests/integration (unify_tests) $ 
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/test_text_inference.py \
   --text-model meta-llama/Llama-3.2-3B-Instruct 
```
2025-03-05 17:02:02 -08:00
Reid
a0d6b165b0
chore: remove unused build dir (#1379)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

- In an old PR, `BUILDS_BASE_DIR` was used in
`llama_stack/cli/stack/configure.py` (since removed):
https://github.com/meta-llama/llama-stack/pull/371/files
- Based on the current `build` code, it should only use
`DISTRIBS_BASE_DIR` to save it.

46b0a404e8/llama_stack/cli/stack/_build.py (L298)

46b0a404e8/llama_stack/cli/stack/_build.py (L301)
Please correct me if I understand this incorrectly; it should no longer
be needed in `run` now.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-05 15:40:00 -08:00
Ihar Hrachyshka
4d4be03176
fix: don't import from llama_models (#1436)
# What does this PR do?

Some imports were not switched to the in-tree copy of the modules.

This is a follow-up to:
https://github.com/meta-llama/llama-stack/pull/1344

Closes #1435

## Test Plan

Manually started the server...

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-05 15:30:38 -08:00
ehhuang
6cf79437b3
feat: support ClientTool output metadata (#1426)
# Summary:
Client side change in
https://github.com/meta-llama/llama-stack-client-python/pull/180
Changes the resume_turn API to accept `ToolResponse` instead of
`ToolResponseMessage`:
1. `ToolResponse` contains `metadata`
2. `ToolResponseMessage` is a concept for model inputs. Here we are just
submitting the outputs of tool execution.

# Test Plan:
Ran integration tests with newly added test using client tool with
metadata

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --record-responses
2025-03-05 14:30:27 -08:00
Ben Browning
ac717f38dc
chore: Reduce flakes in test_text_inference on smaller models (#1428)
# What does this PR do?

When running `tests/integration/inference/test_text_inference.py` on
smaller models, such as Llama-3.2-3B-Instruct, I sometimes get test
flakes where the model passes "San Francisco" as an argument to my tool
call instead of "San Francisco, CA" which is what we expect.

So, this expands upon that tool calling parameter's description to
explicitly state that both city and state are required. With this
change, the tool calling tests that are checking for this "San
Francisco, CA" value are always passing for me instead of sometimes
failing.
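
The kind of change involved is sketched below; the exact parameter schema used
by the test suite may differ.

```python
# Hedged sketch: spell out that both city and state are required in the tool
# parameter description, so smaller models emit "San Francisco, CA".
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a location.",
    "parameters": {
        "location": {
            "type": "string",
            "description": "The city and state (both required), e.g. San Francisco, CA",
            "required": True,
        }
    },
}
```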

## Test Plan

I test this locally via vLLM like:

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v \
tests/integration/inference/test_text_inference.py \
--inference-model "meta-llama/Llama-3.2-3B-Instruct" \
--vision-inference-model ""
```

I don't expect this would negatively impact the parameter generated for
this tool call by other models, as we're providing additional guidance
but not removing any of the existing guidance. However, I cannot easily
confirm that myself.

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-03-05 13:05:30 -08:00
Dinesh Yeduguru
b8535417e0
feat: record token usage for inference API (#1300)
# What does this PR do?
The inference router computes token usage metrics for all providers,
returns them as part of the response, and also logs them to telemetry.

## Test Plan
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run
~/.llama/distributions/fireworks/fireworks-run.yaml

```
curl --request POST \
  --url http://localhost:8321/v1/inference/chat-completion \
  --header 'content-type: application/json' \
  --data '{
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": {
        "type": "text",
        "text": "where do humans live"
      }
    }
  ],
  "stream": false
}' | jq .
{
  "metrics": [
    {
      "trace_id": "yjv1tf0jS1evOyPm",
      "span_id": "WqYKvg0_",
      "timestamp": "2025-02-27T18:55:10.770903Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "prompt_tokens",
      "value": 10,
      "unit": "tokens"
    },
    {
      "trace_id": "yjv1tf0jS1evOyPm",
      "span_id": "WqYKvg0_",
      "timestamp": "2025-02-27T18:55:10.770916Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "completion_tokens",
      "value": 411,
      "unit": "tokens"
    },
    {
      "trace_id": "yjv1tf0jS1evOyPm",
      "span_id": "WqYKvg0_",
      "timestamp": "2025-02-27T18:55:10.770919Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "total_tokens",
      "value": 421,
      "unit": "tokens"
    }
  ],
  "completion_message": {
    "role": "assistant",
    "content": "Humans live in various parts of the world, inhabiting almost every continent, country, and region. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica (research stations only)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Regions:** Humans live in diverse regions, including:\n\t* Deserts (e.g., Sahara, Mojave)\n\t* Forests (e.g., Amazon, Congo)\n\t* Grasslands (e.g., Prairies, Steppes)\n\t* Mountains (e.g., Himalayas, Andes)\n\t* Oceans (e.g., coastal areas, islands)\n\t* Tundras (e.g., Arctic, sub-Arctic)\n4. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near:\n\t* Coastlines\n\t* Rivers\n\t* Lakes\n\t* Mountains\n5. **Rural areas:** Some humans live in rural areas, such as:\n\t* Villages\n\t* Farms\n\t* Countryside\n6. **Islands:** Humans inhabit many islands, including:\n\t* Tropical islands (e.g., Hawaii, Maldives)\n\t* Arctic islands (e.g., Greenland, Iceland)\n\t* Continental islands (e.g., Great Britain, Ireland)\n7. **Extreme environments:** Humans also live in extreme environments, such as:\n\t* High-altitude areas (e.g., Tibet, Andes)\n\t* Low-altitude areas (e.g., Death Valley, Dead Sea)\n\t* Areas with extreme temperatures (e.g., Arctic, Sahara)\n\nOverall, humans have adapted to live in a wide range of environments and ecosystems around the world.",
    "stop_reason": "end_of_turn",
    "tool_calls": []
  },
  "logprobs": null
}
```

```
 LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/inference

======================================================================== short test summary info =========================================================================
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B:vis=11B-inference:chat_completion:tool_calling_tools_absent-True] - ValueError: Unsupported tool prompt format: ToolPromptFormat.json
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B:vis=11B-inference:chat_completion:tool_calling_tools_absent-False] - ValueError: Unsupported tool prompt format: ToolPromptFormat.json
FAILED tests/integration/inference/test_vision_inference.py::test_image_chat_completion_non_streaming[txt=8B:vis=11B] - fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'Failed to decode image cannot identify image f...
FAILED tests/integration/inference/test_vision_inference.py::test_image_chat_completion_streaming[txt=8B:vis=11B] - fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'Failed to decode image cannot identify image f...
========================================================= 4 failed, 16 passed, 23 xfailed, 17 warnings in 44.36s =========================================================
```
2025-03-05 12:41:45 -08:00
Ben Browning
9c4074ed49
fix: Gracefully handle no choices in remote vLLM response (#1424)
# What does this PR do?

This gracefully handles the case where the vLLM server responded to a
completion request with no choices, which can happen in certain vLLM
error situations. Previously, we'd error out with a stack trace about a
list index out of range. Now, we just log a warning to the user and move
past any chunks with an empty choices list.

A specific example of the type of stack trace this fixes:

```
  File "/app/llama-stack-source/llama_stack/providers/remote/inference/vllm/vllm.py", line 170, in _process_vllm_chat_completion_stream_response
    choice = chunk.choices[0]
             ~~~~~~~~~~~~~^^^
IndexError: list index out of range
```

Now, instead of erroring out with that stack trace, we log a warning
that vLLM failed to generate any completions and alert the user to check
the vLLM server logs for details.
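
The guard is roughly of this shape (a sketch, not the exact code in
`vllm.py`):

```python
# Hedged sketch: skip stream chunks whose choices list is empty instead of
# indexing into them. Names below are illustrative.
import logging

log = logging.getLogger(__name__)

async def process_stream(stream):
    async for chunk in stream:
        if not chunk.choices:
            log.warning("vLLM returned a chunk with no choices; check the vLLM server logs")
            continue
        yield chunk.choices[0]
```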

This is related to #1277 and addresses the stack trace shown in that
issue, although does not in and of itself change the functional behavior
of vLLM tool calling.

## Test Plan

As part of this fix, I added new unit tests to trigger this same error
and verify it no longer happens. That is
`test_process_vllm_chat_completion_stream_response_no_choices` in the
new `tests/unit/providers/inference/test_remote_vllm.py`. I also added a
couple of more tests to trigger and verify the last couple of remote
vllm provider bug fixes - specifically a test for #1236 (builtin tool
calling) and #1325 (vLLM <= v0.6.3).

This required fixing the signature of
`_process_vllm_chat_completion_stream_response` to accept the actual
type of chunks it was getting passed - specifically changing from our
openai_compat `OpenAICompatCompletionResponse` to
`openai.types.chat.chat_completion_chunk.ChatCompletionChunk`. It was
not actually getting passed `OpenAICompatCompletionResponse` objects
before, and was using attributes that didn't exist on those objects. So,
the signature now matches the type of object it's actually passed.

Run these new unit tests like this:

```
pytest tests/unit/providers/inference/test_remote_vllm.py
```

Additionally, I ensured the existing `test_text_inference.py` tests
passed via:

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v tests/integration/inference/test_text_inference.py \
--inference-model "meta-llama/Llama-3.2-3B-Instruct" \
--vision-inference-model ""
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-03-05 15:07:54 -05:00
Xi Yan
bcc5370d2e
feat: effective agent workflow notebook (#1372)
# What does this PR do?

- Add Notebook: Build and Monitor Agent Workflows with Llama Stack +
Anthropic's Best Practice
- Better reviewed in:
https://github.com/meta-llama/llama-stack/blob/effective_agents/docs/notebooks/Llama_Stack_Agent_Workflows.ipynb
- Closes https://github.com/meta-llama/llama-stack/issues/1371

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Agent_Workflows.ipynb
```

<img width="671" alt="image"
src="https://github.com/user-attachments/assets/e5a7e312-ab3d-406a-a0f8-3b1d836e7b46"
/>


[//]: # (## Documentation)
2025-03-05 11:53:25 -08:00