Commit graph

1590 commits

Author SHA1 Message Date
Kelly Brown
d33b8ea3dc
docs: Small nits in llama CLI reference (#1542)
**Description:** Fixes some small nits in the llama CLI reference
Note: There are a few nits in this PR, but also has some small
suggestions, feel free to close if not necessary
2025-03-11 10:12:18 -07:00
Ihar Hrachyshka
c3d7d17bc4
chore: fix typing hints for get_provider_impl deps arguments (#1544)
# What does this PR do?

It's a dict that may contain different types, as per
resolver:instantiate_provider implementation. (AFAIU it also never
contains ProviderSpecs, but *instances* of provider implementations.)

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

mypy passing if enabled checks for these modules. (See #1543)

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-11 10:07:28 -07:00
Ihar Hrachyshka
04106b94aa
docs: Remove duplicate docs on api docs generator (#1534)
# What does this PR do?

Since #892, we also need to install ruamel. Instead of maintaining the
list of script dependencies in multiple places, remove it and assume
developers read CONTRIBUTING.md docs.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Just docs.

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-11 10:01:46 -07:00
Ihar Hrachyshka
0e73186a11
fix: Add missing shutdown handler for TorchtunePostTrainingImpl (#1535)
# What does this PR do?

Added missing shutdown handler. (Currently empty.)

Without it, when server shuts down, it posts the following warning:

```
__main__:129 server: No shutdown method for TorchtunePostTrainingImpl
```

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

(The test plan assumes shutdown logic is fixed, see #1495)

Without the patch:

```
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO     2025-03-10 20:56:43,961 __main__:140 server: Shutting down
INFO     2025-03-10 20:56:43,962 __main__:124 server: Shutting down DatasetsRoutingTable
INFO     2025-03-10 20:56:43,964 __main__:124 server: Shutting down DatasetIORouter
INFO     2025-03-10 20:56:43,965 __main__:124 server: Shutting down ScoringFunctionsRoutingTable
INFO     2025-03-10 20:56:43,966 __main__:124 server: Shutting down ScoringRouter
INFO     2025-03-10 20:56:43,967 __main__:124 server: Shutting down ModelsRoutingTable
INFO     2025-03-10 20:56:43,968 __main__:124 server: Shutting down InferenceRouter
INFO     2025-03-10 20:56:43,969 __main__:124 server: Shutting down ShieldsRoutingTable
INFO     2025-03-10 20:56:43,971 __main__:124 server: Shutting down SafetyRouter
INFO     2025-03-10 20:56:43,972 __main__:124 server: Shutting down VectorDBsRoutingTable
INFO     2025-03-10 20:56:43,973 __main__:124 server: Shutting down VectorIORouter
INFO     2025-03-10 20:56:43,974 __main__:124 server: Shutting down ToolGroupsRoutingTable
INFO     2025-03-10 20:56:43,975 __main__:124 server: Shutting down ToolRuntimeRouter
INFO     2025-03-10 20:56:43,976 __main__:124 server: Shutting down MetaReferenceAgentsImpl
INFO     2025-03-10 20:56:43,977 __main__:124 server: Shutting down TelemetryAdapter
INFO     2025-03-10 20:56:43,978 __main__:124 server: Shutting down TorchtunePostTrainingImpl
WARNING  2025-03-10 20:56:43,979 __main__:129 server: No shutdown method for TorchtunePostTrainingImpl
INFO     2025-03-10 20:56:43,979 __main__:124 server: Shutting down BenchmarksRoutingTable
INFO     2025-03-10 20:56:43,980 __main__:124 server: Shutting down EvalRouter
INFO     2025-03-10 20:56:43,981 __main__:124 server: Shutting down DistributionInspectImpl
INFO:     Application shutdown complete.
INFO:     Finished server process [33862]
```

Run with the patch and observe no warning:

```
$ kill -INT $(ps ax | grep  llama_stack.distribution.server.server | grep -v nvim | awk -e '{print $1}' | sort | head -n 1)
```

```
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO     2025-03-11 00:32:56,863 __main__:140 server: Shutting down
INFO     2025-03-11 00:32:56,864 __main__:124 server: Shutting down DatasetsRoutingTable
INFO     2025-03-11 00:32:56,866 __main__:124 server: Shutting down DatasetIORouter
INFO     2025-03-11 00:32:56,867 __main__:124 server: Shutting down ScoringFunctionsRoutingTable
INFO     2025-03-11 00:32:56,868 __main__:124 server: Shutting down ScoringRouter
INFO     2025-03-11 00:32:56,869 __main__:124 server: Shutting down ModelsRoutingTable
INFO     2025-03-11 00:32:56,870 __main__:124 server: Shutting down InferenceRouter
INFO     2025-03-11 00:32:56,871 __main__:124 server: Shutting down ShieldsRoutingTable
INFO     2025-03-11 00:32:56,872 __main__:124 server: Shutting down SafetyRouter
INFO     2025-03-11 00:32:56,873 __main__:124 server: Shutting down VectorDBsRoutingTable
INFO     2025-03-11 00:32:56,874 __main__:124 server: Shutting down VectorIORouter
INFO     2025-03-11 00:32:56,875 __main__:124 server: Shutting down ToolGroupsRoutingTable
INFO     2025-03-11 00:32:56,876 __main__:124 server: Shutting down ToolRuntimeRouter
INFO     2025-03-11 00:32:56,877 __main__:124 server: Shutting down MetaReferenceAgentsImpl
INFO     2025-03-11 00:32:56,878 __main__:124 server: Shutting down TelemetryAdapter
INFO     2025-03-11 00:32:56,879 __main__:124 server: Shutting down TorchtunePostTrainingImpl
INFO     2025-03-11 00:32:56,880 __main__:124 server: Shutting down BenchmarksRoutingTable
INFO     2025-03-11 00:32:56,881 __main__:124 server: Shutting down EvalRouter
INFO     2025-03-11 00:32:56,882 __main__:124 server: Shutting down DistributionInspectImpl

```

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-11 10:01:09 -07:00
Ashwin Bharambe
e13c92f269
revert: feat(server): Use system packages for execution (#1551)
Reverts meta-llama/llama-stack#1252

The above PR breaks the following invocation:
```bash
llama stack run ~/.llama/distributions/together/together-run.yaml
```
2025-03-11 09:58:25 -07:00
Dinesh Yeduguru
ead9397e22
fix: tracing fixes for trace context propogation across coroutines (#1522)
# What does this PR do?
This PR has two fixes needed for correct trace context propagation
across asycnio boundary
Fix 1: Start using context vars to store the global trace context.
This is needed since we cannot use the same trace context across
coroutines since the state is shared. each coroutine
should have its own trace context so that each of it can start storing
its state correctly.
Fix 2: Start a new span for each new coroutines started for running
shields to keep the span tree clean


## Test Plan

### Integration tests with server
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run
~/.llama/distributions/together/together-run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
server logs:
https://gist.github.com/dineshyv/51ac5d9864ed031d0d89ce77352821fe
test logs:
https://gist.github.com/dineshyv/e66acc1c4648a42f1854600609c467f3
 
### Integration tests with library client
LLAMA_STACK_CONFIG=fireworks pytest -s --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct

logs: https://gist.github.com/dineshyv/ca160696a0b167223378673fb1dcefb8

### Apps test with server:
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
```
server logs:
https://gist.github.com/dineshyv/1717a572d8f7c14279c36123b79c5797
app logs:
https://gist.github.com/dineshyv/44167e9f57806a0ba3b710c32aec02f8
2025-03-11 07:12:48 -07:00
Botao Chen
e3edca7739
feat: [new open benchmark] Math 500 (#1538)
## What does this PR do?
Created a new math_500 open-benchmark based on OpenAI's [Let's Verify
Step by Step](https://arxiv.org/abs/2305.20050) paper and hugging face's
[HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)
dataset.

The challenge part of this benchmark is to parse the generated and
expected answer and verify if they are same. For the parsing part, we
refer to [Minerva: Solving Quantitative Reasoning Problems with Language
Models](https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/).

To simply the parse logic, as the next step, we plan to also refer to
what [simple-eval](https://github.com/openai/simple-evals) is doing,
using llm as judge to check if the generated answer matches the expected
answer or not


## Test Plan
on sever side, spin up a server with open-benchmark template `llama
stack run llama_stack/templates/open-benchamrk/run.yaml`

on client side, issue an open benchmark eval request `llama-stack-client
--endpoint xxx eval run-benchmark "meta-reference-math-500" --model-id
"meta-llama/Llama-3.3-70B-Instruct" --output-dir "/home/markchen1015/"
--num-examples 20` and get ther aggregated eval results
<img width="238" alt="Screenshot 2025-03-10 at 7 57 04 PM"
src="https://github.com/user-attachments/assets/2c9da042-3b70-470e-a7c4-69f4cc24d1fb"
/>

check the generated answer and the related scoring and they make sense
2025-03-10 20:38:28 -07:00
Courtney Pacheco
ff853ccc38
fix: Use --with-editable to capture accurate code coverage reporting (#1532)
# What does this PR do?
I created a PR earlier today, but I realized the code coverage reporting
isn't correct: #1512

Essentially, we need to use `--with-editable` to enable develop/editable
mode through `uv`. Using editable mode will create a package.egg-link
file, and that allows pytest to accurately capture code coverage.

Before, some files had "0%" or "100%" coverage, which isn't accurate:

<img width="1455" alt="Screenshot 2025-03-10 at 10 01 53 AM"
src="https://github.com/user-attachments/assets/c425515a-9ecd-4962-a2d4-18cd16d12f25"
/>

More info on `--with-editable`:
https://docs.astral.sh/uv/reference/cli/#uv-run--with-editable

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
Tested locally

<img width="775" alt="Screenshot 2025-03-10 at 7 00 14 PM"
src="https://github.com/user-attachments/assets/31141318-5cf6-4666-8676-b5d8c8d2e719"
/>

Screenshot from CI:

<img width="1000" alt="Screenshot 2025-03-10 at 7 07 57 PM"
src="https://github.com/user-attachments/assets/47092909-ff8d-4e97-80dc-2a16d948405a"
/>

[//]: # (## Documentation)

Signed-off-by: Courtney Pacheco <6019922+courtneypacheco@users.noreply.github.com>
2025-03-10 19:30:28 -04:00
Ashwin Bharambe
dc84bc755a
fix: revert to using faiss for ollama distro (#1530)
This is unfortunate because `sqlite-vec` seems promising. But its PIP
package is not quite complete. It does not have binary for arm64 (I
think, or maybe it even lacks 64 bit builds?) which results in the arm64
container resulting in
```
File "/usr/local/lib/python3.10/site-packages/sqlite_vec/init.py", line 17, in load
    conn.load_extension(loadable_path())
sqlite3.OperationalError: /usr/local/lib/python3.10/site-packages/sqlite_vec/vec0.so: wrong ELF class: ELFCLASS32
```

To get around I tried to install from source via `uv pip install
sqlite-vec --no-binary=sqlite-vec` however it even lacks a source
distribution which makes that impossible.

## Test Plan

Build the container locally using: 

```bash
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type container
```

Run the container as: 

```
podman run --privileged -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
    --env INFERENCE_MODEL=$INFERENCE_MODEL \
    --env OLLAMA_URL=http://host.containers.internal:11434 \
    -v ~/local/llama-stack:/app/llama-stack-source 
    localhost/distribution-ollama:dev --port $LLAMA_STACK_PORT
```

Verify the container starts up correctly. Without this patch, it would
encounter the ELFCLASS32 error.
2025-03-10 16:15:17 -07:00
Sébastien Han
21e39633d8
feat(server): Use system packages for execution (#1252)
# What does this PR do?

Users prefer to rely on the main CLI rather than invoking the server
through a Python module. Users interact with a high-level CLI rather
than needing to know internal module structures.

Now, when running llama stack run <path-to-config>, the server will
attempt to use the system package or a virtual environment if one is
active.

This also eliminates the current process dependency chain when running
from a virtual environment:

-> llama stack run
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> start_env.sh

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
-> python -m server...

Signed-off-by: Sébastien Han <seb@redhat.com>

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Run:

```
ollama run llama3.2:3b-instruct-fp16 --keepalive=2m &
llama stack run ./llama_stack/templates/ollama/run.yaml --disable-ipv6
```

Notice that the server starts and shutdowns normally.

[//]: # (## Documentation)

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-10 16:01:03 -07:00
Reid
feacf89548
docs: improve integration test doc (#1502)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

It should use `export` for env var for api key.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-10 15:50:46 -07:00
Sébastien Han
91b1b92908
build: revamp "test" dependencies from pyproject (#1468)
# What does this PR do?

The `test` section has been updated to include only the essential
dependencies needed for running integration tests, which are shared
across all providers. If a provider requires additional dependencies,
please add them to your environment separately. When using uv to
run your tests, you can specify extra dependencies with the
`--with` flag.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-10 15:43:16 -07:00
Sébastien Han
201a7567ef
test: add inspect unit test (#1417)
# What does this PR do?

Add unit tests for the inspect endpoint.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

$ ollama run llama3.2:3b-instruct-fp16 --keepalive=60m &
$ LLAMA_STACK_CONFIG=./llama_stack/templates/ollama/run.yaml uv run
pytest -v -s tests/integration/inspect/test_inspect.py

/Users/leseb/Documents/AI/llama-stack/.venv/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207:
PytestDeprecationWarning: The configuration option
"asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the
fixture caching scope. Future versions of pytest-asyncio will default
the loop scope for asynchronous fixtures to function scope. Set the
default fixture loop scope explicitly in order to avoid unexpected
behavior in the future. Valid fixture loop scopes are: "function",
"class", "module", "package", "session"


warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
============================================== test session starts
==============================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 --
/Users/leseb/Documents/AI/llama-stack/.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform':
'macOS-15.3.1-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4',
'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1',
'asyncio': '0.25.3', 'anyio': '4.8.0', 'nbval': '0.11.0'}}
rootdir: /Users/leseb/Documents/AI/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0,
nbval-0.11.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items


tests/integration/inspect/test_inspect.py::TestInspect::test_health[txt=8B]
PASSED

tests/integration/inspect/test_inspect.py::TestInspect::test_version[txt=8B]
PASSED

========================================= 2 passed, 3 warnings in 2.26s
===================================
```

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-10 15:36:18 -07:00
Charlie Doern
7559b4055e
chore: add color to Env Variable message (#1525)
# What does this PR do?

currently the `"Environment variable LLAMA_STACK_LOGGING found"` message
is printed with no color switch to cprint and highlight in yellow for
visibility

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-10 15:29:40 -07:00
Ihar Hrachyshka
a64021bb47
fix: Disable async loop warning messages during test run (#1526)
# What does this PR do?

The test class by default enables debug mode, which produces some
unexpected warnings like:

```
tests/unit/models/test_prompt_adapter.py::PrepareMessagesTests::test_completion_message_encoding
WARNING  2025-03-10 20:41:48,577 asyncio:1904 uncategorized: Executing <Task pending name='Task-1'
  coro=<IsolatedAsyncioTestCase._asyncioLoopRunner() running at
  /home/ec2-user/.local/share/uv/python/cpython-3.10.16-linux-x86_64-gnu/lib/python3.10/unittest/async_case.py:95
  > wait_for=<Future pending cb=[Task.task_wakeup()] created at
  /home/ec2-user/.local/share/uv/python/cpython-3.10.16-linux-x86_64-gnu/lib/python3.10/asyncio/base_events.py:42
  9> created at
  /home/ec2-user/.local/share/uv/python/cpython-3.10.16-linux-x86_64-gnu/lib/python3.10/unittest/async_case.py:11
  7> took 0.231 seconds
PASSED
```

I suggest we disable these since they are not very useful and can
confuse other developers.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Run tests. The warnings are no longer seen.

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-10 15:29:08 -07:00
ehhuang
0e3c0cf8de
fix: server logging (#1521)
Summary:

Test Plan:

ERROR 2025-03-10 10:53:00,804 __main__:239 server: Error executing
endpoint route='/v1/inference/chat-completion'
         method='post'
2025-03-10 15:25:23 -07:00
Sarthak Deshpande
921f8b1125
chore: Together async client (#1510)
# What does this PR do?
Uses together async client instead of sync client

[//]: # (If resolving an issue, uncomment and update the line below)

## Test Plan
Command to run the test is in the image below(2 tests fail, and they
were failing for the old stable version as well with the same errors.)
<img width="1689" alt="image"
src="https://github.com/user-attachments/assets/503db720-5379-425d-9844-0225010e41a1"
/>


[//]: # (## Documentation)

---------

Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
2025-03-10 15:25:01 -07:00
Ashwin Bharambe
bc8daf7fea
fix: include jinja2 as a core llama-stack dependency (#1529)
We removed `llama-models` as a dep which was pulling this in for us
previously. This did not get caught in the release process because the
distros we use for testing (fireworks / together) pull that in via
sentence transformers which we don't use in all distros (notably
ollama.)

See #1511 

## Test Plan

Ran `llama-stack-ops/actions/test-and-cut/main.sh` with
`ONLY_TEST_DONT_CUT=1 COMMIT_ID=origin/fix_jinja2` and by making it
build the ollama docker. Ran the docker to ensure it does not error out
with jinja2 dependency error. (Unfortunately there is another error with
sqlite_vec there.)
2025-03-10 14:59:11 -07:00
James Kunstle
735892cbd2
refactor: ImageType to LlamaStackImageType (#1500)
This disambiguates "Image" term from "container image" alternative usage
and allows for:

```python

if image_type == LlamaStackImagetype.venv:
	...

```

accesses rather than `ImageType.venv.value`

# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

Changes enum use to comply with semantic python styling and naming
conventions.

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

Refactor was automated and small so simple run-through of creating
images was done.

Signed-off-by: James Kunstle <jkunstle@redhat.com>
2025-03-10 17:12:53 -04:00
Courtney Pacheco
6dbac3beed
chore: Display code coverage for unit tests in PR builds (#1512)
# What does this PR do?
This PR allows for unit test code coverage % to be reported in PR
builds. Currently, today's output tells the end user which tests passed
and which tests failed:

<img width="744" alt="Screenshot 2025-03-10 at 9 44 28 AM"
src="https://github.com/user-attachments/assets/40b1a578-951f-4b74-8a37-a39c039b1d7e"
/>

If a contributor is creating a new module within Llama Stack and starts
writing unit tests for that module, it might be difficult for Llama
Stack maintainers to immediately determine the code coverage percentage
for that new module.

To allow for code coverage reporting in the CI, we simply need to
install `pytest-cov` so we can use the `--cov` flag with the existing
`pytest` command.

Ideally, it would be nicer to have a bot report code coverage, but this
PR can be a temporary solution.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
I ran these changes locally:

<img width="1455" alt="Screenshot 2025-03-10 at 10 01 53 AM"
src="https://github.com/user-attachments/assets/dfd765c6-5979-42a3-b899-7713a3f202e6"
/>

PR build to confirm the expected behavior:
<img width="1326" alt="Screenshot 2025-03-10 at 12 47 36 PM"
src="https://github.com/user-attachments/assets/fe94f1e6-fbb5-4e57-9902-197502c50621"
/>


[//]: # (## Documentation)

Signed-off-by: Courtney Pacheco <6019922+courtneypacheco@users.noreply.github.com>
2025-03-10 16:27:33 -04:00
Reid
0b8cb830b9
docs: update ollama doc url (#1508)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

It should changed in this pr
https://github.com/meta-llama/llama-stack/pull/1190/files#diff-53e3f35ced54ee5e57dc8b0d3b04770ed84f2f6434c6f492f42569b3c2810ecd

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-10 13:04:59 -07:00
Xi Yan
23278d1e5d
fix: update getting_started structured decoding cell (#1523)
# What does this PR do?

- Together's inference only supports 3.1 for structured decoding

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
pytest -v -s --nbval-lax ./docs/getting_started.ipynb
```

[//]: # (## Documentation)
2025-03-10 13:03:57 -07:00
Reid
8814111da1
docs: improve eval doc (#1501)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-10 11:38:07 -07:00
ehhuang
d045b8830f
docs: update prompt for websearch example (#1520)
Summary:
model is sometimes reluctant to use tools by default.

Test Plan:
run in notebook
2025-03-10 10:42:05 -07:00
Sarthak Deshpande
a9c5d3cd3d
chore: made inbuilt tools blocking calls into async non blocking calls (#1509)
# What does this PR do?
This PR converts blocking calls for in built tools like wolfram, brave,
tavily and bing into non blocking async calls
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
pytest -s -v tool_runtime/test_builtin_tools.py --stack-config=together
--text-model=meta-llama/Llama-3.1-8B-Instruct
Used the command above to get the below results
<img width="1710" alt="image"
src="https://github.com/user-attachments/assets/76b0ca06-f6e4-45fa-a114-0449bef2325b"
/>


<img width="1389" alt="image"
src="https://github.com/user-attachments/assets/5220ccbb-7882-4240-b17e-f362ad46d25b"
/>

<img width="1432" alt="image"
src="https://github.com/user-attachments/assets/bb93a41e-e82a-4c98-a22d-6b0e320aa974"
/>

[//]: # (## Documentation)

---------

Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
2025-03-09 16:59:24 -07:00
Ashwin Bharambe
70ff226b6a fix(library_client): ensure pending asyncio tasks like generator athrow are executed 2025-03-09 16:17:27 -07:00
Ashwin Bharambe
ba917a9c48 fix: make sure readthedocs is triggered if pyproject.toml is updated 2025-03-08 23:05:10 -08:00
Ashwin Bharambe
205661bc78
fix: Use re-entrancy and concurrency safe context managers for provider data (#1498)
Concurrent requests should not trample (or reuse) each others' provider
data. Provider data should be scoped to each request.

## Test Plan

Set the uvicorn server to have a single worker process + thread by
updating the config:
```python
    uvicorn_config = {
        ...
        "workers": 1,
        "loop": "asyncio",
    }
```

Then perform the following steps on `origin/main` (without this change).

(1) Run the server using `llama stack run dev` without having
`FIREWORKS_API_KEY` in the environment.

(2) Run a test by specifying the FIREWORKS_API_KEY env var so it gets
stored in the thread local
```
pytest -s -v tests/integration/inference/test_text_inference.py \
    --stack-config http://localhost:8321 \
    --text-model accounts/fireworks/models/llama-v3p1-8b-instruct \
    -k test_text_chat_completion_with_tool_calling_and_streaming \
     --env FIREWORKS_API_KEY=<...>
``` 
Ensure you don't have any other API keys in the environment (otherwise
the bug will not reproduce due to other specifics in our testing code.)
Verify this works.

(3) Run the same command again without specifying FIREWORKS_API_KEY. See
that the request actually succeeds when it *should have failed*.


----
Now do the same tests on this branch, verify step (3) results in
failure.

Finally, run the full `test_text_inference.py` test suite with this
change, verify it succeeds.
2025-03-08 22:56:30 -08:00
Yuan Tang
6033e6893e
docs: Add v0.1.6 release notes to changelog (#1506)
# What does this PR do?

Adds v0.1.6 release notes to changelog.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-08 16:20:08 -08:00
Ashwin Bharambe
0db3a2f511 fix: run pre-commit due to release script bumps 2025-03-07 16:31:42 -08:00
github-actions[bot]
c4e527b21c Bump version to 0.1.6 2025-03-08 00:25:40 +00:00
ehhuang
23e39cc3c4
fix: handle log errors (#1499)
Summary:
| File
"/Users/erichuang/projects/llama-stack/llama_stack/distribution/server/server.py",
line 213, in sse_generator
    |     logger.exception(f"Error in sse_generator: {e}")
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1864, in exception
    |     self.log(ERROR, msg, *args, exc_info=exc_info, **kwargs)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1879, in log
    |     self.logger.log(level, msg, *args, **kwargs)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1547, in log
    |     self._log(level, msg, args, **kwargs)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1624, in _log
    |     self.handle(record)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1634, in handle
    |     self.callHandlers(record)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 1696, in callHandlers
    |     hdlr.handle(record)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py",
line 968, in handle
    |     self.emit(record)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py",
line 167, in emit
    |     message_renderable = self.render_message(record, message)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py",
line 193, in render_message
| message_text = Text.from_markup(message) if use_markup else
Text(message)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py",
line 287, in from_markup
| rendered_text = render(text, style, emoji=emoji,
emoji_variant=emoji_variant)
| File
"/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py",
line 167, in render
    |     raise MarkupError(
| rich.errors.MarkupError: closing tag '[/INST]' at position 105 doesn't
match any open tag


Test Plan:
reran failing rag_with_vector_db example
2025-03-07 15:58:26 -08:00
Botao Chen
ade76e4a69
fix: update the open benchmark eval doc (#1497)
## What does this PR do?
add proper links to the doc

## test
preview the doc 

<img width="1304" alt="Screenshot 2025-03-07 at 3 03 22 PM"
src="https://github.com/user-attachments/assets/0a0e2a3d-2420-4af0-99c3-a4786855fae0"
/>

<img width="1303" alt="Screenshot 2025-03-07 at 3 03 32 PM"
src="https://github.com/user-attachments/assets/e11844e7-ee8a-4a64-8617-abafa02b2868"
/>
2025-03-07 15:05:27 -08:00
Botao Chen
89e449c2cb
fix: Fix open benchmark template (#1496)
## What does this PR do?
Delete the open_benchmark template which was generated by the auto
codegen by accident
2025-03-07 14:49:10 -08:00
dependabot[bot]
d63e798f6d
build(deps): bump thollander/actions-comment-pull-request from 2 to 3 (#1485) 2025-03-07 17:31:53 -05:00
dependabot[bot]
9506012736
build(deps): bump actions/upload-artifact from 3 to 4 (#1486) 2025-03-07 17:31:00 -05:00
Xi Yan
9028407386
fix: clean up detailed history for CHANGELOG (#1494)
# What does this PR do?

- do not dump all commit history in CHANGELOG
cc @terrytangyuan 

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
python scripts/gen-changelog.py
```

[//]: # (## Documentation)
2025-03-07 14:03:54 -08:00
ehhuang
3b4f3a6b15
test: update recorded fixtures (#1493)
Summary:

Test Plan:
2025-03-07 13:58:38 -08:00
ehhuang
b0cc38b269
test: fix recordable mocks cache key (#1492)
Summary:

CI writes files to /tmp

[{"__module__": "llama_stack.apis.inference.inference", "__pydantic__":
"SystemMessage", "data": {"content": "You are a helpful assistant",
"role": "system"}}, {"__module__":
"llama_stack.apis.inference.inference", "__pydantic__": "UserMessage",
"data": {"content": "Here is a csv file, can you describe it?",
"context": null, "role": "user"}}, {"__module__":
"llama_stack.apis.inference.inference", "__pydantic__":
"ToolResponseMessage", "data": {"call_id": "", "content": [{"text": "#
User provided a file accessible to you at
\\"/tmp/tmp7k7dg6qk/gcDtT5M8inflation.csv\\"\\nYou can use
code_interpreter to load and inspect it.", "type": "text"}], "role":
"tool", "tool_name": {"__enum__": "BuiltinTool", "__module__":
"llama_stack.models.llama.datatypes", "value": "code_interpreter"}}}]],
{"response_format": null, "sa

Test Plan:
2025-03-07 13:45:25 -08:00
ehhuang
a1cdace093
test: image downloading is flaky (#1491)
Summary:

Test Plan:
2025-03-07 13:39:26 -08:00
Fred Reiss
a8d0cdaf37
feat: updated inline vllm inference provider (#880)
# What does this PR do?

This PR updates the inline vLLM inference provider in several
significant ways:
* Models are now attached at run time to instances of the provider via
the `.../models` API instead of hard-coding the model's full name into
the provider's YAML configuration.
* The provider supports models that are not Meta Llama models. Any model
that vLLM supports can be loaded by passing Huggingface coordinates in
the "provider_model_id" field. Custom fine-tuned versions of Meta Llama
models can be loaded by specifying a path on local disk in the
"provider_model_id".
* To implement full chat completions support, including tool calling and
constrained decoding, the provider now routes the `chat_completions` API
to a captive (i.e. called directly in-process, not via HTTPS) instance
of vLLM's OpenAI-compatible server .
* The `logprobs` parameter and completions API are also working.

## Test Plan

Existing tests in
`llama_stack/providers/tests/inference/test_text_inference.py` have good
coverage of the new functionality. These tests can be invoked as
follows:

```
cd llama-stack && pytest \
    -vvv \
    llama_stack/providers/tests/inference/test_text_inference.py \
    --providers inference=vllm \
    --inference-model meta-llama/Llama-3.2-3B-Instruct
====================================== test session starts ======================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items                                                                               

llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%]

=========================== 9 passed, 13 warnings in 97.18s (0:01:37) ===========================

```

## Sources


## Before submitting

- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.

---------

Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-07 13:38:23 -08:00
ehhuang
acbae66b9d
chore: escape tool output for logging (#1490)
Summary:

error:


llama_stack/providers/inline/agents/meta_reference/agent_instance.py:1032:
in execute_tool_call_maybe
    logger.info(f"tool call {name} completed with result: {result}")

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1841:
in info
    self.log(INFO, msg, *args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1879:
in log
    self.logger.log(level, msg, *args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1547:
in log
    self._log(level, msg, args, **kwargs)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1624:
in _log
    self.handle(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1634:
in handle
    self.callHandlers(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1696:
in callHandlers
    hdlr.handle(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:968:
in handle
    self.emit(record)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:167:
in emit
    message_renderable = self.render_message(record, message)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:193:
in render_message
message_text = Text.from_markup(message) if use_markup else
Text(message)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py:287:
in from_markup
rendered_text = render(text, style, emoji=emoji,
emoji_variant=emoji_variant)

/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py:167:
in render
    raise MarkupError(
E rich.errors.MarkupError: closing tag '[/INST]' at position 3274
doesn't match any open tag

Test Plan:
2025-03-07 13:33:45 -08:00
Xi Yan
a55aab5958
fix: fix scoring tests (#1487)
# What does this PR do?
- fix scoring test

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring/test_scoring.py --text-model meta-llama/Llama-3.3-70B-Instruct --judge-model meta-llama/Llama-3.3-70B-Instruct
```

<img width="1061" alt="image"
src="https://github.com/user-attachments/assets/740f9e6e-a654-4265-9db1-61481515a852"
/>


[//]: # (## Documentation)
2025-03-07 13:13:41 -08:00
Sébastien Han
e6355bfc3b
ci: enable Dependabot for GitHub Actions (#1470)
# What does this PR do?

Add a Dependabot configuration file (.github/dependabot.yml) to enable
automated dependency updates for GitHub Actions. This ensures workflows
stay up to date with the latest versions, improving security and
reliability.

Dependabot is configured to:
- Monitor GitHub Actions dependencies.
- Check for updates in the workflow directory
- Run updates on a daily schedule.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-07 12:54:56 -08:00
Xi Yan
5a2b9e121c
fix: return result for together's get_params (#1484)
# What does this PR do?

- return results for together's get_params
- fix issue
<img width="1538" alt="image"
src="https://github.com/user-attachments/assets/c4cd3802-85ef-4ff3-b2fd-76737be2e4ff"
/>

- the `return params` was accidentally deleted in
https://github.com/meta-llama/llama-stack/pull/1362/files#diff-d9345410ea64589cee96487b22eab0d45f7497a80c25dca295cecd254decb204

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
npm test examples
```


[//]: # (## Documentation)
2025-03-07 12:52:26 -08:00
ehhuang
1257288361
build: add 'tiktoken' to deps (#1483)
Summary:

Test Plan:
2025-03-07 12:36:02 -08:00
ehhuang
124e8d7cfe
build: include .md (#1482)
Summary:

Test Plan:
2025-03-07 12:10:52 -08:00
Ben Browning
d86a893ead
fix: Swap to AsyncOpenAI client in remote vllm provider (#1459)
# What does this PR do?

This switches from an OpenAI client to the AsyncOpenAI client in the
remote vllm provider. The main benefit of this is that instead of each
client call being a blocking operation that was blocking our server
event loop, the client calls are now async operations that do not block
the event loop.

The actual fix is quite simple and straightforward. Creating a reliable
reproducer of this with a unit test that verifies we were blocking the
event loop before and are not blocking it any longer was a bit harder.
Some other inference providers have this same issue, so we may want to
make that simple delayed http server a bit more generic and pull it into
a common place as other inference providers get fixed.

(Closes #1457)

## Test Plan

I verified the unit tests and test_text_inference tests pass with this
change like below:

```
python -m pytest -v tests/unit
```

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v -s \
tests/integration/inference/test_text_inference.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-03-07 14:48:00 -05:00
ehhuang
256448c14e
fix(cli): llama model prompt-format (#1481)
Summary:

+ llama model prompt-format -m Llama3.2-11B-Vision-Instruct
Traceback (most recent call last):
  File "/tmp/tmp.gCwyyCcjoA/.venv/bin/llama", line 10, in <module>
    sys.exit(main())
File
"/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/llama.py",
line 50, in main
    parser.run(args)
File
"/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/llama.py",
line 44, in run
    args.func(args)
File
"/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/model/prompt_format.py",
line 59, in _run_model_template_cmd
    if args.list:
AttributeError: 'Namespace' object has no attribute 'list'

Test Plan:
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
2025-03-07 11:45:54 -08:00
Sébastien Han
ffa32af930
build: bump llama-stack-client version (#1469)
## What does this PR do?

Use 0.1.5.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-07 11:42:38 -08:00