# What does this PR do?
- do not dump all commit history in CHANGELOG
cc @terrytangyuan
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
python scripts/gen-changelog.py
```
[//]: # (## Documentation)
Summary:
CI writes files to /tmp
[{"__module__": "llama_stack.apis.inference.inference", "__pydantic__":
"SystemMessage", "data": {"content": "You are a helpful assistant",
"role": "system"}}, {"__module__":
"llama_stack.apis.inference.inference", "__pydantic__": "UserMessage",
"data": {"content": "Here is a csv file, can you describe it?",
"context": null, "role": "user"}}, {"__module__":
"llama_stack.apis.inference.inference", "__pydantic__":
"ToolResponseMessage", "data": {"call_id": "", "content": [{"text": "#
User provided a file accessible to you at
\\"/tmp/tmp7k7dg6qk/gcDtT5M8inflation.csv\\"\\nYou can use
code_interpreter to load and inspect it.", "type": "text"}], "role":
"tool", "tool_name": {"__enum__": "BuiltinTool", "__module__":
"llama_stack.models.llama.datatypes", "value": "code_interpreter"}}}]],
{"response_format": null, "sa
Test Plan:
# What does this PR do?
This PR updates the inline vLLM inference provider in several
significant ways:
* Models are now attached at run time to instances of the provider via
the `.../models` API instead of hard-coding the model's full name into
the provider's YAML configuration.
* The provider supports models that are not Meta Llama models. Any model
that vLLM supports can be loaded by passing Huggingface coordinates in
the "provider_model_id" field. Custom fine-tuned versions of Meta Llama
models can be loaded by specifying a path on local disk in the
"provider_model_id".
* To implement full chat completions support, including tool calling and
constrained decoding, the provider now routes the `chat_completions` API
to a captive (i.e. called directly in-process, not via HTTPS) instance
of vLLM's OpenAI-compatible server .
* The `logprobs` parameter and completions API are also working.
## Test Plan
Existing tests in
`llama_stack/providers/tests/inference/test_text_inference.py` have good
coverage of the new functionality. These tests can be invoked as
follows:
```
cd llama-stack && pytest \
-vvv \
llama_stack/providers/tests/inference/test_text_inference.py \
--providers inference=vllm \
--inference-model meta-llama/Llama-3.2-3B-Instruct
====================================== test session starts ======================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%]
=========================== 9 passed, 13 warnings in 97.18s (0:01:37) ===========================
```
## Sources
## Before submitting
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Summary:
error:
llama_stack/providers/inline/agents/meta_reference/agent_instance.py:1032:
in execute_tool_call_maybe
logger.info(f"tool call {name} completed with result: {result}")
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1841:
in info
self.log(INFO, msg, *args, **kwargs)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1879:
in log
self.logger.log(level, msg, *args, **kwargs)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1547:
in log
self._log(level, msg, args, **kwargs)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1624:
in _log
self.handle(record)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1634:
in handle
self.callHandlers(record)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:1696:
in callHandlers
hdlr.handle(record)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/logging/__init__.py:968:
in handle
self.emit(record)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:167:
in emit
message_renderable = self.render_message(record, message)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/logging.py:193:
in render_message
message_text = Text.from_markup(message) if use_markup else
Text(message)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/text.py:287:
in from_markup
rendered_text = render(text, style, emoji=emoji,
emoji_variant=emoji_variant)
/opt/homebrew/Caskroom/miniconda/base/envs/myenv/lib/python3.10/site-packages/rich/markup.py:167:
in render
raise MarkupError(
E rich.errors.MarkupError: closing tag '[/INST]' at position 3274
doesn't match any open tag
Test Plan:
# What does this PR do?
- fix scoring test
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring/test_scoring.py --text-model meta-llama/Llama-3.3-70B-Instruct --judge-model meta-llama/Llama-3.3-70B-Instruct
```
<img width="1061" alt="image"
src="https://github.com/user-attachments/assets/740f9e6e-a654-4265-9db1-61481515a852"
/>
[//]: # (## Documentation)
# What does this PR do?
Add a Dependabot configuration file (.github/dependabot.yml) to enable
automated dependency updates for GitHub Actions. This ensures workflows
stay up to date with the latest versions, improving security and
reliability.
Dependabot is configured to:
- Monitor GitHub Actions dependencies.
- Check for updates in the workflow directory
- Run updates on a daily schedule.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This switches from an OpenAI client to the AsyncOpenAI client in the
remote vllm provider. The main benefit of this is that instead of each
client call being a blocking operation that was blocking our server
event loop, the client calls are now async operations that do not block
the event loop.
The actual fix is quite simple and straightforward. Creating a reliable
reproducer of this with a unit test that verifies we were blocking the
event loop before and are not blocking it any longer was a bit harder.
Some other inference providers have this same issue, so we may want to
make that simple delayed http server a bit more generic and pull it into
a common place as other inference providers get fixed.
(Closes#1457)
## Test Plan
I verified the unit tests and test_text_inference tests pass with this
change like below:
```
python -m pytest -v tests/unit
```
```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=remote-vllm \
python -m pytest -v -s \
tests/integration/inference/test_text_inference.py \
--text-model "meta-llama/Llama-3.2-3B-Instruct"
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Summary:
+ llama model prompt-format -m Llama3.2-11B-Vision-Instruct
Traceback (most recent call last):
File "/tmp/tmp.gCwyyCcjoA/.venv/bin/llama", line 10, in <module>
sys.exit(main())
File
"/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/llama.py",
line 50, in main
parser.run(args)
File
"/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/llama.py",
line 44, in run
args.func(args)
File
"/tmp/tmp.gCwyyCcjoA/.venv/lib/python3.10/site-packages/llama_stack/cli/model/prompt_format.py",
line 59, in _run_model_template_cmd
if args.list:
AttributeError: 'Namespace' object has no attribute 'list'
Test Plan:
llama model prompt-format -m Llama3.2-11B-Vision-Instruct
## What does this PR do?
Use 0.1.5.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This commit introduces a new logging system that allows loggers to be
assigned
a category while retaining the logger name based on the file name. The
log
format includes both the logger name and the category, producing output
like:
```
INFO 2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by
tavily-search
```
Key features include:
- Category-based logging: Loggers can be assigned a category (e.g.,
"core", "server") when programming. The logger can be loaded like
this: `logger = get_logger(name=__name__, category="server")`
- Environment variable control: Log levels can be configured
per-category using the
`LLAMA_STACK_LOGGING` environment variable. For example:
`LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for
the "server"
and "core" categories.
- `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all
categories and
third-party libraries.
This provides fine-grained control over logging levels while maintaining
a clean and
informative log format.
The formatter uses the rich library which provides nice colors better
stack traces like so:
```
ERROR 2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown
task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at
/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146>
exception=UnboundLocalError("local variable 'loop' referenced before assignment")>
╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮
│ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown │
│ │
│ 175 │ │ except asyncio.CancelledError: │
│ 176 │ │ │ pass │
│ 177 │ │ finally: │
│ ❱ 178 │ │ │ loop.stop() │
│ 179 │ │
│ 180 │ loop = asyncio.get_running_loop() │
│ 181 │ loop.create_task(shutdown()) │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
UnboundLocalError: local variable 'loop' referenced before assignment
```
Co-authored-by: Ashwin Bharambe <@ashwinb>
Signed-off-by: Sébastien Han <seb@redhat.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml
INFO 2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml
INFO 2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration:
INFO 2025-03-03 21:55:35,928 __main__:380 [server]: apis:
- agents
```
[//]: # (## Documentation)
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Since we moved the move tests/client-sdk to tests/api in
https://github.com/meta-llama/llama-stack/pull/1376. The N999 rule is
not needed anymore. And furthermore in
abfbaf3c1b
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
using `formatter_class=argparse.ArgumentDefaultsHelpFormatter` displays
(default: DEFAULT_VALUE) for each flag. add this formatter class to
build and run to show users some default values like `conda`, `8321`,
etc
## Test Plan
ran locally with following output:
before:
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
[--image-type {conda,container,venv}]
config
Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
positional arguments:
config Path to config file to use for the run
options:
-h, --help show this help message and exit
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. Defaults to 8321
--image-name IMAGE_NAME
Name of the image to run. Defaults to the current conda environment
--disable-ipv6 Disable IPv6 support
--env KEY=VALUE Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times.
--tls-keyfile TLS_KEYFILE
Path to TLS key file for HTTPS
--tls-certfile TLS_CERTFILE
Path to TLS certificate file for HTTPS
--image-type {conda,container,venv}
Image Type used during the build. This can be either conda or container or venv.
```
after:
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
[--image-type {conda,container,venv}]
config
Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
positional arguments:
config Path to config file to use for the run
options:
-h, --help show this help message and exit
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. (default: 8321)
--image-name IMAGE_NAME
Name of the image to run. Defaults to the current conda environment (default: None)
--disable-ipv6 Disable IPv6 support (default: False)
--env KEY=VALUE Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: [])
--tls-keyfile TLS_KEYFILE
Path to TLS key file for HTTPS (default: None)
--tls-certfile TLS_CERTFILE
Path to TLS certificate file for HTTPS (default: None)
--image-type {conda,container,venv}
Image Type used during the build. This can be either conda or container or venv. (default: conda)
```
[//]: # (## Documentation)
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Ignores `pytest-report.xml`. The file is produced by the unit tests
github workflow.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Not needed.
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
Based on the client output changed, so the output is incorrect:
458e20702b/src/llama_stack_client/lib/cli/models/models.py (L52)
and
https://github.com/meta-llama/llama-stack/pull/1348#pullrequestreview-2654971315
previous discussion that no need to maintain the output, so remove it.
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
See https://github.com/meta-llama/llama-stack/pull/1171 which is the
original PR. Author: @zc277584121
feat: add [Milvus](https://milvus.io/) vectorDB
note: I use the MilvusClient to implement it instead of
AsyncMilvusClient, because when I tested AsyncMilvusClient, it would
raise issues about evenloop, which I think AsyncMilvusClient SDK is not
robust enough to be compatible with llama_stack framework.
## Test Plan
have passed the unit test and ene2end test
Here is my end2end test logs, including the client code, client log,
server logs from inline and remote settings
[test_end2end_logs.zip](https://github.com/user-attachments/files/18964391/test_end2end_logs.zip)
---------
Signed-off-by: ChengZi <chen.zhang@zilliz.com>
Co-authored-by: Cheney Zhang <chen.zhang@zilliz.com>
# What does this PR do?
- re-gen to fix agents test
- update test_custom_tool
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.3-70B-Instruct
```
<img width="1294" alt="image"
src="https://github.com/user-attachments/assets/63521532-b989-4cf2-8fe5-c7f057f1c4dc"
/>
[//]: # (## Documentation)
# What does this PR do?
Fix import errors due to `chardet` and `pypdf` not being installed while
imported from `url_utils.py`.
Closes#1432
## Test Plan
Now able to run the server with the config.
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling SamplingParams() directly in function argument defaults.
Instead, it either uses Field(default_factory=SamplingParams) for
Pydantic models or sets the default to None and instantiates
SamplingParams inside the function body when the argument is None.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
- add eval concept doc in Core Concept tab
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
<img width="1266" alt="image"
src="https://github.com/user-attachments/assets/8eb06a49-3c04-4899-805c-1b5349471f1f"
/>
cc @SLR722
[//]: # (## Documentation)
# Summary:
removes the use of pickle
# Test Plan:
Run the following with `--record-responses` first, then another time
without.
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
- update to use library client throughout
cc @jeffxtang
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
pytest -v -s --nbval-lax ./docs/getting_started.ipynb
```
[//]: # (## Documentation)
# What does this PR do?
The method "dict" in class "BaseModel" is deprecated we should use
model_dump instead.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
@raghotham @ashwinb @yanxi0830 This adds a single changelog doc for
easier browsing based on our previous discussions.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
currently logcat is not documented for build && run. Add documentation
in building_distro.md
Signed-off-by: Charlie Doern <cdoern@redhat.com>
You now run the integration tests with these options:
```bash
Custom options:
--stack-config=STACK_CONFIG
a 'pointer' to the stack. this can be either be:
(a) a template name like `fireworks`, or
(b) a path to a run.yaml file, or
(c) an adhoc config spec, e.g.
`inference=fireworks,safety=llama-guard,agents=meta-
reference`
--env=ENV Set environment variables, e.g. --env KEY=value
--text-model=TEXT_MODEL
comma-separated list of text models. Fixture name:
text_model_id
--vision-model=VISION_MODEL
comma-separated list of vision models. Fixture name:
vision_model_id
--embedding-model=EMBEDDING_MODEL
comma-separated list of embedding models. Fixture name:
embedding_model_id
--safety-shield=SAFETY_SHIELD
comma-separated list of safety shields. Fixture name:
shield_id
--judge-model=JUDGE_MODEL
comma-separated list of judge models. Fixture name:
judge_model_id
--embedding-dimension=EMBEDDING_DIMENSION
Output dimensionality of the embedding model to use for
testing. Default: 384
--record-responses Record new API responses instead of using cached ones.
--report=REPORT Path where the test report should be written, e.g.
--report=/path/to/report.md
```
Importantly, if you don't specify any of the models (text-model,
vision-model, etc.) the relevant tests will get **skipped!**
This will make running tests somewhat more annoying since all options
will need to be specified. We will make this easier by adding some easy
wrapper yaml configs.
## Test Plan
Example:
```bash
ashwin@ashwin-mbp ~/local/llama-stack/tests/integration (unify_tests) $
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/test_text_inference.py \
--text-model meta-llama/Llama-3.2-3B-Instruct
```