# What does this PR do?
uvicorn's `uvicorn.run` accepts a `log_level` argument; pass in the effective
level set by our logger.
Additionally, third-party libraries like httpx are using our logging
format but not honoring our log level.
This seems unintended, so loop through all loggers in the `loggerDict` and
apply the same log level we have set.
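A minimal sketch of the idea (the function and its arguments are illustrative, not the exact server code):
```python
import logging

import uvicorn


def run_server(app, port: int = 8321) -> None:
    logger = logging.getLogger(__name__)
    level = logger.getEffectiveLevel()

    # Third-party loggers (httpx, etc.) pick up our format via the root
    # handler, so give them our level too instead of leaving them at INFO.
    for name in logging.root.manager.loggerDict:
        logging.getLogger(name).setLevel(level)

    # uvicorn.run accepts log_level as a string or a numeric level.
    uvicorn.run(app, port=port, log_level=level)
```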
## Test Plan
before:
```
llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
Environment variable LLAMA_STACK_LOGGING found: all=warn
Using virtual environment: /Users/charliedoern/projects/Documents/llama-stack/venv
+ python -m llama_stack.distribution.server.server --yaml-config /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml --port 8321
Environment variable LLAMA_STACK_LOGGING found: all=warn
WARNING 2025-03-10 16:05:49,706 root:71 uncategorized: Warning: `bwrap` is not available. Code interpreter tool will
not work correctly.
INFO 2025-03-10 16:05:49,916 datasets:54 uncategorized: PyTorch version 2.5.1 available.
INFO 2025-03-10 16:05:50,010 httpx:1740 uncategorized: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200
OK"
INFO 2025-03-10 16:05:50,297 httpx:1740 uncategorized: HTTP Request: POST http://localhost:11434/api/pull "HTTP/1.1
200 OK"
INFO 2025-03-10 16:05:50,314 httpx:1740 uncategorized: HTTP Request: GET http://localhost:11434/api/tags "HTTP/1.1
200 OK"
INFO: Started server process [89663]
INFO: Waiting for application startup.
INFO: ASGI 'lifespan' protocol appears unsupported.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
after:
```
llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
Environment variable LLAMA_STACK_LOGGING found: all=warn
Using virtual environment: /Users/charliedoern/projects/Documents/llama-stack/venv
+ python -m llama_stack.distribution.server.server --yaml-config /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml --port 8321
Environment variable LLAMA_STACK_LOGGING found: all=warn
WARNING 2025-03-10 16:05:20,429 root:71 uncategorized: Warning: `bwrap` is not available. Code interpreter tool will
not work correctly.
INFO 2025-03-10 16:05:20,639 datasets:54 uncategorized: PyTorch version 2.5.1 available.
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
If an implementation raises `CancelledError` (e.g. when it runs its own async
loop for jobs), the main server shutdown handler gets confused and
doesn't attempt to shut down the main loop tasks.
While at it, this also fixes the following failure when that happens:
```
UnboundLocalError: cannot access local variable 'loop' where it is not
associated with a value
```
Shutdown handlers were not running because the lifespan logic had been broken
since ~Oct 2024. Fixed that too, and `lifespan` is now enforced (making
sure the server crashes when it fails to interact with the app through
middleware).
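A rough sketch of the provider-shutdown part of the fix, assuming a simplified `impls` mapping (not the exact server code):
```python
import asyncio


async def shutdown(impls: dict) -> None:
    """Shut down every implementation, tolerating CancelledError from any of them."""
    for impl_name, impl in impls.items():
        try:
            if hasattr(impl, "shutdown"):
                await asyncio.wait_for(impl.shutdown(), timeout=5)
        except asyncio.TimeoutError:
            print(f"Shutdown of {impl_name} timed out")
        except (asyncio.CancelledError, Exception) as e:
            # CancelledError does not derive from Exception on Python >= 3.8,
            # so it must be caught explicitly or it escapes the handler and
            # the remaining impls (and the event loop) never get stopped.
            print(f"Failed to shutdown {impl_name}: {e!r}")
```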
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Spotted while working on
https://github.com/meta-llama/llama-stack/pull/1437
One way to trigger it without the PR above is to add `raise CancelledError` in
any of the running providers' `shutdown` methods; then `kill -INT <pid>` the
server process.
Validated this with the following test patch:
```
diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py
index b85c463a..10dad83e 100644
--- a/llama_stack/distribution/server/server.py
+++ b/llama_stack/distribution/server/server.py
@@ -174,6 +174,7 @@ def handle_signal(app, signum, _) -> None:
except asyncio.CancelledError:
pass
finally:
+ logger.info("Stopping event loop")
loop.stop()
loop = asyncio.get_running_loop()
diff --git a/llama_stack/providers/inline/post_training/torchtune/post_training.py b/llama_stack/providers/inline/post_training/torchtune/post_training.py
index b837362d..163f43d8 100644
--- a/llama_stack/providers/inline/post_training/torchtune/post_training.py
+++ b/llama_stack/providers/inline/post_training/torchtune/post_training.py
@@ -3,6 +3,7 @@
#
# This source code is licensed under the terms described in the LICENSE file in
# the root directory of this source tree.
+import asyncio
from datetime import datetime
from typing import Any, Dict, Optional
@@ -43,6 +44,9 @@ class TorchtunePostTrainingImpl:
self.jobs = {}
self.checkpoints_dict = {}
+ async def shutdown(self) -> None:
+ raise asyncio.CancelledError("Shutdown")
+
async def supervised_fine_tune(
self,
job_uuid: str,
```
Without the fix:
```
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO: Shutting down
INFO: Finished server process [52099]
INFO 2025-03-07 23:25:33,548 __main__:143 server: Received signal SIGINT (2). Exiting gracefully...
INFO 2025-03-07 23:25:33,550 __main__:150 server: Shutting down DatasetsRoutingTable
INFO 2025-03-07 23:25:33,551 __main__:177 server: Stopping event loop
ERROR 2025-03-07 23:25:33,552 asyncio:1785 uncategorized: unhandled exception during asyncio.run() shutdown
task: <Task finished name='Task-12' coro=<handle_signal.<locals>.shutdown() done, defined at
/home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:145>
exception=UnboundLocalError("cannot access local variable 'loop' where it is not associated with a value")>
╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:178 in shutdown │
│ │
│ 175 │ │ │ pass │
│ 176 │ │ finally: │
│ 177 │ │ │ logger.info("Stopping event loop") │
│ ❱ 178 │ │ │ loop.stop() │
│ 179 │ │
│ 180 │ loop = asyncio.get_running_loop() │
│ 181 │ loop.create_task(shutdown()) │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
UnboundLocalError: cannot access local variable 'loop' where it is not associated with a value
```
With the fix, now seeing the following messages when the server is
killed:
```
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO: Shutting down
INFO: Finished server process [50836]
INFO 2025-03-07 23:20:35,182 __main__:143 server: Received signal SIGINT (2). Exiting gracefully...
INFO 2025-03-07 23:20:35,184 __main__:149 server: Shutting down DatasetsRoutingTable
ERROR 2025-03-07 23:20:35,185 __main__:158 server: Failed to shutdown DatasetsRoutingTable: {CancelledError()}
╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ /usr/lib64/python3.11/asyncio/tasks.py:476 in wait_for │
│ │
│ 473 │ try: │
│ 474 │ │ # wait until the future completes or the timeout │
│ 475 │ │ try: │
│ ❱ 476 │ │ │ await waiter │
│ 477 │ │ except exceptions.CancelledError: │
│ 478 │ │ │ if fut.done(): │
│ 479 │ │ │ │ return fut.result() │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
CancelledError
During handling of the above exception, another exception occurred:
╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:152 in shutdown │
│ │
│ 149 │ │ │ logger.info("Shutting down %s", impl_name) │
│ 150 │ │ │ try: │
│ 151 │ │ │ │ if hasattr(impl, "shutdown"): │
│ ❱ 152 │ │ │ │ │ await asyncio.wait_for(impl.shutdown(), timeout=5) │
│ 153 │ │ │ │ else: │
│ 154 │ │ │ │ │ logger.warning("No shutdown method for %s", impl_name) │
│ 155 │ │ │ except asyncio.TimeoutError: │
│ │
│ /usr/lib64/python3.11/asyncio/tasks.py:479 in wait_for │
│ │
│ 476 │ │ │ await waiter │
│ 477 │ │ except exceptions.CancelledError: │
│ 478 │ │ │ if fut.done(): │
│ ❱ 479 │ │ │ │ return fut.result() │
│ 480 │ │ │ else: │
│ 481 │ │ │ │ fut.remove_done_callback(cb) │
│ 482 │ │ │ │ # We must ensure that the task is not running │
│ │
│ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/routers/routing_tables.py:131 in shutdown │
│ │
│ 128 │ │ │ elif api == Api.tool_runtime: │
│ 129 │ │ │ │ p.tool_store = self │
│ 130 │ │
│ ❱ 131 │ async def shutdown(self) -> None: │
│ 132 │ │ for p in self.impls_by_provider_id.values(): │
│ 133 │ │ │ await p.shutdown() │
│ 134 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
CancelledError
INFO 2025-03-07 23:20:35,295 __main__:149 server: Shutting down DatasetIORouter
INFO 2025-03-07 23:20:35,296 __main__:149 server: Shutting down ScoringFunctionsRoutingTable
INFO 2025-03-07 23:20:35,297 __main__:149 server: Shutting down ScoringRouter
INFO 2025-03-07 23:20:35,298 __main__:149 server: Shutting down ModelsRoutingTable
INFO 2025-03-07 23:20:35,299 __main__:149 server: Shutting down InferenceRouter
INFO 2025-03-07 23:20:35,300 __main__:149 server: Shutting down ShieldsRoutingTable
INFO 2025-03-07 23:20:35,300 __main__:149 server: Shutting down SafetyRouter
INFO 2025-03-07 23:20:35,301 __main__:149 server: Shutting down VectorDBsRoutingTable
INFO 2025-03-07 23:20:35,302 __main__:149 server: Shutting down VectorIORouter
INFO 2025-03-07 23:20:35,303 __main__:149 server: Shutting down ToolGroupsRoutingTable
INFO 2025-03-07 23:20:35,304 __main__:149 server: Shutting down ToolRuntimeRouter
INFO 2025-03-07 23:20:35,304 __main__:149 server: Shutting down MetaReferenceAgentsImpl
INFO 2025-03-07 23:20:35,305 __main__:149 server: Shutting down TelemetryAdapter
INFO 2025-03-07 23:20:35,306 __main__:149 server: Shutting down TorchtunePostTrainingImpl
ERROR 2025-03-07 23:20:35,307 __main__:158 server: Failed to shutdown TorchtunePostTrainingImpl:
{CancelledError('Shutdown')}
╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
│ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:152 in shutdown │
│ │
│ 149 │ │ │ logger.info("Shutting down %s", impl_name) │
│ 150 │ │ │ try: │
│ 151 │ │ │ │ if hasattr(impl, "shutdown"): │
│ ❱ 152 │ │ │ │ │ await asyncio.wait_for(impl.shutdown(), timeout=5) │
│ 153 │ │ │ │ else: │
│ 154 │ │ │ │ │ logger.warning("No shutdown method for %s", impl_name) │
│ 155 │ │ │ except asyncio.TimeoutError: │
│ │
│ /usr/lib64/python3.11/asyncio/tasks.py:489 in wait_for │
│ │
│ 486 │ │ │ │ raise │
│ 487 │ │ │
│ 488 │ │ if fut.done(): │
│ ❱ 489 │ │ │ return fut.result() │
│ 490 │ │ else: │
│ 491 │ │ │ fut.remove_done_callback(cb) │
│ 492 │ │ │ # We must ensure that the task is not running │
│ │
│ /home/ec2-user/src/llama-stack/schedule/llama_stack/providers/inline/post_training/torchtune/post_training. │
│ py:48 in shutdown │
│ │
│ 45 │ │ self.checkpoints_dict = {} │
│ 46 │ │
│ 47 │ async def shutdown(self) -> None: │
│ ❱ 48 │ │ raise asyncio.CancelledError("Shutdown") │
│ 49 │ │
│ 50 │ async def supervised_fine_tune( │
│ 51 │ │ self, │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
CancelledError: Shutdown
INFO 2025-03-07 23:20:35,352 __main__:149 server: Shutting down BenchmarksRoutingTable
INFO 2025-03-07 23:20:35,353 __main__:149 server: Shutting down EvalRouter
INFO 2025-03-07 23:20:35,354 __main__:149 server: Shutting down DistributionInspectImpl
INFO 2025-03-07 23:20:35,355 __main__:177 server: Stopping event loop
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py", line 488, in <module>
main()
File "/home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py", line 476, in main
uvicorn.run(**uvicorn_config)
File "/home/ec2-user/src/llama-stack/schedule/venv/lib64/python3.11/site-packages/uvicorn/main.py", line 579, in run
server.run()
File "/home/ec2-user/src/llama-stack/schedule/venv/lib64/python3.11/site-packages/uvicorn/server.py", line 66, in run
return asyncio.run(self.serve(sockets=sockets))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.11/asyncio/runners.py", line 189, in run
with Runner(debug=debug) as runner:
File "/usr/lib64/python3.11/asyncio/runners.py", line 63, in __exit__
self.close()
File "/usr/lib64/python3.11/asyncio/runners.py", line 71, in close
_cancel_all_tasks(loop)
File "/usr/lib64/python3.11/asyncio/runners.py", line 201, in _cancel_all_tasks
loop.run_until_complete(tasks.gather(*to_cancel, return_exceptions=True))
File "/usr/lib64/python3.11/asyncio/base_events.py", line 652, in run_until_complete
raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.
++ error_handler 104
++ echo 'Error occurred in script at line: 104'
Error occurred in script at line: 104
++ exit 1
```
With all patches included, the shutdown now looks as follows:
```
$ kill -INT $(ps ax | grep llama_stack.distribution.server.server | grep -v nvim | awk -e '{print $1}' | sort | head -n 1)
```
```
20:56:09.308 [START]
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO 2025-03-10 20:56:43,961 __main__:140 server: Shutting down
INFO 2025-03-10 20:56:43,962 __main__:124 server: Shutting down DatasetsRoutingTable
INFO 2025-03-10 20:56:43,964 __main__:124 server: Shutting down DatasetIORouter
INFO 2025-03-10 20:56:43,965 __main__:124 server: Shutting down ScoringFunctionsRoutingTable
INFO 2025-03-10 20:56:43,966 __main__:124 server: Shutting down ScoringRouter
INFO 2025-03-10 20:56:43,967 __main__:124 server: Shutting down ModelsRoutingTable
INFO 2025-03-10 20:56:43,968 __main__:124 server: Shutting down InferenceRouter
INFO 2025-03-10 20:56:43,969 __main__:124 server: Shutting down ShieldsRoutingTable
INFO 2025-03-10 20:56:43,971 __main__:124 server: Shutting down SafetyRouter
INFO 2025-03-10 20:56:43,972 __main__:124 server: Shutting down VectorDBsRoutingTable
INFO 2025-03-10 20:56:43,973 __main__:124 server: Shutting down VectorIORouter
INFO 2025-03-10 20:56:43,974 __main__:124 server: Shutting down ToolGroupsRoutingTable
INFO 2025-03-10 20:56:43,975 __main__:124 server: Shutting down ToolRuntimeRouter
INFO 2025-03-10 20:56:43,976 __main__:124 server: Shutting down MetaReferenceAgentsImpl
INFO 2025-03-10 20:56:43,977 __main__:124 server: Shutting down TelemetryAdapter
INFO 2025-03-10 20:56:43,978 __main__:124 server: Shutting down TorchtunePostTrainingImpl
WARNING 2025-03-10 20:56:43,979 __main__:129 server: No shutdown method for TorchtunePostTrainingImpl
INFO 2025-03-10 20:56:43,979 __main__:124 server: Shutting down BenchmarksRoutingTable
INFO 2025-03-10 20:56:43,980 __main__:124 server: Shutting down EvalRouter
INFO 2025-03-10 20:56:43,981 __main__:124 server: Shutting down DistributionInspectImpl
INFO: Application shutdown complete.
INFO: Finished server process [33862]
```
[//]: # (## Documentation)
---------
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Reverts meta-llama/llama-stack#1252
The above PR breaks the following invocation:
```bash
llama stack run ~/.llama/distributions/together/together-run.yaml
```
# What does this PR do?
Users prefer to rely on the main CLI rather than invoking the server
through a Python module. Users interact with a high-level CLI rather
than needing to know internal module structures.
Now, when running `llama stack run <path-to-config>`, the server will
attempt to use the system package or a virtual environment if one is
active.
This also eliminates the current process dependency chain when running
from a virtual environment:
```
llama stack run
  -> start_env.sh
    -> python -m server...
```
Signed-off-by: Sébastien Han <seb@redhat.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Run:
```
ollama run llama3.2:3b-instruct-fp16 --keepalive=2m &
llama stack run ./llama_stack/templates/ollama/run.yaml --disable-ipv6
```
Notice that the server starts and shuts down normally.
[//]: # (## Documentation)
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
This disambiguates the "Image" term from the alternative "container image" usage
and allows for:
```python
if image_type == LlamaStackImagetype.venv:
...
```
comparisons rather than `ImageType.venv.value` accesses.
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
Changes enum use to comply with semantic python styling and naming
conventions.
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
The refactor was automated and small, so a simple run-through of creating
images was done.
Signed-off-by: James Kunstle <jkunstle@redhat.com>
Concurrent requests should not trample (or reuse) each other's provider
data. Provider data should be scoped to each request.
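For illustration only, request scoping of this kind is usually done with `contextvars` rather than a plain thread local; a minimal sketch with hypothetical names (not the actual provider-data API):
```python
import contextvars
from typing import Any, Optional

# Each asyncio task serving a request sees its own copy of this value, so
# concurrent requests cannot read or overwrite each other's provider data.
_provider_data: contextvars.ContextVar[Optional[dict]] = contextvars.ContextVar(
    "provider_data", default=None
)


def set_request_provider_data(headers: dict) -> None:
    # Hypothetical header name; real code would parse the provider-data header.
    _provider_data.set({"provider_data": headers.get("X-LlamaStack-Provider-Data")})


def get_request_provider_data() -> Optional[dict]:
    return _provider_data.get()
```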
## Test Plan
Set the uvicorn server to have a single worker process + thread by
updating the config:
```python
uvicorn_config = {
...
"workers": 1,
"loop": "asyncio",
}
```
Then perform the following steps on `origin/main` (without this change).
(1) Run the server using `llama stack run dev` without having
`FIREWORKS_API_KEY` in the environment.
(2) Run a test by specifying the FIREWORKS_API_KEY env var so it gets
stored in the thread local
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 \
--text-model accounts/fireworks/models/llama-v3p1-8b-instruct \
-k test_text_chat_completion_with_tool_calling_and_streaming \
--env FIREWORKS_API_KEY=<...>
```
Ensure you don't have any other API keys in the environment (otherwise
the bug will not reproduce due to other specifics in our testing code).
Verify this works.
(3) Run the same command again without specifying FIREWORKS_API_KEY. See
that the request actually succeeds when it *should have failed*.
----
Now do the same tests on this branch, verify step (3) results in
failure.
Finally, run the full `test_text_inference.py` test suite with this
change, verify it succeeds.
# What does this PR do?
This commit introduces a new logging system that allows loggers to be assigned
a category while retaining the logger name based on the file name. The log
format includes both the logger name and the category, producing output
like:
```
INFO 2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by
tavily-search
```
Key features include:
- Category-based logging: Loggers can be assigned a category (e.g.,
"core", "server") in code. The logger can be loaded like
this: `logger = get_logger(name=__name__, category="server")`
- Environment variable control: Log levels can be configured
per-category using the
`LLAMA_STACK_LOGGING` environment variable. For example:
`LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for
the "server"
and "core" categories.
- `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all
categories and
third-party libraries.
This provides fine-grained control over logging levels while maintaining
a clean and
informative log format.
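A minimal usage sketch (the import path is an assumption; the category name matches the example above):
```python
# Assumed import path for the new helper; adjust to wherever get_logger lives.
from llama_stack.log import get_logger

logger = get_logger(name=__name__, category="server")

logger.info("Starting up")
# Emitted only when LLAMA_STACK_LOGGING enables debug for "server" or "all".
logger.debug("Run configuration loaded")
```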
The formatter uses the rich library, which provides nice colors and better
stack traces, like so:
```
ERROR 2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown
task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at
/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146>
exception=UnboundLocalError("local variable 'loop' referenced before assignment")>
╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮
│ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown │
│ │
│ 175 │ │ except asyncio.CancelledError: │
│ 176 │ │ │ pass │
│ 177 │ │ finally: │
│ ❱ 178 │ │ │ loop.stop() │
│ 179 │ │
│ 180 │ loop = asyncio.get_running_loop() │
│ 181 │ loop.create_task(shutdown()) │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
UnboundLocalError: local variable 'loop' referenced before assignment
```
Co-authored-by: Ashwin Bharambe <@ashwinb>
Signed-off-by: Sébastien Han <seb@redhat.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml
INFO 2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml
INFO 2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration:
INFO 2025-03-03 21:55:35,928 __main__:380 [server]: apis:
- agents
```
[//]: # (## Documentation)
---------
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
The commit addresses the Ruff warning B008 by refactoring the code to
avoid calling `SamplingParams()` directly in function argument defaults.
Instead, it either uses `Field(default_factory=SamplingParams)` for
Pydantic models or sets the default to `None` and instantiates
`SamplingParams` inside the function body when the argument is `None`.
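For reference, the two patterns look roughly like this (a generic sketch, not the exact signatures touched by the commit):
```python
from typing import Optional

from pydantic import BaseModel, Field


class SamplingParams(BaseModel):
    temperature: float = 0.0  # placeholder field for the sketch


# Pydantic model field: use a default_factory rather than a SamplingParams()
# call in the default (B008 flags such calls in function-argument defaults).
class ChatCompletionRequest(BaseModel):
    sampling_params: SamplingParams = Field(default_factory=SamplingParams)


# Plain function: default to None and build the object inside the body.
async def chat_completion(sampling_params: Optional[SamplingParams] = None) -> None:
    if sampling_params is None:
        sampling_params = SamplingParams()
    ...
```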
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The method `dict` in class `BaseModel` is deprecated; we should use
`model_dump` instead.
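A one-line illustration of the change:
```python
from pydantic import BaseModel


class Item(BaseModel):
    name: str


item = Item(name="example")
data = item.model_dump()  # Pydantic v2 replacement for the deprecated item.dict()
```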
Signed-off-by: Sébastien Han <seb@redhat.com>
You now run the integration tests with these options:
```bash
Custom options:
--stack-config=STACK_CONFIG
a 'pointer' to the stack. this can be either be:
(a) a template name like `fireworks`, or
(b) a path to a run.yaml file, or
(c) an adhoc config spec, e.g.
`inference=fireworks,safety=llama-guard,agents=meta-
reference`
--env=ENV Set environment variables, e.g. --env KEY=value
--text-model=TEXT_MODEL
comma-separated list of text models. Fixture name:
text_model_id
--vision-model=VISION_MODEL
comma-separated list of vision models. Fixture name:
vision_model_id
--embedding-model=EMBEDDING_MODEL
comma-separated list of embedding models. Fixture name:
embedding_model_id
--safety-shield=SAFETY_SHIELD
comma-separated list of safety shields. Fixture name:
shield_id
--judge-model=JUDGE_MODEL
comma-separated list of judge models. Fixture name:
judge_model_id
--embedding-dimension=EMBEDDING_DIMENSION
Output dimensionality of the embedding model to use for
testing. Default: 384
--record-responses Record new API responses instead of using cached ones.
--report=REPORT Path where the test report should be written, e.g.
--report=/path/to/report.md
```
Importantly, if you don't specify any of the models (text-model,
vision-model, etc.) the relevant tests will get **skipped!**
This will make running tests somewhat more annoying since all options
will need to be specified. We will make this easier by adding some easy
wrapper yaml configs.
## Test Plan
Example:
```bash
ashwin@ashwin-mbp ~/local/llama-stack/tests/integration (unify_tests) $
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/test_text_inference.py \
--text-model meta-llama/Llama-3.2-3B-Instruct
```
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
- The old PR used `BUILDS_BASE_DIR` in
`llama_stack/cli/stack/configure.py` (since removed):
https://github.com/meta-llama/llama-stack/pull/371/files
- Based on the current `build` code, it should only use
`DISTRIBS_BASE_DIR` to save it:
46b0a404e8/llama_stack/cli/stack/_build.py (L298)
46b0a404e8/llama_stack/cli/stack/_build.py (L301)
Please correct me if I understand incorrectly.
So there should be no need to use it in `run` now.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
Some imports were not switched to the in-tree copy of the modules.
This is a follow-up to:
https://github.com/meta-llama/llama-stack/pull/1344

Closes #1435
## Test Plan
Manually started the server...
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
The inference router computes token-usage metrics for all providers, returns
the metrics as part of the response, and also logs them to telemetry.
## Test Plan
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
```
```
curl --request POST \
--url http://localhost:8321/v1/inference/chat-completion \
--header 'content-type: application/json' \
--data '{
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"messages": [
{
"role": "user",
"content": {
"type": "text",
"text": "where do humans live"
}
}
],
"stream": false
}' | jq .
{
"metrics": [
{
"trace_id": "yjv1tf0jS1evOyPm",
"span_id": "WqYKvg0_",
"timestamp": "2025-02-27T18:55:10.770903Z",
"attributes": {
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"provider_id": "fireworks"
},
"type": "metric",
"metric": "prompt_tokens",
"value": 10,
"unit": "tokens"
},
{
"trace_id": "yjv1tf0jS1evOyPm",
"span_id": "WqYKvg0_",
"timestamp": "2025-02-27T18:55:10.770916Z",
"attributes": {
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"provider_id": "fireworks"
},
"type": "metric",
"metric": "completion_tokens",
"value": 411,
"unit": "tokens"
},
{
"trace_id": "yjv1tf0jS1evOyPm",
"span_id": "WqYKvg0_",
"timestamp": "2025-02-27T18:55:10.770919Z",
"attributes": {
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"provider_id": "fireworks"
},
"type": "metric",
"metric": "total_tokens",
"value": 421,
"unit": "tokens"
}
],
"completion_message": {
"role": "assistant",
"content": "Humans live in various parts of the world, inhabiting almost every continent, country, and region. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica (research stations only)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Regions:** Humans live in diverse regions, including:\n\t* Deserts (e.g., Sahara, Mojave)\n\t* Forests (e.g., Amazon, Congo)\n\t* Grasslands (e.g., Prairies, Steppes)\n\t* Mountains (e.g., Himalayas, Andes)\n\t* Oceans (e.g., coastal areas, islands)\n\t* Tundras (e.g., Arctic, sub-Arctic)\n4. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near:\n\t* Coastlines\n\t* Rivers\n\t* Lakes\n\t* Mountains\n5. **Rural areas:** Some humans live in rural areas, such as:\n\t* Villages\n\t* Farms\n\t* Countryside\n6. **Islands:** Humans inhabit many islands, including:\n\t* Tropical islands (e.g., Hawaii, Maldives)\n\t* Arctic islands (e.g., Greenland, Iceland)\n\t* Continental islands (e.g., Great Britain, Ireland)\n7. **Extreme environments:** Humans also live in extreme environments, such as:\n\t* High-altitude areas (e.g., Tibet, Andes)\n\t* Low-altitude areas (e.g., Death Valley, Dead Sea)\n\t* Areas with extreme temperatures (e.g., Arctic, Sahara)\n\nOverall, humans have adapted to live in a wide range of environments and ecosystems around the world.",
"stop_reason": "end_of_turn",
"tool_calls": []
},
"logprobs": null
}
```
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/inference
======================================================================== short test summary info =========================================================================
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B:vis=11B-inference:chat_completion:tool_calling_tools_absent-True] - ValueError: Unsupported tool prompt format: ToolPromptFormat.json
FAILED tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B:vis=11B-inference:chat_completion:tool_calling_tools_absent-False] - ValueError: Unsupported tool prompt format: ToolPromptFormat.json
FAILED tests/integration/inference/test_vision_inference.py::test_image_chat_completion_non_streaming[txt=8B:vis=11B] - fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'Failed to decode image cannot identify image f...
FAILED tests/integration/inference/test_vision_inference.py::test_image_chat_completion_streaming[txt=8B:vis=11B] - fireworks.client.error.InvalidRequestError: {'error': {'object': 'error', 'type': 'invalid_request_error', 'message': 'Failed to decode image cannot identify image f...
========================================================= 4 failed, 16 passed, 23 xfailed, 17 warnings in 44.36s =========================================================
```
# What does this PR do?
When going through READMEs, I found that I had to keep editing the code
blocks since they were prefixed with `$ `. A common pattern is to triple
click (highlight all) a block and then copy paste. This minor change
will make this easier for folks to follow the READMEs.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
N/A
[//]: # (## Documentation)
# What does this PR do?
The agent API allows querying multiple DBs using the `vector_db_ids`
argument of the `rag` tool:
```py
toolgroups=[
{
"name": "builtin::rag",
"args": {"vector_db_ids": [vector_db_id]},
}
],
```
This means that multiple DBs can be used to compose an aggregated
context by executing the query on each of them.
When documents are passed to the next agent turn, there is no explicit
way to configure the vector DB where the embeddings will be ingested. In
such cases, we can assume that:
- if any `vector_db_ids` is given, we use the first one (it probably
makes sense to assume that it's the only one in the list, otherwise we
should loop on all the given DBs to have a consistent ingestion)
- if no `vector_db_ids` is given, we can use the current logic to
generate a default DB using the default provider. If multiple providers
are defined, the API will fail as expected: the user has to provide
details on where to ingest the documents (see the sketch below).
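The selection described above amounts to roughly the following (illustrative names, not the actual agent code):
```python
from typing import Callable, List, Optional


def select_vector_db_id(
    vector_db_ids: Optional[List[str]],
    make_default_db: Callable[[], str],
) -> str:
    # Prefer the first explicitly configured DB; otherwise fall back to the
    # existing default-DB creation logic (which fails when multiple vector_io
    # providers are configured, as described above).
    if vector_db_ids:
        return vector_db_ids[0]
    return make_default_db()
```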
(Closes #1270)
## Test Plan
The issue description details how to replicate the problem.
[//]: # (## Documentation)
---------
Signed-off-by: Daniele Martinoli <dmartino@redhat.com>
All of the tests from `llama_stack/providers/tests/` are now moved to
`tests/integration`.
I converted the `tools`, `scoring` and `datasetio` tests to use the API.
However, `eval` and `post_training` proved to be a bit challenging, so I am
leaving those. I think `post_training` should be relatively
straightforward also.
As part of this, I noticed that the `wolfram_alpha` tool wasn't added to
some of our commonly used distros, so I added it. I am going to remove a
lot of code duplication from distros next, so while this looks like a
one-off right now, it will go away and be there uniformly for all
distros.
# What does this PR do?
- This was missed from previous deprecation:
https://github.com/meta-llama/llama-stack/pull/1186
- Part of https://github.com/meta-llama/llama-stack/issues/1396
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
pytest -v -s --nbval-lax ./llama-stack/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
[//]: # (## Documentation)
# What does this PR do?
- Modularized `resolve_impls` by extracting helper functions for
validation, sorting, and instantiation.
- Improved readability by introducing `validate_and_prepare_providers`,
`sort_providers_by_dependency`, and `instantiate_providers`.
- Enhanced type safety with explicit type hints (`Tuple`, `Dict`, `Set`,
etc.).
- Fixed potential issues with provider module imports and added error
handling.
- Updated `pyproject.toml` to enforce type checking on `resolver.py`
using `mypy`.
Signed-off-by: Sébastien Han <seb@redhat.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Run the server.
[//]: # (## Documentation)
Signed-off-by: Sébastien Han <seb@redhat.com>
A self-respecting server needs good observability which starts with
configurable logging. Llama Stack had little until now. This PR adds a
`logcat` facility towards that. Callsites look like:
```python
logcat.debug("inference", f"params to ollama: {params}")
```
- the first parameter is a category. there is a static list of
categories in `llama_stack/logcat.py`
- each category can be associated with a log-level which can be
configured via the `LLAMA_STACK_LOGGING` env var.
- a value `LLAMA_STACK_LOGGING="inference=debug;server=info"` does the
obvious thing. there is a special key called `all` which is an alias for
all categories (see the parsing sketch below)
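A minimal sketch of how such a value can be parsed into per-category levels (illustrative, not the exact `logcat` implementation; the category list is a placeholder):
```python
import logging
import os

CATEGORIES = ["core", "server", "inference"]  # placeholder; see llama_stack/logcat.py


def parse_logging_env(value: str) -> dict:
    """Parse e.g. 'inference=debug;server=info' or 'all=debug' into levels."""
    levels = {}
    for pair in value.split(";"):
        category, _, level = pair.strip().partition("=")
        if not category or not level:
            continue
        numeric = logging.getLevelName(level.upper())  # "DEBUG" -> 10, etc.
        targets = CATEGORIES if category == "all" else [category]
        for target in targets:
            levels[target] = numeric
    return levels


levels = parse_logging_env(os.environ.get("LLAMA_STACK_LOGGING", ""))
```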
## Test Plan
Ran with `LLAMA_STACK_LOGGING="all=debug" llama stack run fireworks` and
saw the following:

Hit it with a client-sdk test case and saw this:

# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
Sorry for the https://github.com/meta-llama/llama-stack/pull/1340 logic;
it will cause an issue in a `non-container` env.
```
Using conda <<<<<<<------ environment: stack
+ is_command_available docker
+ command -v docker
+ printf '\033[0;31mError: docker command not found. Is docker installed and in your PATH?\033[0m'
Error: docker command not found. Is docker installed and in your PATH?+ exit 1
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
`start_venv.sh` lifecycle should be:
025f615868 >> 34e3faa4e8 >> 4684fd3f8d,
finally replaced by `start_stack.sh`.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
We want to bundle a bunch of (typically remote) providers in a distro
template and be able to configure them "on the fly" via environment
variables. So far, we have been able to do this with simple env var
replacements. However, sometimes you want to only conditionally enable
providers (because the relevant remote services may not be alive, or
relevant). This was not possible until now.
To aid this, we add a simple (bash-like) env var replacement
enhancement: `${env.FOO+bar}` evaluates to `bar` if the variable is SET
and evaluates to empty string if it is not. On top of that, we update
our main resolver to ignore any provider whose ID is null.
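A sketch of the new substitution rule (simplified; only the `+` form is shown, and the regex is illustrative):
```python
import os
import re

_ENV_PLUS = re.compile(r"\$\{env\.(\w+)\+([^}]*)\}")


def replace_env_plus(text: str) -> str:
    """Expand ${env.FOO+bar} to 'bar' when FOO is set, and to '' otherwise."""
    def _sub(match: re.Match) -> str:
        name, value_if_set = match.group(1), match.group(2)
        return value_if_set if name in os.environ else ""

    return _ENV_PLUS.sub(_sub, text)


# With ENABLE_CHROMADB=1 in the environment:
#   "provider_id: ${env.ENABLE_CHROMADB+chromadb}" -> "provider_id: chromadb"
# Without it, the provider ID becomes empty/null and the resolver skips it.
```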
This allows using the distro like this:
```bash
llama stack run dev --env CHROMADB_URL=http://localhost:6001 --env ENABLE_CHROMADB=1
```
when only Chroma is UP. This disables the other `pgvector` provider in
the run configuration.
## Test Plan
Hard code `chromadb` as the vector io provider inside
`test_vector_io.py` and run:
```bash
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -s -v tests/client-sdk/vector_io/ --embedding-model all-MiniLM-L6-v2
```
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# Summary:
This led to extremely hard-to-debug messages.
Before:
llama_stack/distribution/library_client.py:275: in request
response = await self._call_non_streaming(
llama_stack/distribution/library_client.py:322: in _call_non_streaming
result = await matched_func(**body)
llama_stack/providers/utils/telemetry/trace_protocol.py:102: in
async_wrapper
result = await method(self, *args, **kwargs)
llama_stack/providers/inline/agents/meta_reference/agents.py:80: in
create_agent
value=agent_config.model_dump_json(),
E AttributeError: 'dict' object has no attribute 'model_dump_json'
After:
E ValueError: Failed to convert parameter {'model':
'meta-llama/Llama-3.1-8B-Instruct', 'instructions': 'You are a helpful
assistant', 'sampling_params': {'strategy': {'type': 'top_p',
'temperature': 0.0001, 'top_p': 0.9}}, 'toolgroups': [{'name':
'builtin::rag'}], 'input_shields': ['meta-llama/Llama-Guard-3-8B'],
'output_shields': ['meta-llama/Llama-Guard-3-8B'],
'enable_session_persistence': False} into <class
'llama_stack.apis.agents.agents.AgentConfig'>: 2 validation errors for
AgentConfig
E toolgroups.0.str
E Input should be a valid string [type=string_type, input_value={'name':
'builtin::rag'}, input_type=dict]
E For further information visit
https://errors.pydantic.dev/2.10/v/string_type
E toolgroups.0.AgentToolGroupWithArgs.args
E Field required [type=missing, input_value={'name': 'builtin::rag'},
input_type=dict]
E For further information visit
https://errors.pydantic.dev/2.10/v/missing
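The improvement boils down to wrapping the conversion so the offending value and target type end up in the error; a simplified sketch (not the exact `library_client` code):
```python
from typing import Any

from pydantic import TypeAdapter, ValidationError


def convert_param(annotation: Any, value: Any) -> Any:
    """Convert a plain dict/list from the request body into the annotated type."""
    try:
        return TypeAdapter(annotation).validate_python(value)
    except ValidationError as e:
        # Fail here with the parameter and target class in the message instead
        # of later with an opaque AttributeError deep inside the provider.
        raise ValueError(f"Failed to convert parameter {value} into {annotation}: {e}") from e
```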
# Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/
--safety-shield meta-llama/Llama-Guard-3-8B
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
Check the conda env name using the base path in exec.py.
The current logic for finding the conda prefix does an `endswith` check with
just the conda env name, but this will cause us to match incorrectly if
there is a different conda env that ends with the same suffix. In my case,
I had `stack` and `llama-stack` as the two conda envs.
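A minimal sketch of the base-path comparison (the function name is illustrative):
```python
import os


def matches_conda_env(env_name: str) -> bool:
    """True only when the active conda env is named exactly env_name."""
    conda_prefix = os.environ.get("CONDA_PREFIX", "")
    # endswith(env_name) would also accept ".../envs/llama-stack" when
    # env_name is "stack"; comparing the basename avoids that.
    return os.path.basename(conda_prefix) == env_name
```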
## Test Plan
llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
# What does this PR do?
The tool format depends on the model. @ehhuang introduced a
`get_default_tool_prompt_format` function for this purpose. We should
use that instead of the hacky model-ID matching we had before.
Secondly, non-Llama models don't have this concept, so testing with those
models should work as-is.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```bash
for distro in fireworks ollama; do
LLAMA_STACK_CONFIG=$distro \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=meta-llama/Llama-3.2-3B-Instruct \
--vision-inference-model=""
done
LLAMA_STACK_CONFIG=dev \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=openai/gpt-4o \
--vision-inference-model=""
```
[//]: # (## Documentation)
Summary:
Lets the model decide which tool it needs to call to respond to a query.
Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/ --safety-shield meta-llama/Llama-Guard-3-8B
```
Also evaluated on a small benchmark with 20 questions from HotpotQA.
With this PR and some prompting, the performance is 77% recall compared
to 50% currently.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1015).
* #1268
* #1239
* __->__ #1015
# What does this PR do?
- Introduced logging in `StackRun` to replace print-based messages
- Improved error handling for config file loading and parsing
- Replaced `cprint` with `logger.error` for consistent error messaging
- Ensured logging is used in `server.py` for startup, shutdown, and
runtime messages
- Added missing exception handling for invalid providers
Signed-off-by: Sébastien Han <seb@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Currently, `build_venv.sh` expects a `distribution_type` as the first
argument, but the only things ever passed are:
1. image name
2. pip dependencies
So `distribution_type` is never passed in, meaning the script errors when
calling something like:
`llama stack build --image-type venv --template ollama --image-name
test`
before output:
```
llama stack build --image-type venv --template ollama --image-name venv-test
Usage: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/distribution/build_venv.sh <distribution_type> <env_name> <pip_dependencies> [<special_pip_deps>]
Example: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/distribution/build_venv.sh <distribution_type> mybuild ./my-stack-build.yaml 'numpy pandas scipy'
Failed to build target venv-test with return code 1
Run config path is empty
```
after:
```
llama stack build --image-type venv --template ollama --image-name venv-test
Environment 'venv-test' already exists, re-using it.
Using virtual environment venv-test
Using CPython 3.13.0 interpreter at: /opt/homebrew/opt/python@3.13/bin/python3.13
Creating virtual environment at: venv-test
Activate with: source venv-test/bin/activate
Using Python 3.13.0 environment at: venv-test
Resolved 55 packages in 640ms
Built fire==0.7.0
Prepared 54 packages in 1.14s
Installed 55 packages in 82ms
+ annotated-types==0.7.0
```
## Test Plan
ran locally with output above
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Now that llama stack supports running in venv, conda, and container
modes and the three scripts overlap a lot, combine them into one
`start_stack.sh` script.
## Test Plan
tested this locally on venv, conda, and container
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
Summary:
Currently we don't set the best `tool_prompt_format` according to the model as
promised.
Test Plan:
Added a print around the raw model input and inspected it manually.
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1214).
* #1234
* __->__ #1214