llama-stack

forked from phoenix-oss/llama-stack-mirror

Author	SHA1	Message	Date
Dinesh Yeduguru	86f617a197	fix: tracing middleware to not start for lifespan events (#1730 ) # What does this PR do? Tracing middleware should not start tracing for lifespan events. Lifespan event happens at server startup and shutdown and if we start tracing for them, we will have an active trace for the lifetime of the server, which messes up with regular tracing since we always expect the traces to be never nested. We started hitting this issue since https://github.com/meta-llama/llama-stack/pull/1495. ## Test Plan * llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml * Verify in sqlite store that the trace now has non null span id ![Screenshot 2025-03-20 at 1 49 47 PM](https://github.com/user-attachments/assets/d77354a7-d5f1-4b53-a946-6adbd7a4f772)	2025-03-20 14:22:19 -07:00
Ashwin Bharambe	01a25d9744	feat(server): add attribute based access control for resources (#1703 ) This PR introduces a way to implement Attribute Based Access Control (ABAC) for the Llama Stack server. The rough design is: - https://github.com/meta-llama/llama-stack/pull/1626 added a way for the Llama Stack server to query an authenticator - We build upon that and expect "access attributes" as part of the response. These attributes indicate the scopes available for the request. - We use these attributes to perform access control for registered resources as well as for constructing the default access control policies for newly created resources. - By default, if you support authentication but don't return access attributes, we will add a unique namespace pointing to the API_KEY. That way, all resources by default will be scoped to API_KEYs. An important aspect of this design is that Llama Stack stays out of the business of credential management or the CRUD for attributes. How you manage your namespaces or projects is entirely up to you. The design only implements access control checks for the metadata / book-keeping information that the Stack tracks. ### Limitations - Currently, read vs. write vs. admin permissions aren't made explicit, but this can be easily extended by adding appropriate attributes to the `AccessAttributes` data structure. - This design does not apply to agent instances since they are not considered resources the Stack knows about. Agent instances are completely within the scope of the Agents API provider. ### Test Plan Added unit tests, existing integration tests	2025-03-19 21:28:52 -07:00
Charlie Doern	1f04ca357b	fix: telemetry logger (#1714 ) # What does this PR do? currently if you have a run yaml without temeletry the following error is hit: TypeError: TelemetryAdapter.__init__() missing 1 required positional argument: 'deps' this is because the TelemetryAdapter requires a deps arg to be passed. Pass {} to avoid errors. Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-03-19 20:26:13 -07:00
Ashwin Bharambe	5b39d5a76a	feat(auth, rfc): Add support for Bearer (api_key) Authentication (#1626 ) This PR adds support (or is a proposal for) for supporting API KEY authentication on the Llama Stack server end. `llama-stack-client` already supports accepting an api_key parameter and passes it down through every request as an `Authentication: ` header. Currently, Llama Stack does not propose APIs for handling authentication or authorization for resources of any kind. Given that, and the fact that any deployment will typically have _some_ authentication system present, we simply adopt a delegation mechanism: delegate to an HTTPS endpoint performing key management / authentication. It is configured via: ```yaml server: auth: endpoint: <...> ``` in the run.yaml configuration. ## How It Works When authentication is enabled: 1. Every API request must include an `Authorization: Bearer <token>` header 2. The server will send a _POST_ validation request to the configured endpoint with the following payload: ```json { "api_key": "<token>", "request": { "path": "/api/path", "headers": { "header1": "value1", ... }, "params": { "param1": "value1", ... } } } ``` 3. If the authentication endpoint returns a 200 status code, the request is allowed to proceed 4. If the authentication endpoint returns any other status code, a 401 Unauthorized response is returned ## Test Plan Unit tests	2025-03-18 16:24:18 -07:00
Charlie Doern	78d4872c0c	feat: add support for logging config in the run.yaml (#1408 ) # What does this PR do? a user should be able to store a static logging configuration outside of their environment. This would make sense to store in the run yaml given that we store other things like server configuration in there. The environment variable settings override the config settings if both are available. The format in the config looks like this: ``` logging_config: category_levels: VALID_CATEGORY: VALID_STRING_LOG_LEVEL ``` any specified category out of the following: `core \| server \| router \| inference \| agents \| safety \| eval \| tools \| client` combined with any of the following log levels: `debug \| info \| warning \| error \| critical` can be placed in the category_levels list in order to achieve the desired log level ## Test Plan Test locally with a run config like the following: ``` version: '2' image_name: ollama logging_config: category_levels: server: debug apis: ... ``` Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-03-14 12:36:25 -07:00
Charlie Doern	a062723d03	feat: add provider API for listing and inspecting provider info (#1429 ) # What does this PR do? currently the `inspect` API for providers is really a `list` API. Create a new `providers` API which has a GET `providers/{provider_id}` inspect API which returns "user friendly" configuration to the end user. Also add a GET `/providers` endpoint which returns the list of providers as `inspect/providers` does today. This API follows CRUD and is more intuitive/RESTful. This work is part of the RFC at https://github.com/meta-llama/llama-stack/pull/1359 sensitive fields are redacted using `redact_sensetive_fields` on the server side before returning a response: <img width="456" alt="Screenshot 2025-03-13 at 4 40 21 PM" src="https://github.com/user-attachments/assets/9465c221-2a26-42f8-a08a-6ac4a9fecce8" /> ## Test Plan using https://github.com/meta-llama/llama-stack-client-python/pull/181 a user is able to to run the following: `llama stack build --template ollama --image-type venv` `llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml` `llama-stack-client providers inspect ollama` <img width="378" alt="Screenshot 2025-03-13 at 4 39 35 PM" src="https://github.com/user-attachments/assets/8273d05d-8bc3-44c6-9e4b-ef95e48d5466" /> also, was able to run the new test_list integration test locally with ollama: <img width="1509" alt="Screenshot 2025-03-13 at 11 03 40 AM" src="https://github.com/user-attachments/assets/9b9db166-f02f-45b0-86a4-306d85149bc8" /> Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-03-13 15:07:21 -07:00
Dinesh Yeduguru	58d08d100e	feat: Add back inference metrics and preserve context variables across asyncio boundary (#1552 ) # What does this PR do? This PR adds back the changes in #1300 which were reverted in #1476 . It also adds logic to preserve context variables across asyncio boundary. this is needed with the library client since the async generator logic yields control to code outside the event loop, and on resuming, does not have the same context as before and this requires preserving the context vars. address #1477 ## Test Plan ``` curl --request POST \ --url http://localhost:8321/v1/inference/chat-completion \ --header 'content-type: application/json' \ --data '{ "model_id": "meta-llama/Llama-3.1-70B-Instruct", "messages": [ { "role": "user", "content": { "type": "text", "text": "where do humans live" } } ], "stream": false }' \| jq . { "metrics": [ { "trace_id": "kCZwO3tyQC-FuAGb", "span_id": "bsP_5a5O", "timestamp": "2025-03-11T16:47:38.549084Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "prompt_tokens", "value": 10, "unit": "tokens" }, { "trace_id": "kCZwO3tyQC-FuAGb", "span_id": "bsP_5a5O", "timestamp": "2025-03-11T16:47:38.549449Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "completion_tokens", "value": 369, "unit": "tokens" }, { "trace_id": "kCZwO3tyQC-FuAGb", "span_id": "bsP_5a5O", "timestamp": "2025-03-11T16:47:38.549457Z", "attributes": { "model_id": "meta-llama/Llama-3.1-70B-Instruct", "provider_id": "fireworks" }, "type": "metric", "metric": "total_tokens", "value": 379, "unit": "tokens" } ], "completion_message": { "role": "assistant", "content": "Humans live on the planet Earth, specifically on its landmasses and in its oceans. Here's a breakdown of where humans live:\n\n1. Continents: Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica ( temporary residents, mostly scientists and researchers)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. Countries: There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. Cities and towns: Many humans live in urban areas, such as cities and towns, which are often located near coastlines, rivers, or other bodies of water.\n4. Rural areas: Some humans live in rural areas, such as villages, farms, and countryside.\n5. Islands: Humans inhabit many islands around the world, including those in the Pacific, Indian, and Atlantic Oceans.\n6. Mountains and highlands: Humans live in mountainous regions, such as the Himalayas, the Andes, and the Rocky Mountains.\n7. Deserts: Some humans live in desert regions, such as the Sahara, the Mojave, and the Atacama.\n8. Coastal areas: Many humans live in coastal areas, such as beaches, ports, and coastal cities.\n9. Underwater habitats: A few humans live in underwater habitats, such as research stations and submarines.\n10. Space: A small number of humans have lived in space, including astronauts on the International Space Station and those who have visited the Moon.\n\nOverall, humans can be found living in almost every environment on Earth, from the frozen tundra to the hottest deserts, and from the highest mountains to the deepest oceans.", "stop_reason": "end_of_turn", "tool_calls": [] }, "logprobs": null } ``` Orignal repro no longer showing any error: ``` LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml python -m examples.agents.e2e_loop_with_client_tools localhost 8321 ``` client logs: https://gist.github.com/dineshyv/047c7e87b18a5792aa660e311ea53166 server logs: https://gist.github.com/dineshyv/97a2174099619e9916c7c490be26e559	2025-03-12 12:01:03 -07:00
Charlie Doern	4eee349acd	fix: respect log_level in uvicorn and third party libs (#1524 ) # What does this PR do? uvicorn has a `log_level` arg in uvicorn.run, pass in the effective level set by the logger. Additionally, third party libraries like httpx are using our logging format, but not honoring our log level. This seems unintended, so loop through all items in the loggerDict and apply the same log level as what we have set. ## Test Plan before: ``` llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml Environment variable LLAMA_STACK_LOGGING found: all=warn Using virtual environment: /Users/charliedoern/projects/Documents/llama-stack/venv + python -m llama_stack.distribution.server.server --yaml-config /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml --port 8321 Environment variable LLAMA_STACK_LOGGING found: all=warn WARNING 2025-03-10 16:05:49,706 root:71 uncategorized: Warning: `bwrap` is not available. Code interpreter tool will not work correctly. INFO 2025-03-10 16:05:49,916 datasets:54 uncategorized: PyTorch version 2.5.1 available. INFO 2025-03-10 16:05:50,010 httpx:1740 uncategorized: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK" INFO 2025-03-10 16:05:50,297 httpx:1740 uncategorized: HTTP Request: POST http://localhost:11434/api/pull "HTTP/1.1 200 OK" INFO 2025-03-10 16:05:50,314 httpx:1740 uncategorized: HTTP Request: GET http://localhost:11434/api/tags "HTTP/1.1 200 OK" INFO: Started server process [89663] INFO: Waiting for application startup. INFO: ASGI 'lifespan' protocol appears unsupported. INFO: Application startup complete. INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit) ``` after: ``` llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml Environment variable LLAMA_STACK_LOGGING found: all=warn Using virtual environment: /Users/charliedoern/projects/Documents/llama-stack/venv + python -m llama_stack.distribution.server.server --yaml-config /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml --port 8321 Environment variable LLAMA_STACK_LOGGING found: all=warn WARNING 2025-03-10 16:05:20,429 root:71 uncategorized: Warning: `bwrap` is not available. Code interpreter tool will not work correctly. INFO 2025-03-10 16:05:20,639 datasets:54 uncategorized: PyTorch version 2.5.1 available. ``` Signed-off-by: Charlie Doern <cdoern@redhat.com>	2025-03-12 11:07:28 -07:00
Ihar Hrachyshka	aca82df7ed	fix: Multiple fixes for server shutdown (fix lifespan handling; fix handling CancelledError when raised by provider; let uvicorn handle signals) (#1495 ) # What does this PR do? If implementation raises CancelledError (e.g. when it runs its own async loop for jobs), the main server shutdown handler gets confused and doesn't attempt to shut down the main loop tasks. While at it, also fixing the following failure when this happens: ``` UnboundLocalError: cannot access local variable 'loop' where it is not associated with a value ``` Shutdown handlers were not running because lifespan logic was broken since ~Oct 2024. Fixed that too and enforcing `lifespan` now (making sure server will crash when it fails to interact with app through middleware). [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Spotted while working on https://github.com/meta-llama/llama-stack/pull/1437 One way to trigger it without the PR above is to add `raise CancelledError` in any of the running providers' `shutdown` methods; then `kill -INT <pid>` the server process. Validated this with the following test patch: ``` diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py index b85c463a..10dad83e 100644 --- a/llama_stack/distribution/server/server.py +++ b/llama_stack/distribution/server/server.py @@ -174,6 +174,7 @@ def handle_signal(app, signum, _) -> None: except asyncio.CancelledError: pass finally: + logger.info("Stopping event loop") loop.stop() loop = asyncio.get_running_loop() diff --git a/llama_stack/providers/inline/post_training/torchtune/post_training.py b/llama_stack/providers/inline/post_training/torchtune/post_training.py index b837362d..163f43d8 100644 --- a/llama_stack/providers/inline/post_training/torchtune/post_training.py +++ b/llama_stack/providers/inline/post_training/torchtune/post_training.py @@ -3,6 +3,7 @@ # # This source code is licensed under the terms described in the LICENSE file in # the root directory of this source tree. +import asyncio from datetime import datetime from typing import Any, Dict, Optional @@ -43,6 +44,9 @@ class TorchtunePostTrainingImpl: self.jobs = {} self.checkpoints_dict = {} + async def shutdown(self) -> None: + raise asyncio.CancelledError("Shutdown") + async def supervised_fine_tune( self, job_uuid: str, ``` Without the fix: ``` INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit) INFO: Shutting down INFO: Finished server process [52099] INFO 2025-03-07 23:25:33,548 __main__:143 server: Received signal SIGINT (2). Exiting gracefully... INFO 2025-03-07 23:25:33,550 __main__:150 server: Shutting down DatasetsRoutingTable INFO 2025-03-07 23:25:33,551 __main__:177 server: Stopping event loop ERROR 2025-03-07 23:25:33,552 asyncio:1785 uncategorized: unhandled exception during asyncio.run() shutdown task: <Task finished name='Task-12' coro=<handle_signal.<locals>.shutdown() done, defined at /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:145> exception=UnboundLocalError("cannot access local variable 'loop' where it is not associated with a value")> ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮ │ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:178 in shutdown │ │ │ │ 175 │ │ │ pass │ │ 176 │ │ finally: │ │ 177 │ │ │ logger.info("Stopping event loop") │ │ ❱ 178 │ │ │ loop.stop() │ │ 179 │ │ │ 180 │ loop = asyncio.get_running_loop() │ │ 181 │ loop.create_task(shutdown()) │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ UnboundLocalError: cannot access local variable 'loop' where it is not associated with a value ``` With the fix, now seeing the following messages when the server is killed: ``` INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit) INFO: Shutting down INFO: Finished server process [50836] INFO 2025-03-07 23:20:35,182 __main__:143 server: Received signal SIGINT (2). Exiting gracefully... INFO 2025-03-07 23:20:35,184 __main__:149 server: Shutting down DatasetsRoutingTable ERROR 2025-03-07 23:20:35,185 __main__:158 server: Failed to shutdown DatasetsRoutingTable: {CancelledError()} ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮ │ /usr/lib64/python3.11/asyncio/tasks.py:476 in wait_for │ │ │ │ 473 │ try: │ │ 474 │ │ # wait until the future completes or the timeout │ │ 475 │ │ try: │ │ ❱ 476 │ │ │ await waiter │ │ 477 │ │ except exceptions.CancelledError: │ │ 478 │ │ │ if fut.done(): │ │ 479 │ │ │ │ return fut.result() │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ CancelledError During handling of the above exception, another exception occurred: ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮ │ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:152 in shutdown │ │ │ │ 149 │ │ │ logger.info("Shutting down %s", impl_name) │ │ 150 │ │ │ try: │ │ 151 │ │ │ │ if hasattr(impl, "shutdown"): │ │ ❱ 152 │ │ │ │ │ await asyncio.wait_for(impl.shutdown(), timeout=5) │ │ 153 │ │ │ │ else: │ │ 154 │ │ │ │ │ logger.warning("No shutdown method for %s", impl_name) │ │ 155 │ │ │ except asyncio.TimeoutError: │ │ │ │ /usr/lib64/python3.11/asyncio/tasks.py:479 in wait_for │ │ │ │ 476 │ │ │ await waiter │ │ 477 │ │ except exceptions.CancelledError: │ │ 478 │ │ │ if fut.done(): │ │ ❱ 479 │ │ │ │ return fut.result() │ │ 480 │ │ │ else: │ │ 481 │ │ │ │ fut.remove_done_callback(cb) │ │ 482 │ │ │ │ # We must ensure that the task is not running │ │ │ │ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/routers/routing_tables.py:131 in shutdown │ │ │ │ 128 │ │ │ elif api == Api.tool_runtime: │ │ 129 │ │ │ │ p.tool_store = self │ │ 130 │ │ │ ❱ 131 │ async def shutdown(self) -> None: │ │ 132 │ │ for p in self.impls_by_provider_id.values(): │ │ 133 │ │ │ await p.shutdown() │ │ 134 │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ CancelledError INFO 2025-03-07 23:20:35,295 __main__:149 server: Shutting down DatasetIORouter INFO 2025-03-07 23:20:35,296 __main__:149 server: Shutting down ScoringFunctionsRoutingTable INFO 2025-03-07 23:20:35,297 __main__:149 server: Shutting down ScoringRouter INFO 2025-03-07 23:20:35,298 __main__:149 server: Shutting down ModelsRoutingTable INFO 2025-03-07 23:20:35,299 __main__:149 server: Shutting down InferenceRouter INFO 2025-03-07 23:20:35,300 __main__:149 server: Shutting down ShieldsRoutingTable INFO 2025-03-07 23:20:35,300 __main__:149 server: Shutting down SafetyRouter INFO 2025-03-07 23:20:35,301 __main__:149 server: Shutting down VectorDBsRoutingTable INFO 2025-03-07 23:20:35,302 __main__:149 server: Shutting down VectorIORouter INFO 2025-03-07 23:20:35,303 __main__:149 server: Shutting down ToolGroupsRoutingTable INFO 2025-03-07 23:20:35,304 __main__:149 server: Shutting down ToolRuntimeRouter INFO 2025-03-07 23:20:35,304 __main__:149 server: Shutting down MetaReferenceAgentsImpl INFO 2025-03-07 23:20:35,305 __main__:149 server: Shutting down TelemetryAdapter INFO 2025-03-07 23:20:35,306 __main__:149 server: Shutting down TorchtunePostTrainingImpl ERROR 2025-03-07 23:20:35,307 __main__:158 server: Failed to shutdown TorchtunePostTrainingImpl: {CancelledError('Shutdown')} ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮ │ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:152 in shutdown │ │ │ │ 149 │ │ │ logger.info("Shutting down %s", impl_name) │ │ 150 │ │ │ try: │ │ 151 │ │ │ │ if hasattr(impl, "shutdown"): │ │ ❱ 152 │ │ │ │ │ await asyncio.wait_for(impl.shutdown(), timeout=5) │ │ 153 │ │ │ │ else: │ │ 154 │ │ │ │ │ logger.warning("No shutdown method for %s", impl_name) │ │ 155 │ │ │ except asyncio.TimeoutError: │ │ │ │ /usr/lib64/python3.11/asyncio/tasks.py:489 in wait_for │ │ │ │ 486 │ │ │ │ raise │ │ 487 │ │ │ │ 488 │ │ if fut.done(): │ │ ❱ 489 │ │ │ return fut.result() │ │ 490 │ │ else: │ │ 491 │ │ │ fut.remove_done_callback(cb) │ │ 492 │ │ │ # We must ensure that the task is not running │ │ │ │ /home/ec2-user/src/llama-stack/schedule/llama_stack/providers/inline/post_training/torchtune/post_training. │ │ py:48 in shutdown │ │ │ │ 45 │ │ self.checkpoints_dict = {} │ │ 46 │ │ │ 47 │ async def shutdown(self) -> None: │ │ ❱ 48 │ │ raise asyncio.CancelledError("Shutdown") │ │ 49 │ │ │ 50 │ async def supervised_fine_tune( │ │ 51 │ │ self, │ ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ CancelledError: Shutdown INFO 2025-03-07 23:20:35,352 __main__:149 server: Shutting down BenchmarksRoutingTable INFO 2025-03-07 23:20:35,353 __main__:149 server: Shutting down EvalRouter INFO 2025-03-07 23:20:35,354 __main__:149 server: Shutting down DistributionInspectImpl INFO 2025-03-07 23:20:35,355 __main__:177 server: Stopping event loop Traceback (most recent call last): File "<frozen runpy>", line 198, in _run_module_as_main File "<frozen runpy>", line 88, in _run_code File "/home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py", line 488, in <module> main() File "/home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py", line 476, in main uvicorn.run(*uvicorn_config) File "/home/ec2-user/src/llama-stack/schedule/venv/lib64/python3.11/site-packages/uvicorn/main.py", line 579, in run server.run() File "/home/ec2-user/src/llama-stack/schedule/venv/lib64/python3.11/site-packages/uvicorn/server.py", line 66, in run return asyncio.run(self.serve(sockets=sockets)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib64/python3.11/asyncio/runners.py", line 189, in run with Runner(debug=debug) as runner: File "/usr/lib64/python3.11/asyncio/runners.py", line 63, in __exit__ self.close() File "/usr/lib64/python3.11/asyncio/runners.py", line 71, in close _cancel_all_tasks(loop) File "/usr/lib64/python3.11/asyncio/runners.py", line 201, in _cancel_all_tasks loop.run_until_complete(tasks.gather(to_cancel, return_exceptions=True)) File "/usr/lib64/python3.11/asyncio/base_events.py", line 652, in run_until_complete raise RuntimeError('Event loop stopped before Future completed.') RuntimeError: Event loop stopped before Future completed. ++ error_handler 104 ++ echo 'Error occurred in script at line: 104' Error occurred in script at line: 104 ++ exit 1 ``` With all patches included, the shutdown now looks as follows: ``` $ kill -INT $(ps ax \| grep llama_stack.distribution.server.server \| grep -v nvim \| awk -e '{print $1}' \| sort \| head -n 1) ``` ``` 20:56:09.308 [START] INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit) INFO: Shutting down INFO: Waiting for application shutdown. INFO 2025-03-10 20:56:43,961 __main__:140 server: Shutting down INFO 2025-03-10 20:56:43,962 __main__:124 server: Shutting down DatasetsRoutingTable INFO 2025-03-10 20:56:43,964 __main__:124 server: Shutting down DatasetIORouter INFO 2025-03-10 20:56:43,965 __main__:124 server: Shutting down ScoringFunctionsRoutingTable INFO 2025-03-10 20:56:43,966 __main__:124 server: Shutting down ScoringRouter INFO 2025-03-10 20:56:43,967 __main__:124 server: Shutting down ModelsRoutingTable INFO 2025-03-10 20:56:43,968 __main__:124 server: Shutting down InferenceRouter INFO 2025-03-10 20:56:43,969 __main__:124 server: Shutting down ShieldsRoutingTable INFO 2025-03-10 20:56:43,971 __main__:124 server: Shutting down SafetyRouter INFO 2025-03-10 20:56:43,972 __main__:124 server: Shutting down VectorDBsRoutingTable INFO 2025-03-10 20:56:43,973 __main__:124 server: Shutting down VectorIORouter INFO 2025-03-10 20:56:43,974 __main__:124 server: Shutting down ToolGroupsRoutingTable INFO 2025-03-10 20:56:43,975 __main__:124 server: Shutting down ToolRuntimeRouter INFO 2025-03-10 20:56:43,976 __main__:124 server: Shutting down MetaReferenceAgentsImpl INFO 2025-03-10 20:56:43,977 __main__:124 server: Shutting down TelemetryAdapter INFO 2025-03-10 20:56:43,978 __main__:124 server: Shutting down TorchtunePostTrainingImpl WARNING 2025-03-10 20:56:43,979 __main__:129 server: No shutdown method for TorchtunePostTrainingImpl INFO 2025-03-10 20:56:43,979 __main__:124 server: Shutting down BenchmarksRoutingTable INFO 2025-03-10 20:56:43,980 __main__:124 server: Shutting down EvalRouter INFO 2025-03-10 20:56:43,981 __main__:124 server: Shutting down DistributionInspectImpl INFO: Application shutdown complete. INFO: Finished server process [33862] ``` [//]: # (## Documentation) --------- Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>	2025-03-11 10:30:55 -07:00
Ashwin Bharambe	e13c92f269	revert: feat(server): Use system packages for execution (#1551 ) Reverts meta-llama/llama-stack#1252 The above PR breaks the following invocation: ```bash llama stack run ~/.llama/distributions/together/together-run.yaml ```	2025-03-11 09:58:25 -07:00
Sébastien Han	21e39633d8	feat(server): Use system packages for execution (#1252 ) # What does this PR do? Users prefer to rely on the main CLI rather than invoking the server through a Python module. Users interact with a high-level CLI rather than needing to know internal module structures. Now, when running llama stack run <path-to-config>, the server will attempt to use the system package or a virtual environment if one is active. This also eliminates the current process dependency chain when running from a virtual environment: -> llama stack run        -> start_env.sh              -> python -m server... Signed-off-by: Sébastien Han <seb@redhat.com> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Run: ``` ollama run llama3.2:3b-instruct-fp16 --keepalive=2m & llama stack run ./llama_stack/templates/ollama/run.yaml --disable-ipv6 ``` Notice that the server starts and shutdowns normally. [//]: # (## Documentation) --------- Signed-off-by: Sébastien Han <seb@redhat.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2025-03-10 16:01:03 -07:00
ehhuang	0e3c0cf8de	fix: server logging (#1521 ) Summary: Test Plan: ERROR 2025-03-10 10:53:00,804 __main__:239 server: Error executing endpoint route='/v1/inference/chat-completion' method='post'	2025-03-10 15:25:23 -07:00
Ashwin Bharambe	205661bc78	fix: Use re-entrancy and concurrency safe context managers for provider data (#1498 ) Concurrent requests should not trample (or reuse) each others' provider data. Provider data should be scoped to each request. ## Test Plan Set the uvicorn server to have a single worker process + thread by updating the config: ```python uvicorn_config = { ... "workers": 1, "loop": "asyncio", } ``` Then perform the following steps on `origin/main` (without this change). (1) Run the server using `llama stack run dev` without having `FIREWORKS_API_KEY` in the environment. (2) Run a test by specifying the FIREWORKS_API_KEY env var so it gets stored in the thread local ``` pytest -s -v tests/integration/inference/test_text_inference.py \ --stack-config http://localhost:8321 \ --text-model accounts/fireworks/models/llama-v3p1-8b-instruct \ -k test_text_chat_completion_with_tool_calling_and_streaming \ --env FIREWORKS_API_KEY=<...> ``` Ensure you don't have any other API keys in the environment (otherwise the bug will not reproduce due to other specifics in our testing code.) Verify this works. (3) Run the same command again without specifying FIREWORKS_API_KEY. See that the request actually succeeds when it should have failed. ---- Now do the same tests on this branch, verify step (3) results in failure. Finally, run the full `test_text_inference.py` test suite with this change, verify it succeeds.	2025-03-08 22:56:30 -08:00
Sébastien Han	7cf1e24c4e	feat(logging): implement category-based logging (#1362 ) # What does this PR do? This commit introduces a new logging system that allows loggers to be assigned a category while retaining the logger name based on the file name. The log format includes both the logger name and the category, producing output like: ``` INFO 2025-03-03 21:44:11,323 llama_stack.distribution.stack:103 [core]: Tool_groups: builtin::websearch served by tavily-search ``` Key features include: - Category-based logging: Loggers can be assigned a category (e.g., "core", "server") when programming. The logger can be loaded like this: `logger = get_logger(name=__name__, category="server")` - Environment variable control: Log levels can be configured per-category using the `LLAMA_STACK_LOGGING` environment variable. For example: `LLAMA_STACK_LOGGING="server=DEBUG;core=debug"` enables DEBUG level for the "server" and "core" categories. - `LLAMA_STACK_LOGGING="all=debug"` sets DEBUG level globally for all categories and third-party libraries. This provides fine-grained control over logging levels while maintaining a clean and informative log format. The formatter uses the rich library which provides nice colors better stack traces like so: ``` ERROR 2025-03-03 21:49:37,124 asyncio:1758 [uncategorized]: unhandled exception during asyncio.run() shutdown task: <Task finished name='Task-16' coro=<handle_signal.<locals>.shutdown() done, defined at /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:146> exception=UnboundLocalError("local variable 'loop' referenced before assignment")> ╭────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────╮ │ /Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py:178 in shutdown │ │ │ │ 175 │ │ except asyncio.CancelledError: │ │ 176 │ │ │ pass │ │ 177 │ │ finally: │ │ ❱ 178 │ │ │ loop.stop() │ │ 179 │ │ │ 180 │ loop = asyncio.get_running_loop() │ │ 181 │ loop.create_task(shutdown()) │ ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ UnboundLocalError: local variable 'loop' referenced before assignment ``` Co-authored-by: Ashwin Bharambe <@ashwinb> Signed-off-by: Sébastien Han <seb@redhat.com> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan ``` python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml INFO 2025-03-03 21:55:35,918 __main__:365 [server]: Using config file: llama_stack/templates/ollama/run.yaml INFO 2025-03-03 21:55:35,925 __main__:378 [server]: Run configuration: INFO 2025-03-03 21:55:35,928 __main__:380 [server]: apis: - agents ``` [//]: # (## Documentation) --------- Signed-off-by: Sébastien Han <seb@redhat.com> Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>	2025-03-07 11:34:30 -08:00
ehhuang	46bc5f4a7a	chore: log exception (#1452 ) Summary: Test Plan: <img width="1236" alt="image" src="https://github.com/user-attachments/assets/facc43ba-85ff-42e4-8e04-b7970c630c4d" />	2025-03-06 11:42:51 -08:00
Ashwin Bharambe	0a76ece249	feat: add more logs to agent_instance.py	2025-03-03 16:15:47 -08:00
Ashwin Bharambe	754feba61f	feat: add a configurable category-based logger (#1352 ) A self-respecting server needs good observability which starts with configurable logging. Llama Stack had little until now. This PR adds a `logcat` facility towards that. Callsites look like: ```python logcat.debug("inference", f"params to ollama: {params}") ``` - the first parameter is a category. there is a static list of categories in `llama_stack/logcat.py` - each category can be associated with a log-level which can be configured via the `LLAMA_STACK_LOGGING` env var. - a value `LLAMA_STACK_LOGGING=inference=debug;server=info"` does the obvious thing. there is a special key called `all` which is an alias for all categories ## Test Plan Ran with `LLAMA_STACK_LOGGING="all=debug" llama stack run fireworks` and saw the following: ![image](https://github.com/user-attachments/assets/d24b95ab-3941-426c-9ea0-a4c62542e6f0) Hit it with a client-sdk test case and saw this: ![image](https://github.com/user-attachments/assets/3fee8c6c-986e-4125-a09c-f5dc019682e2)	2025-03-02 18:51:14 -08:00
Sébastien Han	929c5f0842	refactor(server): replace print statements with logger (#1250 ) # What does this PR do? - Introduced logging in `StackRun` to replace print-based messages - Improved error handling for config file loading and parsing - Replaced `cprint` with `logger.error` for consistent error messaging - Ensured logging is used in `server.py` for startup, shutdown, and runtime messages - Added missing exception handling for invalid providers Signed-off-by: Sébastien Han <seb@redhat.com> Signed-off-by: Sébastien Han <seb@redhat.com>	2025-02-25 21:31:37 -08:00
ehhuang	1166afdf76	fix: some telemetry APIs don't currently work (#1188 ) Summary: This bug is surfaced by using the http LS client. The issue is that non-scalar values in 'GET' method are `body` params in fastAPI, but our spec generation script doesn't respect that. We fix by just making them POST method instead. Test Plan: Test API call with newly sync'd client (https://github.com/meta-llama/llama-stack-client-python/pull/149) <img width="1114" alt="image" src="https://github.com/user-attachments/assets/7710aca5-d163-4e00-a465-14e6fcaac2b2" />	2025-02-20 14:09:25 -08:00
Sébastien Han	e4a1579e63	build: format codebase imports using ruff linter (#1028 ) # What does this PR do? - Configured ruff linter to automatically fix import sorting issues. - Set --exit-non-zero-on-fix to ensure non-zero exit code when fixes are applied. - Enabled the 'I' selection to focus on import-related linting rules. - Ran the linter, and formatted all codebase imports accordingly. - Removed the black dep from the "dev" group since we use ruff Signed-off-by: Sébastien Han <seb@redhat.com> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan [Describe the tests you ran to verify your changes with result summaries. Provide clear instructions so the plan can be easily re-executed.] [//]: # (## Documentation) [//]: # (- [ ] Added a Changelog entry if the change is significant) Signed-off-by: Sébastien Han <seb@redhat.com>	2025-02-13 10:06:21 -08:00
Sébastien Han	418645696a	fix: improve signal handling and update dependencies (#1044 ) # What does this PR do? This commit enhances the signal handling mechanism in the server by improving the `handle_signal` (previously handle_sigint) function. It now properly retrieves the signal name, ensuring clearer logging when a termination signal is received. Additionally, it cancels all running tasks and waits for their completion before stopping the event loop, allowing for a more graceful shutdown. Support for handling SIGTERM has also been added alongside SIGINT. Before the changes, handle_sigint used asyncio.run(run_shutdown()). However, asyncio.run() is meant to start a new event loop, and calling it inside an existing one (like when running Uvicorn) raises an error. The fix replaces asyncio.run(run_shutdown()) with an async function scheduled on the existing loop using loop.create_task(shutdown()). This ensures that the shutdown coroutine runs within the current event loop instead of trying to create a new one. Furthermore, this commit updates the project dependencies. `fastapi` and `uvicorn` have been added to the development dependencies in `pyproject.toml` and `uv.lock`, ensuring that the necessary packages are available for development and execution. Closes: https://github.com/meta-llama/llama-stack/issues/1043 Signed-off-by: Sébastien Han <seb@redhat.com> [//]: # (If resolving an issue, uncomment and update the line below) [//]: # (Closes #[issue-number]) ## Test Plan Run a server and send SIGINT: ``` INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml Using config file: llama_stack/templates/ollama/run.yaml Run configuration: apis: - agents - datasetio - eval - inference - safety - scoring - telemetry - tool_runtime - vector_io container_image: null datasets: [] eval_tasks: [] image_name: ollama metadata_store: db_path: /Users/leseb/.llama/distributions/ollama/registry.db namespace: null type: sqlite models: - metadata: {} model_id: meta-llama/Llama-3.2-3B-Instruct model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType - llm provider_id: ollama provider_model_id: null - metadata: embedding_dimension: 384 model_id: all-MiniLM-L6-v2 model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType - embedding provider_id: sentence-transformers provider_model_id: null providers: agents: - config: persistence_store: db_path: /Users/leseb/.llama/distributions/ollama/agents_store.db namespace: null type: sqlite provider_id: meta-reference provider_type: inline::meta-reference datasetio: - config: {} provider_id: huggingface provider_type: remote::huggingface - config: {} provider_id: localfs provider_type: inline::localfs eval: - config: {} provider_id: meta-reference provider_type: inline::meta-reference inference: - config: url: http://localhost:11434 provider_id: ollama provider_type: remote::ollama - config: {} provider_id: sentence-transformers provider_type: inline::sentence-transformers safety: - config: {} provider_id: llama-guard provider_type: inline::llama-guard scoring: - config: {} provider_id: basic provider_type: inline::basic - config: {} provider_id: llm-as-judge provider_type: inline::llm-as-judge - config: openai_api_key: '******' provider_id: braintrust provider_type: inline::braintrust telemetry: - config: service_name: llama-stack sinks: console,sqlite sqlite_db_path: /Users/leseb/.llama/distributions/ollama/trace_store.db provider_id: meta-reference provider_type: inline::meta-reference tool_runtime: - config: api_key: '****' max_results: 3 provider_id: brave-search provider_type: remote::brave-search - config: api_key: '******' max_results: 3 provider_id: tavily-search provider_type: remote::tavily-search - config: {} provider_id: code-interpreter provider_type: inline::code-interpreter - config: {} provider_id: rag-runtime provider_type: inline::rag-runtime vector_io: - config: kvstore: db_path: /Users/leseb/.llama/distributions/ollama/faiss_store.db namespace: null type: sqlite provider_id: faiss provider_type: inline::faiss scoring_fns: [] server: port: 8321 tls_certfile: null tls_keyfile: null shields: [] tool_groups: - args: null mcp_endpoint: null provider_id: tavily-search toolgroup_id: builtin::websearch - args: null mcp_endpoint: null provider_id: rag-runtime toolgroup_id: builtin::rag - args: null mcp_endpoint: null provider_id: code-interpreter toolgroup_id: builtin::code_interpreter vector_dbs: [] version: '2' INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:213: Resolved 31 providers INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-inference => ollama INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-inference => sentence-transformers INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: models => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inference => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-vector_io => faiss INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-safety => llama-guard INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: shields => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: safety => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: vector_dbs => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: vector_io => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => brave-search INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => tavily-search INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => code-interpreter INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => rag-runtime INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: tool_groups => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: tool_runtime => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: agents => meta-reference INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-datasetio => huggingface INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-datasetio => localfs INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: datasets => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: datasetio => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: telemetry => meta-reference INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => basic INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => llm-as-judge INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => braintrust INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: scoring_functions => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: scoring => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-eval => meta-reference INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: eval_tasks => __routing_table__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: eval => __autorouted__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inspect => __builtin__ INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:216: INFO 2025-02-12 10:21:03,723 llama_stack.providers.remote.inference.ollama.ollama:148: checking connectivity to Ollama at `http://localhost:11434`... INFO 2025-02-12 10:21:03,734 httpx:1740: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK" INFO 2025-02-12 10:21:03,843 faiss.loader:148: Loading faiss. INFO 2025-02-12 10:21:03,865 faiss.loader:150: Successfully loaded faiss. INFO 2025-02-12 10:21:03,868 faiss:173: Failed to load GPU Faiss: name 'GpuIndexIVFFlat' is not defined. Will not load constructor refs for GPU indexes. Warning: `bwrap` is not available. Code interpreter tool will not work correctly. INFO 2025-02-12 10:21:04,315 datasets:54: PyTorch version 2.6.0 available. INFO 2025-02-12 10:21:04,556 httpx:1740: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK" INFO 2025-02-12 10:21:04,557 llama_stack.providers.utils.inference.embedding_mixin:42: Loading sentence transformer for all-MiniLM-L6-v2... INFO 2025-02-12 10:21:07,202 sentence_transformers.SentenceTransformer:210: Use pytorch device_name: mps INFO 2025-02-12 10:21:07,202 sentence_transformers.SentenceTransformer:218: Load pretrained SentenceTransformer: all-MiniLM-L6-v2 INFO 2025-02-12 10:21:09,500 llama_stack.distribution.stack:102: Models: all-MiniLM-L6-v2 served by sentence-transformers INFO 2025-02-12 10:21:09,500 llama_stack.distribution.stack:102: Models: meta-llama/Llama-3.2-3B-Instruct served by ollama INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::equality served by basic INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::regex_parser_multiple_choice_answer served by basic INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::subset_of served by basic INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-correctness served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-relevancy served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-similarity served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-entity-recall served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-precision served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-recall served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-relevancy served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::factuality served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::faithfulness served by braintrust INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: llm-as-judge::405b-simpleqa served by llm-as-judge INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: llm-as-judge::base served by llm-as-judge INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::code_interpreter served by code-interpreter INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::rag served by rag-runtime INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::websearch served by tavily-search INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:106: Serving API eval POST /v1/eval/tasks/{task_id}/evaluations DELETE /v1/eval/tasks/{task_id}/jobs/{job_id} GET /v1/eval/tasks/{task_id}/jobs/{job_id}/result GET /v1/eval/tasks/{task_id}/jobs/{job_id} POST /v1/eval/tasks/{task_id}/jobs Serving API agents POST /v1/agents POST /v1/agents/{agent_id}/session POST /v1/agents/{agent_id}/session/{session_id}/turn DELETE /v1/agents/{agent_id} DELETE /v1/agents/{agent_id}/session/{session_id} GET /v1/agents/{agent_id}/session/{session_id} GET /v1/agents/{agent_id}/session/{session_id}/turn/{turn_id}/step/{step_id} GET /v1/agents/{agent_id}/session/{session_id}/turn/{turn_id} Serving API scoring_functions GET /v1/scoring-functions/{scoring_fn_id} GET /v1/scoring-functions POST /v1/scoring-functions Serving API safety POST /v1/safety/run-shield Serving API inspect GET /v1/health GET /v1/inspect/providers GET /v1/inspect/routes GET /v1/version Serving API tool_runtime POST /v1/tool-runtime/invoke GET /v1/tool-runtime/list-tools POST /v1/tool-runtime/rag-tool/insert POST /v1/tool-runtime/rag-tool/query Serving API datasetio POST /v1/datasetio/rows GET /v1/datasetio/rows Serving API shields GET /v1/shields/{identifier} GET /v1/shields POST /v1/shields Serving API eval_tasks GET /v1/eval-tasks/{eval_task_id} GET /v1/eval-tasks POST /v1/eval-tasks Serving API models GET /v1/models/{model_id} GET /v1/models POST /v1/models DELETE /v1/models/{model_id} Serving API datasets GET /v1/datasets/{dataset_id} GET /v1/datasets POST /v1/datasets DELETE /v1/datasets/{dataset_id} Serving API vector_io POST /v1/vector-io/insert POST /v1/vector-io/query Serving API inference POST /v1/inference/chat-completion POST /v1/inference/completion POST /v1/inference/embeddings Serving API tool_groups GET /v1/tools/{tool_name} GET /v1/toolgroups/{toolgroup_id} GET /v1/toolgroups GET /v1/tools POST /v1/toolgroups DELETE /v1/toolgroups/{toolgroup_id} Serving API vector_dbs GET /v1/vector-dbs/{vector_db_id} GET /v1/vector-dbs POST /v1/vector-dbs DELETE /v1/vector-dbs/{vector_db_id} Serving API scoring POST /v1/scoring/score POST /v1/scoring/score-batch Serving API telemetry GET /v1/telemetry/traces/{trace_id}/spans/{span_id} GET /v1/telemetry/spans/{span_id}/tree GET /v1/telemetry/traces/{trace_id} POST /v1/telemetry/events GET /v1/telemetry/spans GET /v1/telemetry/traces POST /v1/telemetry/spans/export Listening on ['::', '0.0.0.0']:5001 INFO: Started server process [65372] INFO: Waiting for application startup. INFO: ASGI 'lifespan' protocol appears unsupported. INFO: Application startup complete. INFO: Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit) ^CINFO: Shutting down INFO: Finished server process [65372] Received signal SIGINT (2). Exiting gracefully... INFO 2025-02-12 10:21:11,215 __main__:151: Shutting down ModelsRoutingTable INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down InferenceRouter INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ShieldsRoutingTable INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down SafetyRouter INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down VectorDBsRoutingTable INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down VectorIORouter INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ToolGroupsRoutingTable INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ToolRuntimeRouter INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down MetaReferenceAgentsImpl INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DatasetsRoutingTable INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DatasetIORouter INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down TelemetryAdapter INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ScoringFunctionsRoutingTable INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ScoringRouter INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down EvalTasksRoutingTable INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down EvalRouter INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DistributionInspectImpl ``` [//]: # (## Documentation) [//]: # (- [ ] Added a Changelog entry if the change is significant) Signed-off-by: Sébastien Han <seb@redhat.com>	2025-02-13 08:07:59 -08:00
Ashwin Bharambe	f8f2f7f9bb	feat: Add HTTPS serving option (#1000 ) # What does this PR do? Enables HTTPS option for Llama Stack. While doing so, introduces a `ServerConfig` sub-structure to house all server related configuration (port, ssl, etc.) Also simplified the `start_container.sh` entrypoint to simply be `python` instead of a complex bash command line. ## Test Plan Conda: Run: ```bash $ llama stack build --template together $ llama stack run --port 8322 # ensure server starts $ llama-stack-client configure --endpoint http://localhost:8322 $ llama-stack-client models list ``` Create a self-signed SSL key / cert pair. Then, using a local checkout of `llama-stack-client-python`, change https://github.com/meta-llama/llama-stack-client-python/blob/main/src/llama_stack_client/_base_client.py#L759 to add `kwargs.setdefault("verify", False)` so SSL verification is disabled. Then: ```bash $ llama stack run --port 8322 --tls-keyfile <KEYFILE> --tls-certfile <CERTFILE> $ llama-stack-client configure --endpoint https://localhost:8322 # notice the `https` $ llama-stack-client models list ``` Also tested with containers (but of course one needs to make sure the cert and key files are appropriately provided to the container.)	2025-02-07 09:39:08 -08:00
Yuan Tang	34ab7a3b6c	Fix precommit check after moving to ruff (#927 ) Lint check in main branch is failing. This fixes the lint check after we moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We need to move to a `ruff.toml` file as well as fixing and ignoring some additional checks. Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-02-02 06:46:45 -08:00
Ashwin Bharambe	9f14382d82	meta reference inference fixes (#797 ) Miscellaneous fixes for meta reference inference Tests for log probs dont pass because meta reference does not support top_k > 1	2025-01-16 18:17:46 -08:00
Ashwin Bharambe	cb41848a2a	disable version check optionally	2025-01-16 18:14:48 -08:00
Ashwin Bharambe	03ac84a829	Update default port from 5000 -> 8321	2025-01-16 15:26:48 -08:00
Dinesh Yeduguru	7fb2c1c48d	More idiomatic REST API (#765 ) # What does this PR do? This PR changes our API to follow more idiomatic REST API approaches of having paths being resources and methods indicating the action being performed. Changes made to generator: 1) removed the prefix check of "get" as its not required and is actually needed for other method types too 2) removed _ check on path since variables can have "_" ## Test Plan LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v tests/client-sdk/agents/test_agents.py	2025-01-15 13:20:09 -08:00
raghotham	ff182ff6de	rename LLAMASTACK_PORT to LLAMA_STACK_PORT for consistency with other env vars (#744 ) # What does this PR do? Rename environment var for consistency ## Test Plan No regressions ## Sources ## Before submitting - [X] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [X] Ran pre-commit to handle lint / formatting issues. - [X] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [X] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests. --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-10 11:09:49 -08:00
Ashwin Bharambe	4938f2fe5d	Check version incompatibility (#738 ) When we bump up `major.minor` we want to make sure clients can immediately detect a version change and appropriately error out. It is not reasonable to keep checking for API-level backwards compatibility across such version bumps. Over time, we will make the check based only on the major version perhaps. ### Test Plan Manually updated `__version__` in the client SDK to be "0.1.0" which is incompatible with server's current version "0.0.63", got the following error: <img width="1077" alt="image" src="https://github.com/user-attachments/assets/06ae4659-0a25-4c4c-a999-ce44678d4e6f" /> Without this update, the CLI worked correctly.	2025-01-09 14:52:06 -08:00
Yuan Tang	96d8375663	Fix incorrect entrypoint for broken `llama stack run` (#706 ) This fixes the issue when using `llama stack run` by correctly specifying entrypoint: ``` LLAMA_STACK_DIR=. llama stack run /home/yutang/.llama/distributions/llamastack-vllm/vllm-run.yaml Using config file: /home/yutang/.llama/distributions/llamastack-vllm/vllm-run.yaml + command -v selinuxenabled + selinuxenabled + DOCKER_OPTS=' --security-opt label=disable' + mounts= + '[' -n . ']' ++ readlink -f . + mounts=' -v /home/yutang/repos/llama-stack:/app/llama-stack-source' + '[' -n '' ']' + version_tag=latest + '[' -n '' ']' + '[' -n . ']' + version_tag=dev + podman run --security-opt label=disable -it -p 5000:5000 -v /home/yutang/.llama/distributions/llamastack-vllm/vllm-run.yaml:/app/config.yaml -v /home/yutang/repos/llama-stack:/app/llama-stack-source localhost/distribution-vllm:dev python -m llama_stack.distribution.server.server --yaml-config /app/config.yaml --port 5000 usage: server.py [-h] [--yaml-config YAML_CONFIG] [--template TEMPLATE] [--port PORT] [--disable-ipv6] [--env ENV] server.py: error: unrecognized arguments: python -m llama_stack.distribution.server.server ++ error_handler 88 ++ echo 'Error occurred in script at line: 88' Error occurred in script at line: 88 ++ exit 1 ``` --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>	2025-01-03 09:47:10 -08:00
Ashwin Bharambe	e3f187fb83	Redact sensitive information from configs when printing, etc.	2025-01-02 13:54:02 -08:00
Xi Yan	3c72c034e6	[remove import ] clean up import 's (#689 ) # What does this PR do? - as title, cleaning up `import `'s - upgrade tests to make them more robust to bad model outputs - remove import 's in llama_stack/apis/* (skip __init__ modules) <img width="465" alt="image" src="https://github.com/user-attachments/assets/d8339c13-3b40-4ba5-9c53-0d2329726ee2" /> - run `sh run_openapi_generator.sh`, no types gets affected ## Test Plan ### Providers Tests agents ``` pytest -v -s llama_stack/providers/tests/agents/test_agents.py -m "together" --safety-shield meta-llama/Llama-Guard-3-8B --inference-model meta-llama/Llama-3.1-405B-Instruct-FP8 ``` inference ```bash # meta-reference torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py # together pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.2-11B-Vision-Instruct" ./llama_stack/providers/tests/inference/test_vision_inference.py pytest ./llama_stack/providers/tests/inference/test_prompt_adapter.py ``` safety ``` pytest -v -s llama_stack/providers/tests/safety/test_safety.py -m together --safety-shield meta-llama/Llama-Guard-3-8B ``` memory ``` pytest -v -s llama_stack/providers/tests/memory/test_memory.py -m "sentence_transformers" --env EMBEDDING_DIMENSION=384 ``` scoring ``` pytest -v -s -m llm_as_judge_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py --judge-model meta-llama/Llama-3.2-3B-Instruct pytest -v -s -m basic_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py pytest -v -s -m braintrust_scoring_together_inference llama_stack/providers/tests/scoring/test_scoring.py ``` datasetio ``` pytest -v -s -m localfs llama_stack/providers/tests/datasetio/test_datasetio.py pytest -v -s -m huggingface llama_stack/providers/tests/datasetio/test_datasetio.py ``` eval ``` pytest -v -s -m meta_reference_eval_together_inference llama_stack/providers/tests/eval/test_eval.py pytest -v -s -m meta_reference_eval_together_inference_huggingface_datasetio llama_stack/providers/tests/eval/test_eval.py ``` ### Client-SDK Tests ``` LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk ``` ### llama-stack-apps ``` PORT=5000 LOCALHOST=localhost python -m examples.agents.hello $LOCALHOST $PORT python -m examples.agents.inflation $LOCALHOST $PORT python -m examples.agents.podcast_transcript $LOCALHOST $PORT python -m examples.agents.rag_as_attachments $LOCALHOST $PORT python -m examples.agents.rag_with_memory_bank $LOCALHOST $PORT python -m examples.safety.llama_guard_demo_mm $LOCALHOST $PORT python -m examples.agents.e2e_loop_with_custom_tools $LOCALHOST $PORT # Vision model python -m examples.interior_design_assistant.app python -m examples.agent_store.app $LOCALHOST $PORT ``` ### CLI ``` which llama llama model prompt-format -m Llama3.2-11B-Vision-Instruct llama model list llama stack list-apis llama stack list-providers inference llama stack build --template ollama --image-type conda ``` ### Distributions Tests ollama ``` llama stack build --template ollama --image-type conda ollama run llama3.2:1b-instruct-fp16 llama stack run ./llama_stack/templates/ollama/run.yaml --env INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct ``` fireworks ``` llama stack build --template fireworks --image-type conda llama stack run ./llama_stack/templates/fireworks/run.yaml ``` together ``` llama stack build --template together --image-type conda llama stack run ./llama_stack/templates/together/run.yaml ``` tgi ``` llama stack run ./llama_stack/templates/tgi/run.yaml --env TGI_URL=http://0.0.0.0:5009 --env INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct ``` ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-27 15:45:44 -08:00
Xi Yan	c699e884b5	fix telemetry import (#585 ) # What does this PR do? fix issue <img width="921" alt="image" src="https://github.com/user-attachments/assets/26f7499f-fae1-4c93-9de3-1ae7ee7c5144"> ## Test Plan ``` llama stack run ``` <img width="657" alt="image" src="https://github.com/user-attachments/assets/266b6ac2-f991-4b38-841c-2a610b7d9f0f"> ## Sources Please link relevant resources if necessary. ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Ran pre-commit to handle lint / formatting issues. - [ ] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section? - [ ] Updated relevant documentation. - [ ] Wrote necessary unit or integration tests.	2024-12-09 11:18:53 -08:00
Ashwin Bharambe	084ec337af	Small cleanup of console logs	2024-12-06 10:29:24 -08:00
Dinesh Yeduguru	fcd6449519	Telemetry API redesign (#525 ) # What does this PR do? Change the Telemetry API to be able to support different use cases like returning traces for the UI and ability to export for Evals. Other changes: * Add a new trace_protocol decorator to decorate all our API methods so that any call to them will automatically get traced across all impls. * There is some issue with the decorator pattern of span creation when using async generators, where there are multiple yields with in the same context. I think its much more explicit by using the explicit context manager pattern using with. I moved the span creations in agent instance to be using with * Inject session id at the turn level, which should quickly give us all traces across turns for a given session Addresses #509 ## Test Plan ``` llama stack run /Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml PYTHONPATH=. python -m examples.agents.rag_with_memory_bank localhost 5000 curl -X POST 'http://localhost:5000/alpha/telemetry/query-traces' \ -H 'Content-Type: application/json' \ -d '{ "attribute_filters": [ { "key": "session_id", "op": "eq", "value": "dd667b87-ca4b-4d30-9265-5a0de318fc65" }], "limit": 100, "offset": 0, "order_by": ["start_time"] }' \| jq . [ { "trace_id": "6902f54b83b4b48be18a6f422b13e16f", "root_span_id": "5f37b85543afc15a", "start_time": "2024-12-04T08:08:30.501587", "end_time": "2024-12-04T08:08:36.026463" }, { "trace_id": "92227dac84c0615ed741be393813fb5f", "root_span_id": "af7c5bb46665c2c8", "start_time": "2024-12-04T08:08:36.031170", "end_time": "2024-12-04T08:08:41.693301" }, { "trace_id": "7d578a6edac62f204ab479fba82f77b6", "root_span_id": "1d935e3362676896", "start_time": "2024-12-04T08:08:41.695204", "end_time": "2024-12-04T08:08:47.228016" }, { "trace_id": "dbd767d76991bc816f9f078907dc9ff2", "root_span_id": "f5a7ee76683b9602", "start_time": "2024-12-04T08:08:47.234578", "end_time": "2024-12-04T08:08:53.189412" } ] curl -X POST 'http://localhost:5000/alpha/telemetry/get-span-tree' \ -H 'Content-Type: application/json' \ -d '{ "span_id" : "6cceb4b48a156913", "max_depth": 2, "attributes_to_return": ["input"] }' \| jq . % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 875 100 790 100 85 18462 1986 --:--:-- --:--:-- --:--:-- 20833 { "span_id": "6cceb4b48a156913", "trace_id": "dafa796f6aaf925f511c04cd7c67fdda", "parent_span_id": "892a66d726c7f990", "name": "retrieve_rag_context", "start_time": "2024-12-04T09:28:21.781995", "end_time": "2024-12-04T09:28:21.913352", "attributes": { "input": [ "{\"role\":\"system\",\"content\":\"You are a helpful assistant\"}", "{\"role\":\"user\",\"content\":\"What are the top 5 topics that were explained in the documentation? Only list succinct bullet points.\",\"context\":null}" ] }, "children": [ { "span_id": "1a2df181854064a8", "trace_id": "dafa796f6aaf925f511c04cd7c67fdda", "parent_span_id": "6cceb4b48a156913", "name": "MemoryRouter.query_documents", "start_time": "2024-12-04T09:28:21.787620", "end_time": "2024-12-04T09:28:21.906512", "attributes": { "input": null }, "children": [], "status": "ok" } ], "status": "ok" } ``` <img width="1677" alt="Screenshot 2024-12-04 at 9 42 56 AM" src="https://github.com/user-attachments/assets/4d3cea93-05ce-415a-93d9-4b1628631bf8">	2024-12-04 11:22:45 -08:00
Dinesh Yeduguru	501e7c9d64	Fix opentelemetry adapter (#510 ) # What does this PR do? This PR fixes some of the issues with our telemetry setup to enable logs to be delivered to opentelemetry and jaeger. Main fixes 1) Updates the open telemetry provider to use the latest oltp exports instead of deprected ones. 2) Adds a tracing middleware, which injects traces into each HTTP request that the server recieves and this is going to be the root trace. Previously, we did this in the create_dynamic_route method, which is actually not the actual exectuion flow, but more of a config and this causes the traces to end prematurely. Through middleware, we plugin the trace start and end at the right location. 3) We manage our own methods to create traces and spans and this does not fit well with Opentelemetry SDK since it does not support provide a way to take in traces and spans that are already created. it expects us to use the SDK to create them. For now, I have a hacky approach of just maintaining a map from our internal telemetry objects to the open telemetry specfic ones. This is not the ideal solution. I will explore other ways to get around this issue. for now, to have something that works, i am going to keep this as is. Addresses: #509	2024-11-22 18:18:11 -08:00
Dinesh Yeduguru	6395dadc2b	use logging instead of prints (#499 ) # What does this PR do? This PR moves all print statements to use logging. Things changed: - Had to add `await start_trace("sse_generator")` to server.py to actually get tracing working. else was not seeing any logs - If no telemetry provider is provided in the run.yaml, we will write to stdout - by default, the logs are going to be in JSON, but we expose an option to configure to output in a human readable way.	2024-11-21 11:32:53 -08:00
Ashwin Bharambe	681322731b	Make run yaml optional so dockers can start with just --env (#492 ) When running with dockers, the idea is that users be able to work purely with the `llama stack` CLI. They should not need to know about the existence of any YAMLs unless they need to. This PR enables it. The docker command now doesn't need to volume mount a yaml and can simply be: ```bash docker run -v ~/.llama/:/root/.llama \ --env A=a --env B=b ``` ## Test Plan Check with conda first (no regressions): ```bash LLAMA_STACK_DIR=. llama stack build --template ollama llama stack run ollama --port 5001 # server starts up correctly ``` Check with docker ```bash # build the docker LLAMA_STACK_DIR=. llama stack build --template ollama --image-type docker export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" docker run -it -p 5001:5001 \ -v ~/.llama:/root/.llama \ -v $PWD:/app/llama-stack-source \ localhost/distribution-ollama:dev \ --port 5001 \ --env INFERENCE_MODEL=$INFERENCE_MODEL \ --env OLLAMA_URL=http://host.docker.internal:11434 ``` Note that volume mounting to `/app/llama-stack-source` is only needed because we built the docker with uncommitted source code.	2024-11-20 13:11:40 -08:00
Dinesh Yeduguru	fe19076838	get stack run config based on template name (#477 ) This PR adds a method in stack to return the stackrunconfig object based on the template name. This will be used to instantiate a direct client without the need for an explicit run.yaml --------- Co-authored-by: Dinesh Yeduguru <dineshyv@fb.com>	2024-11-18 18:05:05 -08:00
Ashwin Bharambe	fb15ff4a97	Move to use argparse, fix issues with multiple --env cmdline options	2024-11-18 16:31:59 -08:00
Ashwin Bharambe	b87f3ac499	Allow server to accept --env key pairs	2024-11-18 16:17:59 -08:00
Ashwin Bharambe	2a31163178	Auto-generate distro yamls + docs (#468 ) # What does this PR do? Automatically generates - build.yaml - run.yaml - run-with-safety.yaml - parts of markdown docs for the distributions. ## Test Plan At this point, this only updates the YAMLs and the docs. Some testing (especially with ollama and vllm) has been performed but needs to be much more tested.	2024-11-18 14:57:06 -08:00
Ashwin Bharambe	20bf2f50c2	No more model_id warnings	2024-11-15 12:20:18 -08:00
Ashwin Bharambe	7f6ac2fbd7	allow seeing warnings with traces optionally	2024-11-13 12:27:19 -08:00
Ashwin Bharambe	96e7ef646f	add support for ${env.FOO_BAR} placeholders in run.yaml files (#439 ) # What does this PR do? We'd like our docker steps to require _ZERO EDITS_ to a YAML file in order to get going. This is often not possible because depending on the provider, we do need some configuration input from the user. Environment variables are the best way to obtain this information. This PR allows our run.yaml to contain `${env.FOO_BAR}` placeholders which can be replaced using `docker run -e FOO_BAR=baz` (and similar `docker compose` equivalent). ## Test Plan For remote-vllm, example `run.yaml` snippet looks like this: ```yaml providers: inference: # serves main inference model - provider_id: vllm-0 provider_type: remote::vllm config: # NOTE: replace with "localhost" if you are running in "host" network mode url: ${env.LLAMA_INFERENCE_VLLM_URL:http://host.docker.internal:5100/v1} max_tokens: ${env.MAX_TOKENS:4096} api_token: fake # serves safety llama_guard model - provider_id: vllm-1 provider_type: remote::vllm config: # NOTE: replace with "localhost" if you are running in "host" network mode url: ${env.LLAMA_SAFETY_VLLM_URL:http://host.docker.internal:5101/v1} max_tokens: ${env.MAX_TOKENS:4096} api_token: fake ``` `compose.yaml` snippet looks like this: ```yaml llamastack: depends_on: - vllm-0 - vllm-1 # image: llamastack/distribution-remote-vllm image: llamastack/distribution-remote-vllm:test-0.0.52rc3 volumes: - ~/.llama:/root/.llama - ~/local/llama-stack/distributions/remote-vllm/run.yaml:/root/llamastack-run-remote-vllm.yaml # network_mode: "host" environment: - LLAMA_INFERENCE_VLLM_URL=${LLAMA_INFERENCE_VLLM_URL:-http://host.docker.internal:5100/v1} - LLAMA_INFERENCE_MODEL=${LLAMA_INFERENCE_MODEL:-Llama3.1-8B-Instruct} - MAX_TOKENS=${MAX_TOKENS:-4096} - SQLITE_STORE_DIR=${SQLITE_STORE_DIR:-$HOME/.llama/distributions/remote-vllm} - LLAMA_SAFETY_VLLM_URL=${LLAMA_SAFETY_VLLM_URL:-http://host.docker.internal:5101/v1} - LLAMA_SAFETY_MODEL=${LLAMA_SAFETY_MODEL:-Llama-Guard-3-1B} ```	2024-11-13 11:25:58 -08:00
Ashwin Bharambe	12947ac19e	Kill "remote" providers and fix testing with a remote stack properly (#435 ) # What does this PR do? This PR kills the notion of "pure passthrough" remote providers. You cannot specify a single provider you must specify a whole distribution (stack) as remote. This PR also significantly fixes / upgrades testing infrastructure so you can now test against a remotely hosted stack server by just doing ```bash pytest -s -v -m remote test_agents.py \ --inference-model=Llama3.1-8B-Instruct --safety-shield=Llama-Guard-3-1B \ --env REMOTE_STACK_URL=http://localhost:5001 ``` Also fixed `test_agents_persistence.py` (which was broken) and killed some deprecated testing functions. ## Test Plan All the tests.	2024-11-12 21:51:29 -08:00
Ashwin Bharambe	d9d271a684	Allow specifying resources in StackRunConfig (#425 ) # What does this PR do? This PR brings back the facility to not force registration of resources onto the user. This is not just annoying but actually not feasible sometimes. For example, you may have a Stack which boots up with private providers for inference for models A and B. There is no way for the user to actually know which model is being served by these providers now (to be able to register it.) How will this avoid the users needing to do registration? In a follow-up diff, I will make sure I update the sample run.yaml files so they list the models served by the distributions explicitly. So when users do `llama stack build --template <...>` and run it, their distributions come up with the right set of models they expect. For self-hosted distributions, it also allows us to have a place to explicit list the models that need to be served to make the "complete" stack (including safety, e.g.) ## Test Plan Started ollama locally with two lightweight models: Llama3.2-3B-Instruct and Llama-Guard-3-1B. Updated all the tests including agents. Here's the tests I ran so far: ```bash pytest -s -v -m "fireworks and llama_3b" test_text_inference.py::TestInference \ --env FIREWORKS_API_KEY=... pytest -s -v -m "ollama and llama_3b" test_text_inference.py::TestInference pytest -s -v -m ollama test_safety.py pytest -s -v -m faiss test_memory.py pytest -s -v -m ollama test_agents.py \ --inference-model=Llama3.2-3B-Instruct --safety-model=Llama-Guard-3-1B ``` Found a few bugs here and there pre-existing that these test runs fixed.	2024-11-12 10:58:49 -08:00
Ashwin Bharambe	c1f7ba3aed	Split safety into (llama-guard, prompt-guard, code-scanner) (#400 ) Splits the meta-reference safety implementation into three distinct providers: - inline::llama-guard - inline::prompt-guard - inline::code-scanner Note that this PR is a backward incompatible change to the llama stack server. I have added deprecation_error field to ProviderSpec -- the server reads it and immediately barfs. This is used to direct the user with a specific message on what action to perform. An automagical "config upgrade" is a bit too much work to implement right now :/ (Note that we will be gradually prefixing all inline providers with inline:: -- I am only doing this for this set of new providers because otherwise existing configuration files will break even more badly.)	2024-11-11 09:29:18 -08:00
Dalton Flanagan	345ae07317	Factor out create_dist_registry (#398 )	2024-11-07 16:13:19 -05:00
Ashwin Bharambe	d289afdbde	Fix exception in server when client SSE connection closes	2024-11-06 11:00:34 -08:00

1 2

69 commits