Commit graph

916 commits

Author SHA1 Message Date
Ihar Hrachyshka
77ca09467f
chore: consolidate scripts under ./scripts directory (#1646) 2025-03-17 17:56:30 -04:00
Nathan Weinberg
e48af78b76
fix: add shutdown method for ProviderImpl (#1670)
# What does this PR do?
Currently there is no shutdown method implemented for the `ProviderImpl`
class

This leads to the following warning
```shell
INFO:     Waiting for application shutdown.
INFO     2025-03-17 17:25:13,280 __main__:145 server: Shutting down                                                     
INFO     2025-03-17 17:25:13,282 __main__:129 server: Shutting down ModelsRoutingTable                                  
INFO     2025-03-17 17:25:13,284 __main__:129 server: Shutting down DatasetsRoutingTable                                
INFO     2025-03-17 17:25:13,286 __main__:129 server: Shutting down DatasetIORouter                                     
INFO     2025-03-17 17:25:13,287 __main__:129 server: Shutting down TelemetryAdapter                                    
INFO     2025-03-17 17:25:13,288 __main__:129 server: Shutting down InferenceRouter                                     
INFO     2025-03-17 17:25:13,290 __main__:129 server: Shutting down ShieldsRoutingTable                                 
INFO     2025-03-17 17:25:13,291 __main__:129 server: Shutting down SafetyRouter                                        
INFO     2025-03-17 17:25:13,292 __main__:129 server: Shutting down VectorDBsRoutingTable                               
INFO     2025-03-17 17:25:13,293 __main__:129 server: Shutting down VectorIORouter                                      
INFO     2025-03-17 17:25:13,294 __main__:129 server: Shutting down ToolGroupsRoutingTable                              
INFO     2025-03-17 17:25:13,295 __main__:129 server: Shutting down ToolRuntimeRouter                                   
INFO     2025-03-17 17:25:13,296 __main__:129 server: Shutting down MetaReferenceAgentsImpl                             
INFO     2025-03-17 17:25:13,297 __main__:129 server: Shutting down ScoringFunctionsRoutingTable                        
INFO     2025-03-17 17:25:13,298 __main__:129 server: Shutting down ScoringRouter                                       
INFO     2025-03-17 17:25:13,299 __main__:129 server: Shutting down BenchmarksRoutingTable                              
INFO     2025-03-17 17:25:13,300 __main__:129 server: Shutting down EvalRouter                                          
INFO     2025-03-17 17:25:13,301 __main__:129 server: Shutting down DistributionInspectImpl                             
INFO     2025-03-17 17:25:13,303 __main__:129 server: Shutting down ProviderImpl                                        
WARNING  2025-03-17 17:25:13,304 __main__:134 server: No shutdown method for ProviderImpl                               
INFO:     Application shutdown complete.
INFO:     Finished server process [1]
```

## Test Plan
Start a server and shut it down

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-03-17 14:55:40 -07:00
cdgamarose-nv
252a487085
feat: added nvidia as safety provider (#1248)
# What does this PR do?
Adds nvidia as a safety provider by interfacing with the nemo guardrails
microservice.
This enables checking user’s input or the LLM’s output against input and
output guardrails by using the `/v1/guardrails/checks` endpoint of the[
guardrails
API.](https://developer.nvidia.com/docs/nemo-microservices/guardrails/source/guides/checks-guide.html)

## Test Plan
Deploy nemo guardrails service following the documentation:
https://developer.nvidia.com/docs/nemo-microservices/guardrails/source/getting-started/deploy-docker.html

### Standalone:
```bash
(venv) local-cdgamarose@a1u1g-rome-0153:~/llama-stack$ pytest -v -s llama_stack/providers/tests/safety/test_safety.py --providers inference=nvidia,safety=nvidia --safety-shield meta/llama-3.1-8b-instruct

=================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /localhome/local-cdgamarose/llama-stack/venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.10.12', 'Platform': 'Linux-5.15.0-122-generic-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0', 'html': '4.1.1'}}
rootdir: /localhome/local-cdgamarose/llama-stack
configfile: pyproject.toml
plugins: metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0, html-4.1.1
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items

llama_stack/providers/tests/safety/test_safety.py::TestSafety::test_shield_list[--inference=nvidia:safety=nvidia] Initializing NVIDIASafetyAdapter(http://0.0.0.0:7331)...
PASSED
llama_stack/providers/tests/safety/test_safety.py::TestSafety::test_run_shield[--inference=nvidia:safety=nvidia] PASSED

============================================================================== 2 passed, 2 warnings in 4.78s ==============================================================================

```
### Distribution:
```
llama stack run llama_stack/templates/nvidia/run-with-safety.yaml
curl -v -X 'POST' "http://localhost:8321/v1/safety/run-shield" -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"shield_id": "meta/llama-3.1-8b-instruct", "messages":[{"role": "user", "content": "you are stupid"}]}'
{"violation":{"violation_level":"error","user_message":"Sorry I cannot do this.","metadata":{"self check input":{"status":"blocked"}}}}
```

[//]: # (## Documentation)

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-17 14:39:23 -07:00
Jeff MAURY
f11b6db40d
fix: build distribution with podman (#1671)
# What does this PR do?

Update the container build script so that it is compatible with podman.
The --progress=plain is now the default option and can be overriden.

## Test Plan
N/A

[//]: # (## Documentation)

Signed-off-by: Jeff MAURY <jmaury@redhat.com>
2025-03-17 14:30:06 -07:00
Sarthak Deshpande
dfa11a1216
fix: fixed import error (#1637)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
The generate_response_prompt had an import error, fixed that error.

Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
2025-03-17 17:04:47 -04:00
yyymeta
fb418813fc
fix: passthrough impl response.content.text (#1665)
# What does this PR do?
current passthrough impl returns chatcompletion_message.content as a
TextItem() , not a straight string. so it's not compatible with other
providers, and causes parsing error downstream.

change away from the generic pydantic conversion, and explicitly parse
out content.text

## Test Plan

setup llama server with passthrough

```
llama-stack-client eval run-benchmark "MMMU_Pro_standard"   --model-id    meta-llama/Llama-3-8B   --output-dir /tmp/   --num-examples 20
```
works without parsing error
2025-03-17 13:42:08 -07:00
yyymeta
a626b7bce3
feat: [new open benchmark] BFCL_v3 (#1578)
# What does this PR do?
create a new dataset BFCL_v3 from
https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html

overall each question asks the model to perform a task described in
natural language, and additionally a set of available functions and
their schema are given for the model to choose from. the model is
required to write the function call form including function name and
parameters , to achieve the stated purpose. the results are validated
against provided ground truth, to make sure that the generated function
call and the ground truth function call are syntactically and
semantically equivalent, by checking their AST .



## Test Plan

start server by 

```
llama stack run ./llama_stack/templates/ollama/run.yaml
```

then send traffic
```
 llama-stack-client eval run-benchmark "bfcl"  --model-id   meta-llama/Llama-3.2-3B-Instruct    --output-dir /tmp/gpqa    --num-examples   2
```




[//]: # (## Documentation)
2025-03-14 12:50:49 -07:00
Charlie Doern
78d4872c0c
feat: add support for logging config in the run.yaml (#1408)
# What does this PR do?

a user should be able to store a static logging configuration outside of
their environment. This would make sense to store in the run yaml given
that we store other things like server configuration in there.

The environment variable settings override the config settings if both
are available.

The format in the config looks like this:

```
logging_config:
  category_levels:
    VALID_CATEGORY: VALID_STRING_LOG_LEVEL
```

any specified category out of the following:

`core | server | router | inference | agents | safety | eval | tools |
client`

combined with any of the following log levels:

`debug | info | warning | error | critical`

can be placed in the category_levels list in order to achieve the
desired log level

## Test Plan

Test locally with a run config like the following:

```
version: '2'
image_name: ollama
logging_config:
  category_levels:
      server: debug
apis:
...
```

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-14 12:36:25 -07:00
Xi Yan
33b096cc21
fix: OpenAPI with provider get (#1627)
# What does this PR do?
- https://github.com/meta-llama/llama-stack/pull/1429 introduces
GetProviderResponse in OpenAPI, which is not needed, and not correctly
defined.

cc @cdoern 


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
llama-stack-client providers list
```
<img width="610" alt="image"
src="https://github.com/user-attachments/assets/2f7b62a5-daf2-4bf9-9505-69755c7025fc"
/>


[//]: # (## Documentation)
2025-03-13 19:56:32 -07:00
Yuan Tang
ca0cbf4338
fix: Fix pre-commit check (#1628)
# What does this PR do?

Fixes pre-commit check failure after merging
https://github.com/meta-llama/llama-stack/pull/1010:
3874877097

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-13 18:57:42 -07:00
Alina Ryan
c02464b635
fix: Clarify llama model prompt-format help text (#1010)
# What does this PR do?
Updates the help text for the `llama model prompt-format` command to
clarify that users should provide a specific model name (e.g.,
Llama3.1-8B, Llama3.2-11B-Vision), not a model family. Removes the
default value and field for `--model-name` to prevent users from
mistakenly thinking a model family name is acceptable. Adds guidance to
run `llama model list` to view valid model names.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Output of `llama model prompt-format -h` Before:
```
(venv) alina@fedora:~/dev/llama/llama-stack$ llama model prompt-format -h
usage: llama model prompt-format [-h] [-m MODEL_NAME]

Show llama model message formats

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model-name MODEL_NAME
                        Model Family (llama3_1, llama3_X, etc.)

Example:
    llama model prompt-format <options>
(venv) alina@fedora:~/dev/llama/llama-stack$ llama model prompt-format --model-name llama3_1
usage: llama model prompt-format [-h] [-m MODEL_NAME]
llama model prompt-format: error: llama3_1 is not a valid Model. Choose one from --
Llama3.1-8B
Llama3.1-70B
Llama3.1-405B
Llama3.1-8B-Instruct
Llama3.1-70B-Instruct
Llama3.1-405B-Instruct
Llama3.2-1B
Llama3.2-3B
Llama3.2-1B-Instruct
Llama3.2-3B-Instruct
Llama3.2-11B-Vision
Llama3.2-90B-Vision
Llama3.2-11B-Vision-Instruct
Llama3.2-90B-Vision-Instruct
```

Output of `llama model prompt-format -h` After:
```
(venv) alina@fedora:~/dev/llama/llama-stack$ llama model prompt-format -h
usage: llama model prompt-format [-h] [-m MODEL_NAME]

Show llama model message formats

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model-name MODEL_NAME
                        Example: Llama3.1-8B or Llama3.2-11B-Vision, etc
                        (Run `llama model list` to see a list of valid model names)

Example:
    llama model prompt-format <options>

```

Signed-off-by: Alina Ryan <aliryan@redhat.com>
2025-03-13 20:47:09 -04:00
Sébastien Han
98b1b15e0f
refactor: move all datetime.now() calls to UTC (#1589)
# What does this PR do?

Updated all instances of datetime.now() to use timezone.utc for
consistency in handling time across different systems. This ensures that
timestamps are always in Coordinated Universal Time (UTC), avoiding
issues with time zone discrepancies and promoting uniformity in
time-related data.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-13 15:34:53 -07:00
Charlie Doern
a062723d03
feat: add provider API for listing and inspecting provider info (#1429)
# What does this PR do?

currently the `inspect` API for providers is really a `list` API. Create
a new `providers` API which has a GET `providers/{provider_id}` inspect
API
which returns "user friendly" configuration to the end user. Also add a
GET `/providers` endpoint which returns the list of providers as
`inspect/providers` does today.

This API follows CRUD and is more intuitive/RESTful.

This work is part of the RFC at
https://github.com/meta-llama/llama-stack/pull/1359

sensitive fields are redacted using `redact_sensetive_fields` on the
server side before returning a response:

<img width="456" alt="Screenshot 2025-03-13 at 4 40 21 PM"
src="https://github.com/user-attachments/assets/9465c221-2a26-42f8-a08a-6ac4a9fecce8"
/>


## Test Plan

using https://github.com/meta-llama/llama-stack-client-python/pull/181 a
user is able to to run the following:

`llama stack build --template ollama --image-type venv`
`llama stack run --image-type venv
~/.llama/distributions/ollama/ollama-run.yaml`
`llama-stack-client providers inspect ollama`

<img width="378" alt="Screenshot 2025-03-13 at 4 39 35 PM"
src="https://github.com/user-attachments/assets/8273d05d-8bc3-44c6-9e4b-ef95e48d5466"
/>


also, was able to run the new test_list integration test locally with
ollama:

<img width="1509" alt="Screenshot 2025-03-13 at 11 03 40 AM"
src="https://github.com/user-attachments/assets/9b9db166-f02f-45b0-86a4-306d85149bc8"
/>

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-13 15:07:21 -07:00
Xi Yan
9617468d13
fix: passthrough provider template + fix (#1612)
# What does this PR do?

- Fix issue w/ passthrough provider


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
llama stack run

[//]: # (## Documentation)
2025-03-13 09:44:26 -07:00
Ashwin Bharambe
d072b5fa0c
test: add unit test to ensure all config types are instantiable (#1601) 2025-03-12 22:29:58 -07:00
ehhuang
a505bf45a3
feat(api): remove tool_name from ToolResponseMessage (#1599)
Summary:
This is not used anywhere.

closes #1421 

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct --record-responses
2025-03-12 19:41:48 -07:00
ehhuang
6bfcb65343
test: code exec on mac (#1549)
Summary:
1. adds option to not use bwrap for code execution
2. disable bwrap when running tests on macs

Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct
```

Verify code_interpreter result in logs

INFO 2025-03-11 08:10:39,858
llama_stack.providers.inline.agents.meta_reference.agent_instance:1032
agents: tool
call code_interpreter completed with result:
content='completed\n\n541\n' error_message=None error_code=None
         metadata=None
2025-03-12 19:21:53 -07:00
ehhuang
ed6caead72
chore: simplify _get_tool_defs (#1384)
Summary:

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-12 18:51:18 -07:00
ehhuang
41c9bca1aa
chore: refactor Agent toolgroup processing (#1381)
Summary:
Refactoring only.

Centralize logic to preprocess toolgroup to one place. 

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/api/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1381).
* #1384
* __->__ #1381
2025-03-12 18:48:03 -07:00
Dinesh Yeduguru
99bbe0e70b
feat: Add new compact MetricInResponse type (#1593)
# What does this PR do?
This change adds a compact type to include metrics in response as
opposed to the full MetricEvent which is relevant for internal logging
purposes.

## Test Plan
```
LLAMA_STACK_CONFIG=~/.llama/distributions/fireworks/fireworks-run.yaml pytest -s -v agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct

 llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml

curl --request POST \
  --url http://localhost:8321/v1/inference/chat-completion \
  --header 'content-type: application/json' \
  --data '{
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": {
        "type": "text",
        "text": "where do humans live"
      }
    }
  ],
  "stream": false
}'

{
  "metrics": [
    {
      "metric": "prompt_tokens",
      "value": 10,
      "unit": null
    },
    {
      "metric": "completion_tokens",
      "value": 522,
      "unit": null
    },
    {
      "metric": "total_tokens",
      "value": 532,
      "unit": null
    }
  ],
  "completion_message": {
    "role": "assistant",
    "content": "Humans live in various parts of the world...............",
    "stop_reason": "out_of_tokens",
    "tool_calls": []
  },
  "logprobs": null
}
```
2025-03-12 15:45:44 -07:00
ehhuang
1311faf3f5
fix: logging (#1598)
Summary:

Test Plan:
2025-03-12 14:57:31 -07:00
Dinesh Yeduguru
0fdb15bcc7
fix: fix build error in context.py (#1595)
# What does this PR do?
This fixes the build error


## Test Plan
pre-commit run --all-files
check for merge
conflicts................................................Passed
trim trailing
whitespace.................................................Passed
check for added large
files..............................................Passed
fix end of
files.........................................................Passed
Insert license in
comments...............................................Passed

ruff.....................................................................Passed

ruff-format..............................................................Passed

blacken-docs.............................................................Passed

uv-lock..................................................................Passed

uv-export................................................................Passed

mypy.....................................................................Passed
Distribution Template
Codegen............................................Passed
2025-03-12 13:26:23 -07:00
ehhuang
b7a9c45477
chore: deprecate ToolResponseMessage in agent.resume API (#1566)
# Summary:
closes #1431 

# Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-12 12:10:21 -07:00
Dinesh Yeduguru
58d08d100e
feat: Add back inference metrics and preserve context variables across asyncio boundary (#1552)
# What does this PR do?
This PR adds back the changes in #1300  which were reverted in  #1476 .

It also adds logic to preserve context variables across asyncio
boundary. this is needed with the library client since the async
generator logic yields control to code outside the event loop, and on
resuming, does not have the same context as before and this requires
preserving the context vars.

address #1477 
## Test Plan


```
 curl --request POST \
  --url http://localhost:8321/v1/inference/chat-completion \
  --header 'content-type: application/json' \
  --data '{
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": {
        "type": "text",
        "text": "where do humans live"
      }
    }
  ],
  "stream": false
}' | jq .

{
  "metrics": [
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549084Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "prompt_tokens",
      "value": 10,
      "unit": "tokens"
    },
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549449Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "completion_tokens",
      "value": 369,
      "unit": "tokens"
    },
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549457Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "total_tokens",
      "value": 379,
      "unit": "tokens"
    }
  ],
  "completion_message": {
    "role": "assistant",
    "content": "Humans live on the planet Earth, specifically on its landmasses and in its oceans. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica ( temporary residents, mostly scientists and researchers)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near coastlines, rivers, or other bodies of water.\n4. **Rural areas:** Some humans live in rural areas, such as villages, farms, and countryside.\n5. **Islands:** Humans inhabit many islands around the world, including those in the Pacific, Indian, and Atlantic Oceans.\n6. **Mountains and highlands:** Humans live in mountainous regions, such as the Himalayas, the Andes, and the Rocky Mountains.\n7. **Deserts:** Some humans live in desert regions, such as the Sahara, the Mojave, and the Atacama.\n8. **Coastal areas:** Many humans live in coastal areas, such as beaches, ports, and coastal cities.\n9. **Underwater habitats:** A few humans live in underwater habitats, such as research stations and submarines.\n10. **Space:** A small number of humans have lived in space, including astronauts on the International Space Station and those who have visited the Moon.\n\nOverall, humans can be found living in almost every environment on Earth, from the frozen tundra to the hottest deserts, and from the highest mountains to the deepest oceans.",
    "stop_reason": "end_of_turn",
    "tool_calls": []
  },
  "logprobs": null
}

```

Orignal repro no longer showing any error:
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
```

client logs:
https://gist.github.com/dineshyv/047c7e87b18a5792aa660e311ea53166
server logs:
https://gist.github.com/dineshyv/97a2174099619e9916c7c490be26e559
2025-03-12 12:01:03 -07:00
Xi Yan
c7139b0b67
fix: fix precommit (#1594)
# What does this PR do?

- fix precommit

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
CI

[//]: # (## Documentation)
2025-03-12 11:59:21 -07:00
Botao Chen
90ca4d94de
fix: fix passthrough inference provider to make it work for agent (#1577)
## What does this PR do?
We noticed that the passthrough inference provider doesn't work agent
due to the type mis-match between client and server. We manually cast
the llama stack client type to llama stack server type to fix the issue.

## test 
run `python -m examples.agents.hello localhost 8321` within
llama-stack-apps

<img width="1073" alt="Screenshot 2025-03-11 at 8 43 44 PM"
src="https://github.com/user-attachments/assets/bd1bdd31-606a-420c-a249-95f6184cc0b1"
/>

fix https://github.com/meta-llama/llama-stack/issues/1560
2025-03-12 11:16:17 -07:00
Botao Chen
0b0be70605
feat: Add open benchmark template codegen (#1579)
## What does this PR do?

As title, add codegen for open-benchmark template

## test 

checked the new generated run.yaml file and it's identical before and
after the change

Also add small improvement to together template so that missing
TOGETHER_API_KEY won't crash the server which is the consistent user
experience as other remote providers
2025-03-12 11:12:08 -07:00
Charlie Doern
4eee349acd
fix: respect log_level in uvicorn and third party libs (#1524)
# What does this PR do?

uvicorn has a `log_level` arg in uvicorn.run, pass in the effective
level set by the logger.

Additionally, third party libraries like httpx are using our logging
format, but not honoring our log level.

This seems unintended, so loop through all items in the loggerDict and
apply the same log level as what we have set.


## Test Plan

before:

```
llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
Environment variable LLAMA_STACK_LOGGING found: all=warn
Using virtual environment: /Users/charliedoern/projects/Documents/llama-stack/venv
+ python -m llama_stack.distribution.server.server --yaml-config /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml --port 8321
Environment variable LLAMA_STACK_LOGGING found: all=warn
WARNING  2025-03-10 16:05:49,706 root:71 uncategorized: Warning: `bwrap` is not available. Code interpreter tool will
         not work correctly.
INFO     2025-03-10 16:05:49,916 datasets:54 uncategorized: PyTorch version 2.5.1 available.
INFO     2025-03-10 16:05:50,010 httpx:1740 uncategorized: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200
         OK"
INFO     2025-03-10 16:05:50,297 httpx:1740 uncategorized: HTTP Request: POST http://localhost:11434/api/pull "HTTP/1.1
         200 OK"
INFO     2025-03-10 16:05:50,314 httpx:1740 uncategorized: HTTP Request: GET http://localhost:11434/api/tags "HTTP/1.1
         200 OK"
INFO:     Started server process [89663]
INFO:     Waiting for application startup.
INFO:     ASGI 'lifespan' protocol appears unsupported.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```

after:

```
llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
Environment variable LLAMA_STACK_LOGGING found: all=warn
Using virtual environment: /Users/charliedoern/projects/Documents/llama-stack/venv
+ python -m llama_stack.distribution.server.server --yaml-config /Users/charliedoern/.llama/distributions/ollama/ollama-run.yaml --port 8321
Environment variable LLAMA_STACK_LOGGING found: all=warn
WARNING  2025-03-10 16:05:20,429 root:71 uncategorized: Warning: `bwrap` is not available. Code interpreter tool will
         not work correctly.
INFO     2025-03-10 16:05:20,639 datasets:54 uncategorized: PyTorch version 2.5.1 available.
```

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-12 11:07:28 -07:00
ehhuang
59dddafd12
feat: convert typehints from client_tool to litellm format (#1565)
Summary:
supports
https://github.com/meta-llama/llama-stack-client-python/pull/193

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-11 20:02:11 -07:00
Josh Salomon
5f90be5388
fix: Fixed bad file name in inline::localfs (#1358)
Bug https://github.com/meta-llama/llama-stack/issues/1357

# What does this PR do?
Fix a bug of a wrong file name in inline::localfs datasetio provider

[//]: # (If resolving an issue, uncomment and update the line below)
# (Closes #1357)

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: Josh Salomon <jsalomon@redhat.com>
2025-03-11 12:46:11 -07:00
Xi Yan
43044f29e2
fix: fix llama stack run with missing agent impl (#1559)
# What does this PR do?

- recent merge https://github.com/meta-llama/llama-stack/pull/1410
introduce error
```
ValueError: Provider meta-reference (Api.agents) does not implement the following methods:
[('list_agent_sessions', 'not_actually_implemented'), ('list_agents', 'not_actually_implemented')]
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
llama stack run
```

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/agents/test_agents.py --text-model meta-llama/Llama-3.3-70B-Instruct
```

1379530386

[//]: # (## Documentation)
2025-03-11 11:22:22 -07:00
Dinesh Yeduguru
85501ed875
fix: remove Llama-3.2-1B-Instruct for fireworks (#1558)
# What does this PR do?
remove Llama-3.2-1B-Instruct for fireworks as its no longer appears to
be hosted on website.


## Test Plan

python distro_codegen.py
2025-03-11 11:19:29 -07:00
Charlie Doern
b647ecd9ed
feat: add support for LLAMA_STACK_LOG_FILE (#1450)
# What does this PR do?

setting $LLAMA_STACK_LOG_FILE will pipe the logs to a file as well as
stdout. this is done by using a logging FileHandler

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-11 11:09:31 -07:00
Sébastien Han
83a2c78615
feat(api): list agents / sessions and get agent (#1410)
# What does this PR do?

Add support for listing agents, describing an agent, and retrieving
session IDs for a given agent. This is only the API definition, the
implementations will come separately.

Closes: https://github.com/meta-llama/llama-stack/issues/1294

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-11 10:33:46 -07:00
Ihar Hrachyshka
aca82df7ed
fix: Multiple fixes for server shutdown (fix lifespan handling; fix handling CancelledError when raised by provider; let uvicorn handle signals) (#1495)
# What does this PR do?

If implementation raises CancelledError (e.g. when it runs its own async
loop for jobs), the main server shutdown handler gets confused and
doesn't attempt to shut down the main loop tasks.

While at it, also fixing the following failure when this happens:

```
UnboundLocalError: cannot access local variable 'loop' where it is not
associated with a value
```

Shutdown handlers were not running because lifespan logic was broken
since ~Oct 2024. Fixed that too and enforcing `lifespan` now (making
sure server will crash when it fails to interact with app through
middleware).

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Spotted while working on
https://github.com/meta-llama/llama-stack/pull/1437

One way to trigger it without the PR above is to add `raise
CancelledError` in
any of the running providers' `shutdown` methods; then `kill -INT <pid>`
the
server process.

Validated this with the following test patch:

```
diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py
index b85c463a..10dad83e 100644
--- a/llama_stack/distribution/server/server.py
+++ b/llama_stack/distribution/server/server.py
@@ -174,6 +174,7 @@ def handle_signal(app, signum, _) -> None:
         except asyncio.CancelledError:
             pass
         finally:
+            logger.info("Stopping event loop")
             loop.stop()
 
     loop = asyncio.get_running_loop()
diff --git a/llama_stack/providers/inline/post_training/torchtune/post_training.py b/llama_stack/providers/inline/post_training/torchtune/post_training.py
index b837362d..163f43d8 100644
--- a/llama_stack/providers/inline/post_training/torchtune/post_training.py
+++ b/llama_stack/providers/inline/post_training/torchtune/post_training.py
@@ -3,6 +3,7 @@
 #
 # This source code is licensed under the terms described in the LICENSE file in
 # the root directory of this source tree.
+import asyncio
 from datetime import datetime
 from typing import Any, Dict, Optional
 
@@ -43,6 +44,9 @@ class TorchtunePostTrainingImpl:
         self.jobs = {}
         self.checkpoints_dict = {}
 
+    async def shutdown(self) -> None:
+        raise asyncio.CancelledError("Shutdown")
+
     async def supervised_fine_tune(
         self,
         job_uuid: str,
```

Without the fix:

```
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Finished server process [52099]
INFO     2025-03-07 23:25:33,548 __main__:143 server: Received signal SIGINT (2). Exiting gracefully...
INFO     2025-03-07 23:25:33,550 __main__:150 server: Shutting down DatasetsRoutingTable
INFO     2025-03-07 23:25:33,551 __main__:177 server: Stopping event loop
ERROR    2025-03-07 23:25:33,552 asyncio:1785 uncategorized: unhandled exception during asyncio.run() shutdown
         task: <Task finished name='Task-12' coro=<handle_signal.<locals>.shutdown() done, defined at
         /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:145>
         exception=UnboundLocalError("cannot access local variable 'loop' where it is not associated with a value")>
         ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
         │ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:178 in shutdown           │
         │                                                                                                             │
         │   175 │   │   │   pass                                                                                      │
         │   176 │   │   finally:                                                                                      │
         │   177 │   │   │   logger.info("Stopping event loop")                                                        │
         │ ❱ 178 │   │   │   loop.stop()                                                                               │
         │   179 │                                                                                                     │
         │   180 │   loop = asyncio.get_running_loop()                                                                 │
         │   181 │   loop.create_task(shutdown())                                                                      │
         ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
         UnboundLocalError: cannot access local variable 'loop' where it is not associated with a value

```

With the fix, now seeing the following messages when the server is
killed:

```
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Finished server process [50836]
INFO     2025-03-07 23:20:35,182 __main__:143 server: Received signal SIGINT (2). Exiting gracefully...
INFO     2025-03-07 23:20:35,184 __main__:149 server: Shutting down DatasetsRoutingTable
ERROR    2025-03-07 23:20:35,185 __main__:158 server: Failed to shutdown DatasetsRoutingTable: {CancelledError()}
         ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
         │ /usr/lib64/python3.11/asyncio/tasks.py:476 in wait_for                                                      │
         │                                                                                                             │
         │   473 │   try:                                                                                              │
         │   474 │   │   # wait until the future completes or the timeout                                              │
         │   475 │   │   try:                                                                                          │
         │ ❱ 476 │   │   │   await waiter                                                                              │
         │   477 │   │   except exceptions.CancelledError:                                                             │
         │   478 │   │   │   if fut.done():                                                                            │
         │   479 │   │   │   │   return fut.result()                                                                   │
         ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
         CancelledError

         During handling of the above exception, another exception occurred:

         ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
         │ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:152 in shutdown           │
         │                                                                                                             │
         │   149 │   │   │   logger.info("Shutting down %s", impl_name)                                                │
         │   150 │   │   │   try:                                                                                      │
         │   151 │   │   │   │   if hasattr(impl, "shutdown"):                                                         │
         │ ❱ 152 │   │   │   │   │   await asyncio.wait_for(impl.shutdown(), timeout=5)                                │
         │   153 │   │   │   │   else:                                                                                 │
         │   154 │   │   │   │   │   logger.warning("No shutdown method for %s", impl_name)                            │
         │   155 │   │   │   except asyncio.TimeoutError:                                                              │
         │                                                                                                             │
         │ /usr/lib64/python3.11/asyncio/tasks.py:479 in wait_for                                                      │
         │                                                                                                             │
         │   476 │   │   │   await waiter                                                                              │
         │   477 │   │   except exceptions.CancelledError:                                                             │
         │   478 │   │   │   if fut.done():                                                                            │
         │ ❱ 479 │   │   │   │   return fut.result()                                                                   │
         │   480 │   │   │   else:                                                                                     │
         │   481 │   │   │   │   fut.remove_done_callback(cb)                                                          │
         │   482 │   │   │   │   # We must ensure that the task is not running                                         │
         │                                                                                                             │
         │ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/routers/routing_tables.py:131 in shutdown  │
         │                                                                                                             │
         │   128 │   │   │   elif api == Api.tool_runtime:                                                             │
         │   129 │   │   │   │   p.tool_store = self                                                                   │
         │   130 │                                                                                                     │
         │ ❱ 131 │   async def shutdown(self) -> None:                                                                 │
         │   132 │   │   for p in self.impls_by_provider_id.values():                                                  │
         │   133 │   │   │   await p.shutdown()                                                                        │
         │   134                                                                                                       │
         ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
         CancelledError
INFO     2025-03-07 23:20:35,295 __main__:149 server: Shutting down DatasetIORouter
INFO     2025-03-07 23:20:35,296 __main__:149 server: Shutting down ScoringFunctionsRoutingTable
INFO     2025-03-07 23:20:35,297 __main__:149 server: Shutting down ScoringRouter
INFO     2025-03-07 23:20:35,298 __main__:149 server: Shutting down ModelsRoutingTable
INFO     2025-03-07 23:20:35,299 __main__:149 server: Shutting down InferenceRouter
INFO     2025-03-07 23:20:35,300 __main__:149 server: Shutting down ShieldsRoutingTable
INFO     2025-03-07 23:20:35,300 __main__:149 server: Shutting down SafetyRouter
INFO     2025-03-07 23:20:35,301 __main__:149 server: Shutting down VectorDBsRoutingTable
INFO     2025-03-07 23:20:35,302 __main__:149 server: Shutting down VectorIORouter
INFO     2025-03-07 23:20:35,303 __main__:149 server: Shutting down ToolGroupsRoutingTable
INFO     2025-03-07 23:20:35,304 __main__:149 server: Shutting down ToolRuntimeRouter
INFO     2025-03-07 23:20:35,304 __main__:149 server: Shutting down MetaReferenceAgentsImpl
INFO     2025-03-07 23:20:35,305 __main__:149 server: Shutting down TelemetryAdapter
INFO     2025-03-07 23:20:35,306 __main__:149 server: Shutting down TorchtunePostTrainingImpl
ERROR    2025-03-07 23:20:35,307 __main__:158 server: Failed to shutdown TorchtunePostTrainingImpl:
         {CancelledError('Shutdown')}
         ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
         │ /home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py:152 in shutdown           │
         │                                                                                                             │
         │   149 │   │   │   logger.info("Shutting down %s", impl_name)                                                │
         │   150 │   │   │   try:                                                                                      │
         │   151 │   │   │   │   if hasattr(impl, "shutdown"):                                                         │
         │ ❱ 152 │   │   │   │   │   await asyncio.wait_for(impl.shutdown(), timeout=5)                                │
         │   153 │   │   │   │   else:                                                                                 │
         │   154 │   │   │   │   │   logger.warning("No shutdown method for %s", impl_name)                            │
         │   155 │   │   │   except asyncio.TimeoutError:                                                              │
         │                                                                                                             │
         │ /usr/lib64/python3.11/asyncio/tasks.py:489 in wait_for                                                      │
         │                                                                                                             │
         │   486 │   │   │   │   raise                                                                                 │
         │   487 │   │                                                                                                 │
         │   488 │   │   if fut.done():                                                                                │
         │ ❱ 489 │   │   │   return fut.result()                                                                       │
         │   490 │   │   else:                                                                                         │
         │   491 │   │   │   fut.remove_done_callback(cb)                                                              │
         │   492 │   │   │   # We must ensure that the task is not running                                             │
         │                                                                                                             │
         │ /home/ec2-user/src/llama-stack/schedule/llama_stack/providers/inline/post_training/torchtune/post_training. │
         │ py:48 in shutdown                                                                                           │
         │                                                                                                             │
         │    45 │   │   self.checkpoints_dict = {}                                                                    │
         │    46 │                                                                                                     │
         │    47 │   async def shutdown(self) -> None:                                                                 │
         │ ❱  48 │   │   raise asyncio.CancelledError("Shutdown")                                                      │
         │    49 │                                                                                                     │
         │    50 │   async def supervised_fine_tune(                                                                   │
         │    51 │   │   self,                                                                                         │
         ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
         CancelledError: Shutdown
INFO     2025-03-07 23:20:35,352 __main__:149 server: Shutting down BenchmarksRoutingTable
INFO     2025-03-07 23:20:35,353 __main__:149 server: Shutting down EvalRouter
INFO     2025-03-07 23:20:35,354 __main__:149 server: Shutting down DistributionInspectImpl
INFO     2025-03-07 23:20:35,355 __main__:177 server: Stopping event loop
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py", line 488, in <module>
    main()
  File "/home/ec2-user/src/llama-stack/schedule/llama_stack/distribution/server/server.py", line 476, in main
    uvicorn.run(**uvicorn_config)
  File "/home/ec2-user/src/llama-stack/schedule/venv/lib64/python3.11/site-packages/uvicorn/main.py", line 579, in run
    server.run()
  File "/home/ec2-user/src/llama-stack/schedule/venv/lib64/python3.11/site-packages/uvicorn/server.py", line 66, in run
    return asyncio.run(self.serve(sockets=sockets))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/asyncio/runners.py", line 189, in run
    with Runner(debug=debug) as runner:
  File "/usr/lib64/python3.11/asyncio/runners.py", line 63, in __exit__
    self.close()
  File "/usr/lib64/python3.11/asyncio/runners.py", line 71, in close
    _cancel_all_tasks(loop)
  File "/usr/lib64/python3.11/asyncio/runners.py", line 201, in _cancel_all_tasks
    loop.run_until_complete(tasks.gather(*to_cancel, return_exceptions=True))
  File "/usr/lib64/python3.11/asyncio/base_events.py", line 652, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.
++ error_handler 104
++ echo 'Error occurred in script at line: 104'
Error occurred in script at line: 104
++ exit 1
```

With all patches included, the shutdown now looks as follows:

```
$ kill -INT $(ps ax | grep  llama_stack.distribution.server.server | grep -v nvim | awk -e '{print $1}' | sort | head -n 1)
```

```
20:56:09.308 [START]
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO     2025-03-10 20:56:43,961 __main__:140 server: Shutting down
INFO     2025-03-10 20:56:43,962 __main__:124 server: Shutting down DatasetsRoutingTable
INFO     2025-03-10 20:56:43,964 __main__:124 server: Shutting down DatasetIORouter
INFO     2025-03-10 20:56:43,965 __main__:124 server: Shutting down ScoringFunctionsRoutingTable
INFO     2025-03-10 20:56:43,966 __main__:124 server: Shutting down ScoringRouter
INFO     2025-03-10 20:56:43,967 __main__:124 server: Shutting down ModelsRoutingTable
INFO     2025-03-10 20:56:43,968 __main__:124 server: Shutting down InferenceRouter
INFO     2025-03-10 20:56:43,969 __main__:124 server: Shutting down ShieldsRoutingTable
INFO     2025-03-10 20:56:43,971 __main__:124 server: Shutting down SafetyRouter
INFO     2025-03-10 20:56:43,972 __main__:124 server: Shutting down VectorDBsRoutingTable
INFO     2025-03-10 20:56:43,973 __main__:124 server: Shutting down VectorIORouter
INFO     2025-03-10 20:56:43,974 __main__:124 server: Shutting down ToolGroupsRoutingTable
INFO     2025-03-10 20:56:43,975 __main__:124 server: Shutting down ToolRuntimeRouter
INFO     2025-03-10 20:56:43,976 __main__:124 server: Shutting down MetaReferenceAgentsImpl
INFO     2025-03-10 20:56:43,977 __main__:124 server: Shutting down TelemetryAdapter
INFO     2025-03-10 20:56:43,978 __main__:124 server: Shutting down TorchtunePostTrainingImpl
WARNING  2025-03-10 20:56:43,979 __main__:129 server: No shutdown method for TorchtunePostTrainingImpl
INFO     2025-03-10 20:56:43,979 __main__:124 server: Shutting down BenchmarksRoutingTable
INFO     2025-03-10 20:56:43,980 __main__:124 server: Shutting down EvalRouter
INFO     2025-03-10 20:56:43,981 __main__:124 server: Shutting down DistributionInspectImpl
INFO:     Application shutdown complete.
INFO:     Finished server process [33862]
```

[//]: # (## Documentation)

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-11 10:30:55 -07:00
Ihar Hrachyshka
c3d7d17bc4
chore: fix typing hints for get_provider_impl deps arguments (#1544)
# What does this PR do?

It's a dict that may contain different types, as per
resolver:instantiate_provider implementation. (AFAIU it also never
contains ProviderSpecs, but *instances* of provider implementations.)

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

mypy passing if enabled checks for these modules. (See #1543)

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-11 10:07:28 -07:00
Ihar Hrachyshka
0e73186a11
fix: Add missing shutdown handler for TorchtunePostTrainingImpl (#1535)
# What does this PR do?

Added missing shutdown handler. (Currently empty.)

Without it, when server shuts down, it posts the following warning:

```
__main__:129 server: No shutdown method for TorchtunePostTrainingImpl
```

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

(The test plan assumes shutdown logic is fixed, see #1495)

Without the patch:

```
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO     2025-03-10 20:56:43,961 __main__:140 server: Shutting down
INFO     2025-03-10 20:56:43,962 __main__:124 server: Shutting down DatasetsRoutingTable
INFO     2025-03-10 20:56:43,964 __main__:124 server: Shutting down DatasetIORouter
INFO     2025-03-10 20:56:43,965 __main__:124 server: Shutting down ScoringFunctionsRoutingTable
INFO     2025-03-10 20:56:43,966 __main__:124 server: Shutting down ScoringRouter
INFO     2025-03-10 20:56:43,967 __main__:124 server: Shutting down ModelsRoutingTable
INFO     2025-03-10 20:56:43,968 __main__:124 server: Shutting down InferenceRouter
INFO     2025-03-10 20:56:43,969 __main__:124 server: Shutting down ShieldsRoutingTable
INFO     2025-03-10 20:56:43,971 __main__:124 server: Shutting down SafetyRouter
INFO     2025-03-10 20:56:43,972 __main__:124 server: Shutting down VectorDBsRoutingTable
INFO     2025-03-10 20:56:43,973 __main__:124 server: Shutting down VectorIORouter
INFO     2025-03-10 20:56:43,974 __main__:124 server: Shutting down ToolGroupsRoutingTable
INFO     2025-03-10 20:56:43,975 __main__:124 server: Shutting down ToolRuntimeRouter
INFO     2025-03-10 20:56:43,976 __main__:124 server: Shutting down MetaReferenceAgentsImpl
INFO     2025-03-10 20:56:43,977 __main__:124 server: Shutting down TelemetryAdapter
INFO     2025-03-10 20:56:43,978 __main__:124 server: Shutting down TorchtunePostTrainingImpl
WARNING  2025-03-10 20:56:43,979 __main__:129 server: No shutdown method for TorchtunePostTrainingImpl
INFO     2025-03-10 20:56:43,979 __main__:124 server: Shutting down BenchmarksRoutingTable
INFO     2025-03-10 20:56:43,980 __main__:124 server: Shutting down EvalRouter
INFO     2025-03-10 20:56:43,981 __main__:124 server: Shutting down DistributionInspectImpl
INFO:     Application shutdown complete.
INFO:     Finished server process [33862]
```

Run with the patch and observe no warning:

```
$ kill -INT $(ps ax | grep  llama_stack.distribution.server.server | grep -v nvim | awk -e '{print $1}' | sort | head -n 1)
```

```
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO     2025-03-11 00:32:56,863 __main__:140 server: Shutting down
INFO     2025-03-11 00:32:56,864 __main__:124 server: Shutting down DatasetsRoutingTable
INFO     2025-03-11 00:32:56,866 __main__:124 server: Shutting down DatasetIORouter
INFO     2025-03-11 00:32:56,867 __main__:124 server: Shutting down ScoringFunctionsRoutingTable
INFO     2025-03-11 00:32:56,868 __main__:124 server: Shutting down ScoringRouter
INFO     2025-03-11 00:32:56,869 __main__:124 server: Shutting down ModelsRoutingTable
INFO     2025-03-11 00:32:56,870 __main__:124 server: Shutting down InferenceRouter
INFO     2025-03-11 00:32:56,871 __main__:124 server: Shutting down ShieldsRoutingTable
INFO     2025-03-11 00:32:56,872 __main__:124 server: Shutting down SafetyRouter
INFO     2025-03-11 00:32:56,873 __main__:124 server: Shutting down VectorDBsRoutingTable
INFO     2025-03-11 00:32:56,874 __main__:124 server: Shutting down VectorIORouter
INFO     2025-03-11 00:32:56,875 __main__:124 server: Shutting down ToolGroupsRoutingTable
INFO     2025-03-11 00:32:56,876 __main__:124 server: Shutting down ToolRuntimeRouter
INFO     2025-03-11 00:32:56,877 __main__:124 server: Shutting down MetaReferenceAgentsImpl
INFO     2025-03-11 00:32:56,878 __main__:124 server: Shutting down TelemetryAdapter
INFO     2025-03-11 00:32:56,879 __main__:124 server: Shutting down TorchtunePostTrainingImpl
INFO     2025-03-11 00:32:56,880 __main__:124 server: Shutting down BenchmarksRoutingTable
INFO     2025-03-11 00:32:56,881 __main__:124 server: Shutting down EvalRouter
INFO     2025-03-11 00:32:56,882 __main__:124 server: Shutting down DistributionInspectImpl

```

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-11 10:01:09 -07:00
Ashwin Bharambe
e13c92f269
revert: feat(server): Use system packages for execution (#1551)
Reverts meta-llama/llama-stack#1252

The above PR breaks the following invocation:
```bash
llama stack run ~/.llama/distributions/together/together-run.yaml
```
2025-03-11 09:58:25 -07:00
Dinesh Yeduguru
ead9397e22
fix: tracing fixes for trace context propogation across coroutines (#1522)
# What does this PR do?
This PR has two fixes needed for correct trace context propagation
across asycnio boundary
Fix 1: Start using context vars to store the global trace context.
This is needed since we cannot use the same trace context across
coroutines since the state is shared. each coroutine
should have its own trace context so that each of it can start storing
its state correctly.
Fix 2: Start a new span for each new coroutines started for running
shields to keep the span tree clean


## Test Plan

### Integration tests with server
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run
~/.llama/distributions/together/together-run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
server logs:
https://gist.github.com/dineshyv/51ac5d9864ed031d0d89ce77352821fe
test logs:
https://gist.github.com/dineshyv/e66acc1c4648a42f1854600609c467f3
 
### Integration tests with library client
LLAMA_STACK_CONFIG=fireworks pytest -s --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct

logs: https://gist.github.com/dineshyv/ca160696a0b167223378673fb1dcefb8

### Apps test with server:
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/together/together-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
```
server logs:
https://gist.github.com/dineshyv/1717a572d8f7c14279c36123b79c5797
app logs:
https://gist.github.com/dineshyv/44167e9f57806a0ba3b710c32aec02f8
2025-03-11 07:12:48 -07:00
Botao Chen
e3edca7739
feat: [new open benchmark] Math 500 (#1538)
## What does this PR do?
Created a new math_500 open-benchmark based on OpenAI's [Let's Verify
Step by Step](https://arxiv.org/abs/2305.20050) paper and hugging face's
[HuggingFaceH4/MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)
dataset.

The challenge part of this benchmark is to parse the generated and
expected answer and verify if they are same. For the parsing part, we
refer to [Minerva: Solving Quantitative Reasoning Problems with Language
Models](https://research.google/blog/minerva-solving-quantitative-reasoning-problems-with-language-models/).

To simply the parse logic, as the next step, we plan to also refer to
what [simple-eval](https://github.com/openai/simple-evals) is doing,
using llm as judge to check if the generated answer matches the expected
answer or not


## Test Plan
on sever side, spin up a server with open-benchmark template `llama
stack run llama_stack/templates/open-benchamrk/run.yaml`

on client side, issue an open benchmark eval request `llama-stack-client
--endpoint xxx eval run-benchmark "meta-reference-math-500" --model-id
"meta-llama/Llama-3.3-70B-Instruct" --output-dir "/home/markchen1015/"
--num-examples 20` and get ther aggregated eval results
<img width="238" alt="Screenshot 2025-03-10 at 7 57 04 PM"
src="https://github.com/user-attachments/assets/2c9da042-3b70-470e-a7c4-69f4cc24d1fb"
/>

check the generated answer and the related scoring and they make sense
2025-03-10 20:38:28 -07:00
Ashwin Bharambe
dc84bc755a
fix: revert to using faiss for ollama distro (#1530)
This is unfortunate because `sqlite-vec` seems promising. But its PIP
package is not quite complete. It does not have binary for arm64 (I
think, or maybe it even lacks 64 bit builds?) which results in the arm64
container resulting in
```
File "/usr/local/lib/python3.10/site-packages/sqlite_vec/init.py", line 17, in load
    conn.load_extension(loadable_path())
sqlite3.OperationalError: /usr/local/lib/python3.10/site-packages/sqlite_vec/vec0.so: wrong ELF class: ELFCLASS32
```

To get around I tried to install from source via `uv pip install
sqlite-vec --no-binary=sqlite-vec` however it even lacks a source
distribution which makes that impossible.

## Test Plan

Build the container locally using: 

```bash
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type container
```

Run the container as: 

```
podman run --privileged -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
    --env INFERENCE_MODEL=$INFERENCE_MODEL \
    --env OLLAMA_URL=http://host.containers.internal:11434 \
    -v ~/local/llama-stack:/app/llama-stack-source 
    localhost/distribution-ollama:dev --port $LLAMA_STACK_PORT
```

Verify the container starts up correctly. Without this patch, it would
encounter the ELFCLASS32 error.
2025-03-10 16:15:17 -07:00
Sébastien Han
21e39633d8
feat(server): Use system packages for execution (#1252)
# What does this PR do?

Users prefer to rely on the main CLI rather than invoking the server
through a Python module. Users interact with a high-level CLI rather
than needing to know internal module structures.

Now, when running llama stack run <path-to-config>, the server will
attempt to use the system package or a virtual environment if one is
active.

This also eliminates the current process dependency chain when running
from a virtual environment:

-> llama stack run
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -> start_env.sh

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
-> python -m server...

Signed-off-by: Sébastien Han <seb@redhat.com>

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Run:

```
ollama run llama3.2:3b-instruct-fp16 --keepalive=2m &
llama stack run ./llama_stack/templates/ollama/run.yaml --disable-ipv6
```

Notice that the server starts and shutdowns normally.

[//]: # (## Documentation)

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-10 16:01:03 -07:00
Charlie Doern
7559b4055e
chore: add color to Env Variable message (#1525)
# What does this PR do?

currently the `"Environment variable LLAMA_STACK_LOGGING found"` message
is printed with no color switch to cprint and highlight in yellow for
visibility

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-10 15:29:40 -07:00
ehhuang
0e3c0cf8de
fix: server logging (#1521)
Summary:

Test Plan:

ERROR 2025-03-10 10:53:00,804 __main__:239 server: Error executing
endpoint route='/v1/inference/chat-completion'
         method='post'
2025-03-10 15:25:23 -07:00
Sarthak Deshpande
921f8b1125
chore: Together async client (#1510)
# What does this PR do?
Uses together async client instead of sync client

[//]: # (If resolving an issue, uncomment and update the line below)

## Test Plan
Command to run the test is in the image below(2 tests fail, and they
were failing for the old stable version as well with the same errors.)
<img width="1689" alt="image"
src="https://github.com/user-attachments/assets/503db720-5379-425d-9844-0225010e41a1"
/>


[//]: # (## Documentation)

---------

Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
2025-03-10 15:25:01 -07:00
James Kunstle
735892cbd2
refactor: ImageType to LlamaStackImageType (#1500)
This disambiguates "Image" term from "container image" alternative usage
and allows for:

```python

if image_type == LlamaStackImagetype.venv:
	...

```

accesses rather than `ImageType.venv.value`

# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

Changes enum use to comply with semantic python styling and naming
conventions.

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

Refactor was automated and small so simple run-through of creating
images was done.

Signed-off-by: James Kunstle <jkunstle@redhat.com>
2025-03-10 17:12:53 -04:00
Reid
0b8cb830b9
docs: update ollama doc url (#1508)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

It should changed in this pr
https://github.com/meta-llama/llama-stack/pull/1190/files#diff-53e3f35ced54ee5e57dc8b0d3b04770ed84f2f6434c6f492f42569b3c2810ecd

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-10 13:04:59 -07:00
Sarthak Deshpande
a9c5d3cd3d
chore: made inbuilt tools blocking calls into async non blocking calls (#1509)
# What does this PR do?
This PR converts blocking calls for in built tools like wolfram, brave,
tavily and bing into non blocking async calls
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
pytest -s -v tool_runtime/test_builtin_tools.py --stack-config=together
--text-model=meta-llama/Llama-3.1-8B-Instruct
Used the command above to get the below results
<img width="1710" alt="image"
src="https://github.com/user-attachments/assets/76b0ca06-f6e4-45fa-a114-0449bef2325b"
/>


<img width="1389" alt="image"
src="https://github.com/user-attachments/assets/5220ccbb-7882-4240-b17e-f362ad46d25b"
/>

<img width="1432" alt="image"
src="https://github.com/user-attachments/assets/bb93a41e-e82a-4c98-a22d-6b0e320aa974"
/>

[//]: # (## Documentation)

---------

Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
2025-03-09 16:59:24 -07:00
Ashwin Bharambe
70ff226b6a fix(library_client): ensure pending asyncio tasks like generator athrow are executed 2025-03-09 16:17:27 -07:00
Ashwin Bharambe
205661bc78
fix: Use re-entrancy and concurrency safe context managers for provider data (#1498)
Concurrent requests should not trample (or reuse) each others' provider
data. Provider data should be scoped to each request.

## Test Plan

Set the uvicorn server to have a single worker process + thread by
updating the config:
```python
    uvicorn_config = {
        ...
        "workers": 1,
        "loop": "asyncio",
    }
```

Then perform the following steps on `origin/main` (without this change).

(1) Run the server using `llama stack run dev` without having
`FIREWORKS_API_KEY` in the environment.

(2) Run a test by specifying the FIREWORKS_API_KEY env var so it gets
stored in the thread local
```
pytest -s -v tests/integration/inference/test_text_inference.py \
    --stack-config http://localhost:8321 \
    --text-model accounts/fireworks/models/llama-v3p1-8b-instruct \
    -k test_text_chat_completion_with_tool_calling_and_streaming \
     --env FIREWORKS_API_KEY=<...>
``` 
Ensure you don't have any other API keys in the environment (otherwise
the bug will not reproduce due to other specifics in our testing code.)
Verify this works.

(3) Run the same command again without specifying FIREWORKS_API_KEY. See
that the request actually succeeds when it *should have failed*.


----
Now do the same tests on this branch, verify step (3) results in
failure.

Finally, run the full `test_text_inference.py` test suite with this
change, verify it succeeds.
2025-03-08 22:56:30 -08:00