Commit graph

600 commits

Author SHA1 Message Date
Xi Yan
bc0cd07008 Merge branch 'main' into eval_api_final 2025-03-26 12:29:45 -07:00
Ihar Hrachyshka
367c08f01e
feat(api): don't return a payload on file delete (#1640)
# What does this PR do?

This is to stay consistent with other APIs.

This change registers files in the API, even though there are still no
providers. It removes tests that require an existing provider for a merged
API to be enabled at the API layer.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>


## Test Plan

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-25 17:12:36 -07:00
ehhuang
2f38851751
chore: Revert "chore(telemetry): remove service_name entirely" (#1785)
Reverts meta-llama/llama-stack#1755 closes #1781
2025-03-25 14:42:05 -07:00
Rashmi Pawar
1a73f8305b
feat: Add nemo customizer (#1448)
# What does this PR do?

This PR adds support for NVIDIA's NeMo Customizer API to the Llama Stack
post-training module. The integration enables users to fine-tune models
using NVIDIA's cloud-based customization service through a consistent
Llama Stack interface.
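
For illustration, a minimal sketch of kicking off a customization job through the Llama Stack client; the method and parameter names below follow the post-training API but are assumptions and may not match the final NeMo Customizer integration exactly.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5002")

# Hypothetical call shape: argument names are illustrative, not guaranteed.
job = client.post_training.supervised_fine_tune(
    job_uuid="customizer-job-1",
    model="meta-llama/Llama-3.1-8B-Instruct",
    training_config={"n_epochs": 1, "data_config": {"dataset_id": "my_dataset", "batch_size": 8}},
    hyperparam_search_config={},
    logger_config={},
    checkpoint_dir="",
    algorithm_config=None,
)
print(job)
```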



## Test Plan
Yet to be done

Things pending under this PR:

- [x] Integration of the fine-tuned model (new checkpoint) for inference with
the nvidia llm distribution
- [x] Distribution integration of the API
- [x] Add test cases for customizer (in progress)
- [x] Documentation

```

LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/post_training/test_supervised_fine_tuning.py 

============================================================================================================================================================================ test session starts =============================================================================================================================================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0 -- /home/ubuntu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.0', 'Platform': 'Linux-6.8.0-1021-gcp-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'nbval': '0.11.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3'}}
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: nbval-0.11.0, metadata-3.1.1, anyio-4.8.0, html-4.1.1, asyncio-0.25.3
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items                                                                                                                                                                                                                                                                                                                                                            

tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_post_training_provider_registration[txt=8B] PASSED                                                                                                                                                                                                                                                 [ 50%]
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_list_training_jobs[txt=8B] PASSED                                                                                                                                                                                                                                                                  [100%]

======================================================================================================================================================================== 2 passed, 1 warning in 0.10s ========================================================================================================================================================================
```
cc: @mattf @dglogo @sumitb

---------

Co-authored-by: Ubuntu <ubuntu@llama-stack-customizer-dev-inst-2tx95fyisatvlic4we8hidx5tfj.us-central1-a.c.brevdevprod.internal>
2025-03-25 11:01:10 -07:00
Yuan Tang
441016bee8
feat: Support "stop" parameter in remote:vLLM (#1715)
# What does this PR do?

This adds support for the "stop" parameter:
https://platform.openai.com/docs/api-reference/completions/create#completions-create-stop
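
As a quick illustration of the semantics (a sketch assuming a vLLM server exposing its OpenAI-compatible API on localhost:8000): generation halts as soon as one of the stop strings would be produced, and the stop string itself is not included in the output.

```python
import httpx

# Sketch only; the base URL and model name are assumptions for this example.
response = httpx.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "List three colors:",
        "max_tokens": 64,
        # Generation stops when either string would be emitted;
        # the stop string is not included in the output.
        "stop": ["\n\n", "4."],
    },
    timeout=60,
)
print(response.json()["choices"][0]["text"])
```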

## Test Plan

```
tests/integration/inference/test_text_inference.py::test_text_completion_non_streaming[txt=8B-inference:completion:sanity] PASSED                                  [  5%]
tests/integration/inference/test_text_inference.py::test_text_completion_streaming[txt=8B-inference:completion:sanity] PASSED                                      [ 11%]
tests/integration/inference/test_text_inference.py::test_text_completion_stop_sequence[txt=8B-inference:completion:stop_sequence] PASSED                           [ 16%]
tests/integration/inference/test_text_inference.py::test_text_completion_log_probs_non_streaming[txt=8B-inference:completion:log_probs] PASSED                     [ 22%]
tests/integration/inference/test_text_inference.py::test_text_completion_log_probs_streaming[txt=8B-inference:completion:log_probs] PASSED                         [ 27%]
tests/integration/inference/test_text_inference.py::test_text_completion_structured_output[txt=8B-inference:completion:structured_output] PASSED                   [ 33%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=8B-inference:chat_completion:non_streaming_01] PASSED              [ 38%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=8B-inference:chat_completion:non_streaming_02] PASSED              [ 44%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_first_token_profiling[txt=8B-inference:chat_completion:ttft] PASSED                  [ 50%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_streaming[txt=8B-inference:chat_completion:streaming_01] PASSED                      [ 55%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_streaming[txt=8B-inference:chat_completion:streaming_02] PASSED                      [ 61%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[txt=8B-inference:chat_completion:tool_calling] PASSED [ 66%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[txt=8B-inference:chat_completion:tool_calling] PASSED [ 72%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[txt=8B-inference:chat_completion:tool_calling] PASSED      [ 77%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[txt=8B-inference:chat_completion:tool_calling] PASSED          [ 83%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] PASSED         [ 88%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-True] PASSED [ 94%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-False] PASSED [100%]

=============================================================== 18 passed, 3 warnings in 755.79s (0:12:35) ===============================================================
```

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-24 12:42:55 -07:00
Francisco Arceo
9e1ddf2b53
chore: Updating sqlite-vec to make non-blocking calls (#1762)
# What does this PR do?
This PR updates the sqlite-vec database calls to be non-blocking. Note
that each operation creates a new connection, which incurs some
performance overhead but is reasonable given [SQLite's threading and
connections constraints](https://www.sqlite.org/threadsafe.html).

Summary of changes:
- Refactored `SQLiteVecIndex` class to store database path instead of
connection object
- Added `_create_sqlite_connection()` helper function to create
connections on demand
- Ensured proper connection closure in all database operations
- Fixed test fixtures to use a file-based SQLite database for
thread-safety
- Updated the `SQLiteVecVectorIOAdapter` class to handle per-operation
connections

This PR helps chip away at
https://github.com/meta-llama/llama-stack/issues/1489
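
A minimal sketch of the per-operation connection pattern (the table and function names are illustrative; the real helper also loads the sqlite-vec extension):

```python
import asyncio
import sqlite3


def _create_sqlite_connection(db_path: str) -> sqlite3.Connection:
    # One connection per operation keeps each worker thread isolated,
    # which is the safe pattern under SQLite's threading constraints.
    return sqlite3.connect(db_path)


def _insert_chunks_blocking(db_path: str, rows: list[tuple[str, str]]) -> None:
    connection = _create_sqlite_connection(db_path)
    try:
        connection.executemany("INSERT INTO chunks (id, body) VALUES (?, ?)", rows)
        connection.commit()
    finally:
        connection.close()


async def insert_chunks(db_path: str, rows: list[tuple[str, str]]) -> None:
    # Offload the blocking sqlite work to a thread so the event loop stays responsive.
    await asyncio.to_thread(_insert_chunks_blocking, db_path, rows)
```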

## Test Plan
sqlite-vec unit tests passed locally as well as a test script using the
client as a library.

## Misc

FYI @varshaprasad96 @kevincogan

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-03-23 17:25:44 -07:00
Xi Yan
094eb6a5ae
feat(rag): entire document context with attachments (#1763)
# What does this PR do?
**What**
Instead of ad-hoc creating a vector DB and chunking when documents are sent
as attachments to an agent turn, we directly pass the raw text from the
document into the messages to the model as user context, and let the model
perform summarization directly.

This removes the magic behaviour and yields better performance than the
existing approach.
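
A rough sketch of the idea (plain Python with a hypothetical helper name): the attachment's raw text goes straight into the user message instead of being chunked into a vector DB.

```python
def build_turn_messages(question: str, document_text: str) -> list[dict]:
    # Inline the entire document as user context; no vector DB or chunking involved.
    return [
        {
            "role": "user",
            "content": (
                "Here is the full document for context:\n\n"
                f"{document_text}\n\n"
                f"Question: {question}"
            ),
        }
    ]
```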

**Improved Performance**
- RAG lifecycle notebook
  - Model: 0.3 factuality score
  - (+ websearch) Agent: 0.44 factuality score
  - (+ vector db) Agent: 0.3 factuality score
  - (+ raw context) Agent: 0.6 factuality score

Closes https://github.com/meta-llama/llama-stack/issues/1478


## Test Plan
- [NEW] Added a section in the RAG lifecycle notebook showing the improved performance

<img width="840" alt="image"
src="https://github.com/user-attachments/assets/a0c4e816-809a-41c0-9124-89825983e3f5"
/>


2025-03-23 16:57:48 -07:00
Xi Yan
7f12ea290f
feat(eval api): (2.3/n) remove scoring / eval impls + benchmarks (#1766)
# What does this PR do?
- Remove `/eval` and `/scoring` impls
- Clean up benchmarks. The benchmarks exist in the `llama-stack-evals`
repo.
- The rest of the grading functions will be added in a follow-up PR.


## Test Plan
- CI

2025-03-23 16:51:17 -07:00
Xi Yan
81bc051411 fix precommit 2025-03-23 16:32:06 -07:00
Xi Yan
5038f0e376 precommit 2025-03-23 16:27:56 -07:00
Xi Yan
64388de068 precommit 2025-03-23 16:15:08 -07:00
Xi Yan
3f8c7a584a precommit 2025-03-23 16:00:48 -07:00
Xi Yan
a54d757ade merge 2025-03-23 15:48:14 -07:00
ehhuang
06788643b3
feat(telemetry): clean up spans (#1760) 2025-03-21 20:05:11 -07:00
Dinesh Yeduguru
5eb15684b4
feat: use same trace ids in stack and otel (#1759)
# What does this PR do?
1) Uses OTEL-compatible id generation for the stack
2) The stack now returns trace id info in the response header
3) We inject the same trace id we already have into OTEL in order to force
it to use our trace ids (see the sketch below)
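
A minimal sketch of how a custom OpenTelemetry `IdGenerator` can force OTEL onto a trace id chosen by the stack (illustrative only; the actual wiring in the telemetry provider may differ):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.id_generator import IdGenerator, RandomIdGenerator


class FixedTraceIdGenerator(IdGenerator):
    """Always hands OTEL the trace id the stack already generated."""

    def __init__(self, trace_id: int):
        self._trace_id = trace_id
        self._random = RandomIdGenerator()

    def generate_trace_id(self) -> int:
        return self._trace_id

    def generate_span_id(self) -> int:
        return self._random.generate_span_id()


# e.g. the trace id returned in the x-trace-id response header, parsed as hex
stack_trace_id = int("595101ede31ece116ebe35b26d67e8cf", 16)
trace.set_tracer_provider(TracerProvider(id_generator=FixedTraceIdGenerator(stack_trace_id)))
```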

## Test Plan
```
 curl -i --request POST \
  --url http://localhost:8321/v1/inference/chat-completion \
  --header 'content-type: application/json' \
  --data '{
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": {
        "type": "text",
        "text": "where do humans live"
      }
    }
  ],
  "stream": false
}'
HTTP/1.1 200 OK
date: Fri, 21 Mar 2025 21:51:19 GMT
server: uvicorn
content-length: 1712
content-type: application/json
x-trace-id: 595101ede31ece116ebe35b26d67e8cf

{"metrics":[{"metric":"prompt_tokens","value":10,"unit":null},{"metric":"completion_tokens","value":320,"unit":null},{"metric":"total_tokens","value":330,"unit":null}],"completion_message":{"role":"assistant","content":"Humans live on the planet Earth, specifically on its landmasses and in its oceans. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica ( temporary residents, mostly scientists and researchers)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near coastlines, rivers, or other bodies of water.\n4. **Rural areas:** Some humans live in rural areas, such as villages, farms, and countryside.\n5. **Islands:** Humans inhabit many islands around the world, including tropical islands, island nations, and islands in the Arctic and Antarctic regions.\n6. **Underwater habitats:** A few humans live in underwater habitats, such as research stations and submarines.\n7. **Space:** A small number of humans have lived in space, including astronauts on the International Space Station and those who have visited the Moon.\n\nIn terms of specific environments, humans live in a wide range of ecosystems, including:\n\n* Deserts\n* Forests\n* Grasslands\n* Mountains\n* Oceans\n* Rivers\n* Tundras\n* Wetlands\n\nOverall, humans are incredibly adaptable and can be found living in almost every corner of the globe.","stop_reason":"end_of_turn","tool_calls":[]},"logprobs":null}
```

Same trace id in Jaeger and sqlite:

![Screenshot 2025-03-21 at 2 51
53 PM](https://github.com/user-attachments/assets/38cc04b0-568c-4b9d-bccd-d3b90e581c27)
![Screenshot 2025-03-21 at 2 52
38 PM](https://github.com/user-attachments/assets/722383ad-6305-4020-8a1c-6cfdf381c25f)
2025-03-21 15:41:26 -07:00
ehhuang
b9fbfed216
chore(telemetry): remove service_name entirely (#1755)
# What does this PR do?


## Test Plan

LLAMA_STACK_CONFIG=dev pytest -s -v
tests/integration/agents/test_agents.py::test_custom_tool
--safety-shield meta-llama/Llama-Guard-3-8B --text-model
accounts/fireworks/models/llama-v3p1-8b-instruct

and verify trace in jaeger UI
https://llama-stack.readthedocs.io/en/latest/building_applications/telemetry.html#
2025-03-21 15:11:56 -07:00
Xi Yan
baf68c665c
fix: fix jobs api literal return type (#1757)
# What does this PR do?

- We cannot directly return a literal type

> Note: this is not the final jobs API change


## Test Plan
<img width="837" alt="image"
src="https://github.com/user-attachments/assets/18a17561-35f9-443d-987d-54afdd6ff40c"
/>


2025-03-21 14:04:21 -07:00
Ashwin Bharambe
d6887f46c6 fix: a couple of tests were broken and not yet exercised by our per-PR test workflow 2025-03-21 12:12:14 -07:00
ehhuang
34f89bfbd6
feat(telemetry): use zero-width space to avoid clutter (#1754)
# What does this PR do?
Before 
<img width="858" alt="image"
src="https://github.com/user-attachments/assets/6cefb1ae-5603-4818-85ea-a0c337b986bc"
/>

Note the redundant 'llama-stack' in front of every span

## Test Plan
<img width="1171" alt="image"
src="https://github.com/user-attachments/assets/bdc5fd5b-ff1f-4f10-8b40-cff2ea93dd1f"
/>
2025-03-21 12:02:10 -07:00
Derek Higgins
00917ef5b2
fix: Add 'accelerate' dependency to 'prompt-guard' (#1724)
Required to start up a distribution with prompt-guard

Closes: #1723

## Test Plan
distribution starts with patch applied

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-03-21 07:37:20 -07:00
Ashwin Bharambe
03b5c61bfc
feat: make sure agent sessions are under access control (#1737)
This builds on top of #1703.

Agent sessions are now properly access controlled.

## Test Plan

Added unit tests
2025-03-21 07:31:16 -07:00
Dinesh Yeduguru
6104bd06a0
feat: add different sinks for otel traces and metrics (#1731)
# What does this PR do?
Since we now record and export metrics, we can no longer use a single OTEL
endpoint to export both traces and metrics. This PR adds two sinks,
OTEL_TRACE and OTEL_METRIC, so that the exporters can be enabled
selectively.

## Test Plan
Start server with OTEL_TRACE as sink and verify traces show up in jaeger
![Screenshot 2025-03-20 at 3 12
25 PM](https://github.com/user-attachments/assets/51007f28-b5ed-4853-912a-965a5cfe83af)
2025-03-20 15:51:41 -07:00
Ihar Hrachyshka
515c16e352
chore: mypy violations cleanup for inline::{telemetry,tool_runtime,vector_io} (#1711)
# What does this PR do?

Clean up mypy violations for inline::{telemetry,tool_runtime,vector_io}.
This also makes the API accept a tool call result without any content
(which the RAG tool may already produce).

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-20 10:01:10 -07:00
Botao Chen
f369871083
feat: [New Eval Benchmark] IfEval (#1708)
# What does this PR do?
In this PR, we add a new open eval benchmark, IfEval, based on the paper
https://arxiv.org/abs/2311.07911, to measure the model's capability for
instruction following.


## Test Plan
spin up a llama stack server with open-benchmark template

run `llama-stack-client --endpoint xxx eval run-benchmark
"meta-reference-ifeval" --model-id "meta-llama/Llama-3.3-70B-Instruct"
--output-dir "/home/markchen1015/" --num-examples 20` on client side and
get the eval aggregate results
2025-03-19 16:39:59 -07:00
yyymeta
d117bfe597
feat: [new open benchmark] DocVQA (#1647)
# What does this PR do?
DocVQA asks the model to look at a picture, then answer a question given in
text with a text answer, using the textual information in the picture. These
questions often require understanding the relative positions of text within
the picture.

The original dataset is defined in "Task1" of
https://www.docvqa.org/datasets


## Test Plan
Set up the llama server with

```
llama stack run ./llama_stack/templates/open-benchmark/run.yaml
```


then send traffic:

```
 llama-stack-client eval run-benchmark "meta-reference-docvqa"  --model-id   meta-llama/Llama-3.3-70B-Instruct     --output-dir /tmp/gpqa    --num-examples   200
```
2025-03-19 14:56:14 -07:00
Xi Yan
c1d18283d2
feat(eval api): (2.2/n) delete eval / scoring / scoring_fn apis (#1700)
# What does this PR do?
- To make it easier, delete the existing `eval/scoring/scoring_function`
APIs. There will be a bunch of broken impls here. The sequence is:
1. migrate benchmark graders
2. clean up existing scoring functions

- Add a skeleton evaluation impl to make tests pass. 

## Test Plan
tested in following PRs

2025-03-19 11:04:23 -07:00
Derek Higgins
6949bd1999
fix: Call pandas.read_* in a separate thread (#1698)
These block on IO reads, which in turn block the
server. Move them to their own thread.

Closes: #1697

# What does this PR do?
To avoid blocking the main event loop, this updates datasetio/localfs to load
data in a separate thread
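
A minimal sketch of the offloading pattern (illustrative, not the exact provider code):

```python
import asyncio

import pandas as pd


async def load_rows(path: str) -> list[dict]:
    # pandas.read_csv blocks on file IO, so run it in a worker thread
    # instead of on the server's event loop.
    df = await asyncio.to_thread(pd.read_csv, path)
    return df.to_dict(orient="records")
```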

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-03-19 10:46:37 -07:00
Hardik Shah
65ca85ba6b
fix: Updating ToolCall.arguments to allow for json strings that can be decoded on client side (#1685)
### What does this PR do?

Currently, `ToolCall.arguments` is a `Dict[str, RecursiveType]`.
However, on the client SDK side -- the `RecursiveType` gets deserialized
into a number ( both int and float get collapsed ) and hence when params
are `int` they get converted to float which might break client side
tools that might be doing type checking.

Closes: https://github.com/meta-llama/llama-stack/issues/1683
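
A small sketch of why a JSON string round-trips cleanly where the generic recursive type does not (plain Python, illustrative):

```python
import json

# Server side: keep the arguments as a JSON string so int/float distinctions survive.
arguments = {"count": 3, "threshold": 0.5}
arguments_json = json.dumps(arguments)

# Client side: decode only when invoking the tool; 3 stays an int, 0.5 stays a float.
decoded = json.loads(arguments_json)
assert isinstance(decoded["count"], int)
assert isinstance(decoded["threshold"], float)
```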

### Test Plan
Stainless changes --
https://github.com/meta-llama/llama-stack-client-python/pull/204
```
pytest -s -v --stack-config=fireworks tests/integration/agents/test_agents.py  --text-model meta-llama/Llama-3.1-8B-Instruct
```
2025-03-19 10:36:19 -07:00
yyymeta
b79e0435de
fix: avoid tensor memory error (#1688)
# What does this PR do?

We randomly get errors like the following; it's most likely due to
accessing an object that has already been deallocated:

```

E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Traceback (most recent call last):
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     fn(i, *args)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 611, in _wrap
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     ret = record(fn)(*args_)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return f(*args, **kwargs)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/internal-llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 249, in worker_process_entrypoint
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     task = req_gen.send(result)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/internal-llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 156, in retrieve_requests
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     torch.distributed.broadcast_object_list(
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return func(*args, **kwargs)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3504, in broadcast_object_list
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     object_list[i] = _tensor_to_object(obj_view, obj_size, group)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2961, in _tensor_to_object
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return _unpickler(io.BytesIO(buf)).load()
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] EOFError: Ran out of input
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
Process SpawnProcess-1:
Traceback (most recent call last):
```

## Test Plan
start server

```
llama-stack-client eval run-benchmark mmmu_v1  --model-id meta-llama/Llama-4-17B-Omni-Instruct  --output-dir /tmp/mmmu_standard --num-examples 30
```

2025-03-18 16:17:29 -07:00
Ihar Hrachyshka
0cbb7f7f21
chore: fix mypy violations in post_training modules (#1548)
# What does this PR do?

Fixes a bunch of violations.

Note: this patch touches all files but post_training.py that will be
significantly changed by #1437, hence leaving it out of the picture for
now.


## Test Plan

Testing with https://github.com/meta-llama/llama-stack/pull/1543

Also checked that GPU training works with the change:

```
INFO:     ::1:53316 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 200 OK
INFO:     ::1:53316 - "GET /v1/post-training/job/status?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK
INFO:     ::1:53316 - "GET /v1/post-training/job/artifacts?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK
21:24:01.161 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (32526.75ms)
 21:23:28.769 [DEBUG] Setting manual seed to local seed 3918872849. Local seed is seed + rank = 3918872849 + 0
 21:23:28.996 [INFO] Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights.
 21:23:29.933 [INFO] Memory stats after model init:
        GPU peak memory allocation: 6.05 GiB
        GPU peak memory reserved: 6.10 GiB
        GPU peak memory active: 6.05 GiB
 21:23:29.934 [INFO] Model is initialized with precision torch.bfloat16.
 21:23:30.115 [INFO] Tokenizer is initialized.
 21:23:30.118 [INFO] Optimizer is initialized.
 21:23:30.119 [INFO] Loss is initialized.
 21:23:30.896 [INFO] Dataset and Sampler are initialized.
 21:23:30.898 [INFO] Learning rate scheduler is initialized.
 21:23:31.618 [INFO] Memory stats after model init:
        GPU peak memory allocation: 6.24 GiB
        GPU peak memory reserved: 6.30 GiB
        GPU peak memory active: 6.24 GiB
 21:23:31.620 [INFO] Starting checkpoint save...
 21:23:59.428 [INFO] Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
 21:23:59.445 [INFO] Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth

```


Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-18 14:58:16 -07:00
Sarthak Deshpande
5ece262976
chore: Make code interpreter async (#1654)
# What does this PR do?
 Made the code interpreter tool call async so that it's non-blocking

## Test Plan
pytest -s -v tests/integration/agents/test_agents.py
--stack-config=together --text-model=meta-llama/Llama-3.3-70B-Instruct
<img width="1693" alt="image"
src="https://github.com/user-attachments/assets/42520bb6-7acf-42d5-b71f-b35ca149d722"
/>



Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
2025-03-18 14:13:46 -07:00
Daniele Martinoli
cca9bd6cc3
feat: Qdrant inline provider (#1273)
# What does this PR do?
Removed the local execution option from the remote Qdrant provider and
introduced an explicit inline provider for embedded execution.
Updated the ollama template to include this option; this part can be
reverted in case we don't want two default `vector_io` providers.

(Closes #1082)

## Test Plan
Build and run an ollama distro:
```bash
llama stack build --template ollama --image-type conda
llama stack run --image-type conda ollama
```

Run one of the sample ingestion applications like
[rag_with_vector_db.py](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/rag_with_vector_db.py),
but replace this line:
```py
    selected_vector_provider = vector_providers[0]
```
with the following, to use the `qdrant` provider:
```py
    selected_vector_provider = vector_providers[1]
```

After running the test code, verify the timestamp of the Qdrant store:
```bash
% ls -ltr ~/.llama/distributions/ollama/qdrant.db/collection/test_vector_db_*
total 784
-rw-r--r--@ 1 dmartino  staff  401408 Feb 26 10:07 storage.sqlite
```


---------

Signed-off-by: Daniele Martinoli <dmartino@redhat.com>
Co-authored-by: Francisco Arceo <farceo@redhat.com>
2025-03-18 14:04:21 -07:00
Matthew Farrellee
706b4ca651
feat: support nvidia hosted vision models (llama 3.2 11b/90b) (#1278)
# What does this PR do?

Support NVIDIA-hosted Llama 3.2 11B/90B vision models. They are not hosted
on the common https://integrate.api.nvidia.com/v1; they are hosted on their
own individual URLs.
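
A hypothetical sketch of the routing idea (the endpoints below are placeholders, not the real NVIDIA URLs):

```python
# Placeholder endpoints; the real per-model URLs are configured in the provider.
VISION_MODEL_URLS = {
    "meta/llama-3.2-11b-vision-instruct": "https://example-11b-vision.api.nvidia.com/v1",
    "meta/llama-3.2-90b-vision-instruct": "https://example-90b-vision.api.nvidia.com/v1",
}


def base_url_for(model_id: str) -> str:
    # Vision models get their own endpoints; everything else uses the common one.
    return VISION_MODEL_URLS.get(model_id, "https://integrate.api.nvidia.com/v1")
```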

## Test Plan

`LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -s -v
tests/client-sdk/inference/test_vision_inference.py
--inference-model=meta/llama-3.2-11b-vision-instruct -k image`
2025-03-18 11:54:10 -07:00
Luis Tomas Bolivar
168cbcbb92
fix: Add the option to not verify SSL at remote-vllm provider (#1585)
# What does this PR do?
Add the option to not verify SSL certificates for the remote-vllm
provider. This allows the Llama Stack server to talk to remote LLMs that
have self-signed certificates.
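
As a sketch of what disabling verification amounts to under the hood (assuming an httpx-based client; the provider's actual config knob may be named differently):

```python
import httpx

# verify=False skips certificate validation, allowing self-signed certs.
client = httpx.Client(base_url="https://my-vllm-host:8000/v1", verify=False)
response = client.get("/models")
print(response.json())
```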

Partially addresses  #1545
2025-03-18 09:33:35 -04:00
ehhuang
37f155e41d
feat(agent): support multiple tool groups (#1556)
Summary:
closes #1488 

Test Plan:
added new integration test
```
LLAMA_STACK_CONFIG=dev pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model openai/gpt-4o-mini
```
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1556).
* __->__ #1556
* #1550
2025-03-17 22:13:09 -07:00
ehhuang
c23a7af5d6
fix: agents with non-llama model (#1550)
# Summary:
Includes fixes to get test_agents working with an OpenAI model, e.g. tool
parsing and message conversion

# Test Plan:
```
LLAMA_STACK_CONFIG=dev pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model openai/gpt-4o-mini
```

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1550).
* #1556
* __->__ #1550
2025-03-17 22:11:06 -07:00
Xi Yan
5287b437ae
feat(api): (1/n) datasets api clean up (#1573)
## PR Stack
- https://github.com/meta-llama/llama-stack/pull/1573
- https://github.com/meta-llama/llama-stack/pull/1625
- https://github.com/meta-llama/llama-stack/pull/1656
- https://github.com/meta-llama/llama-stack/pull/1657
- https://github.com/meta-llama/llama-stack/pull/1658
- https://github.com/meta-llama/llama-stack/pull/1659
- https://github.com/meta-llama/llama-stack/pull/1660

**Client SDK**
- https://github.com/meta-llama/llama-stack-client-python/pull/203

**CI**
- 1391130488
<img width="1042" alt="image"
src="https://github.com/user-attachments/assets/69636067-376d-436b-9204-896e2dd490ca"
/>
-- the test_rag_agent_with_attachments is flaky and not related to this
PR

## Doc
<img width="789" alt="image"
src="https://github.com/user-attachments/assets/b88390f3-73d6-4483-b09a-a192064e32d9"
/>


## Client Usage
```python
client.datasets.register(
    source={
        "type": "uri",
        "uri": "lsfs://mydata.jsonl",
    },
    schema="jsonl_messages",
    # optional 
    dataset_id="my_first_train_data"
)

# quick prototype debugging
client.datasets.register(
    data_reference={
        "type": "rows",
        "rows": [
                "messages": [...],
        ],
    },
    schema="jsonl_messages",
)
```

## Test Plan
- CI:
1387805545

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py
```

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring/test_scoring.py
```

```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
2025-03-17 16:55:45 -07:00
cdgamarose-nv
252a487085
feat: added nvidia as safety provider (#1248)
# What does this PR do?
Adds NVIDIA as a safety provider by interfacing with the NeMo Guardrails
microservice.
This enables checking the user's input or the LLM's output against input and
output guardrails by using the `/v1/guardrails/checks` endpoint of the
[guardrails API](https://developer.nvidia.com/docs/nemo-microservices/guardrails/source/guides/checks-guide.html).

## Test Plan
Deploy nemo guardrails service following the documentation:
https://developer.nvidia.com/docs/nemo-microservices/guardrails/source/getting-started/deploy-docker.html

### Standalone:
```bash
(venv) local-cdgamarose@a1u1g-rome-0153:~/llama-stack$ pytest -v -s llama_stack/providers/tests/safety/test_safety.py --providers inference=nvidia,safety=nvidia --safety-shield meta/llama-3.1-8b-instruct

=================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /localhome/local-cdgamarose/llama-stack/venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.10.12', 'Platform': 'Linux-5.15.0-122-generic-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0', 'html': '4.1.1'}}
rootdir: /localhome/local-cdgamarose/llama-stack
configfile: pyproject.toml
plugins: metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0, html-4.1.1
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items

llama_stack/providers/tests/safety/test_safety.py::TestSafety::test_shield_list[--inference=nvidia:safety=nvidia] Initializing NVIDIASafetyAdapter(http://0.0.0.0:7331)...
PASSED
llama_stack/providers/tests/safety/test_safety.py::TestSafety::test_run_shield[--inference=nvidia:safety=nvidia] PASSED

============================================================================== 2 passed, 2 warnings in 4.78s ==============================================================================

```
### Distribution:
```
llama stack run llama_stack/templates/nvidia/run-with-safety.yaml
curl -v -X 'POST' "http://localhost:8321/v1/safety/run-shield" -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"shield_id": "meta/llama-3.1-8b-instruct", "messages":[{"role": "user", "content": "you are stupid"}]}'
{"violation":{"violation_level":"error","user_message":"Sorry I cannot do this.","metadata":{"self check input":{"status":"blocked"}}}}
```


---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-17 14:39:23 -07:00
yyymeta
fb418813fc
fix: passthrough impl response.content.text (#1665)
# What does this PR do?
The current passthrough impl returns chatcompletion_message.content as a
TextItem(), not a straight string, so it's not compatible with other
providers and causes parsing errors downstream.

Change away from the generic pydantic conversion, and explicitly parse
out content.text.
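
A sketch of the explicit parse (illustrative; the attribute names follow the description above):

```python
def normalize_content(content) -> str:
    # TextItem-like objects carry the string under .text; plain strings pass through.
    if isinstance(content, str):
        return content
    if hasattr(content, "text"):
        return content.text
    raise TypeError(f"unexpected content type: {type(content)}")
```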

## Test Plan

setup llama server with passthrough

```
llama-stack-client eval run-benchmark "MMMU_Pro_standard"   --model-id    meta-llama/Llama-3-8B   --output-dir /tmp/   --num-examples 20
```
works without parsing error
2025-03-17 13:42:08 -07:00
yyymeta
a626b7bce3
feat: [new open benchmark] BFCL_v3 (#1578)
# What does this PR do?
Create a new dataset, BFCL_v3, from
https://gorilla.cs.berkeley.edu/blogs/13_bfcl_v3_multi_turn.html

Overall, each question asks the model to perform a task described in
natural language, and additionally a set of available functions and their
schemas is given for the model to choose from. The model is required to
write the function call, including function name and parameters, to
achieve the stated purpose. The results are validated against the provided
ground truth to make sure that the generated function call and the
ground-truth function call are syntactically and semantically equivalent,
by checking their ASTs.



## Test Plan

Start the server with

```
llama stack run ./llama_stack/templates/ollama/run.yaml
```

then send traffic
```
 llama-stack-client eval run-benchmark "bfcl"  --model-id   meta-llama/Llama-3.2-3B-Instruct    --output-dir /tmp/gpqa    --num-examples   2
```




2025-03-14 12:50:49 -07:00
Sébastien Han
98b1b15e0f
refactor: move all datetime.now() calls to UTC (#1589)
# What does this PR do?

Updated all instances of datetime.now() to use timezone.utc for
consistency in handling time across different systems. This ensures that
timestamps are always in Coordinated Universal Time (UTC), avoiding
issues with time zone discrepancies and promoting uniformity in
time-related data.
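
In short, the change swaps naive timestamps for timezone-aware UTC ones:

```python
from datetime import datetime, timezone

# Naive local time (system-dependent):
# datetime.now()

# Timezone-aware UTC timestamp, consistent across systems:
now_utc = datetime.now(timezone.utc)
print(now_utc.isoformat())
```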

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-13 15:34:53 -07:00
Ashwin Bharambe
d072b5fa0c
test: add unit test to ensure all config types are instantiable (#1601) 2025-03-12 22:29:58 -07:00
ehhuang
a505bf45a3
feat(api): remove tool_name from ToolResponseMessage (#1599)
Summary:
This is not used anywhere.

closes #1421 

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct --record-responses
2025-03-12 19:41:48 -07:00
ehhuang
6bfcb65343
test: code exec on mac (#1549)
Summary:
1. Adds an option to not use bwrap for code execution
2. Disables bwrap when running tests on Macs

Test Plan:
```
LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/integration/agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct
```

Verify code_interpreter result in logs

INFO 2025-03-11 08:10:39,858
llama_stack.providers.inline.agents.meta_reference.agent_instance:1032
agents: tool
call code_interpreter completed with result:
content='completed\n\n541\n' error_message=None error_code=None
         metadata=None
2025-03-12 19:21:53 -07:00
ehhuang
ed6caead72
chore: simplify _get_tool_defs (#1384)
Summary:

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-12 18:51:18 -07:00
ehhuang
41c9bca1aa
chore: refactor Agent toolgroup processing (#1381)
Summary:
Refactoring only.

Centralize the logic to preprocess toolgroups in one place.

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/api/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B
---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/meta-llama/llama-stack/pull/1381).
* #1384
* __->__ #1381
2025-03-12 18:48:03 -07:00
ehhuang
b7a9c45477
chore: deprecate ToolResponseMessage in agent.resume API (#1566)
# Summary:
closes #1431 

# Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-12 12:10:21 -07:00
Dinesh Yeduguru
58d08d100e
feat: Add back inference metrics and preserve context variables across asyncio boundary (#1552)
# What does this PR do?
This PR adds back the changes in #1300, which were reverted in #1476.

It also adds logic to preserve context variables across the asyncio
boundary. This is needed with the library client, since the async
generator logic yields control to code outside the event loop and, on
resuming, does not have the same context as before; this requires
preserving the context vars.

address #1477 
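
A rough sketch of the general technique (illustrative, not the exact implementation): wrap the async generator and re-set the saved context vars each time it resumes.

```python
import contextvars
from typing import AsyncIterator, TypeVar

T = TypeVar("T")


async def preserve_contexts(
    gen: AsyncIterator[T], context_vars: list[contextvars.ContextVar]
) -> AsyncIterator[T]:
    # Capture the caller's values once, before iteration starts.
    saved = {var: var.get() for var in context_vars}
    while True:
        try:
            item = await gen.__anext__()
        except StopAsyncIteration:
            return
        yield item
        # After resuming from the yield, the surrounding context may have changed
        # (e.g. when driven by the library client), so restore the saved values.
        for var, value in saved.items():
            var.set(value)
```
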
## Test Plan


```
 curl --request POST \
  --url http://localhost:8321/v1/inference/chat-completion \
  --header 'content-type: application/json' \
  --data '{
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": {
        "type": "text",
        "text": "where do humans live"
      }
    }
  ],
  "stream": false
}' | jq .

{
  "metrics": [
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549084Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "prompt_tokens",
      "value": 10,
      "unit": "tokens"
    },
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549449Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "completion_tokens",
      "value": 369,
      "unit": "tokens"
    },
    {
      "trace_id": "kCZwO3tyQC-FuAGb",
      "span_id": "bsP_5a5O",
      "timestamp": "2025-03-11T16:47:38.549457Z",
      "attributes": {
        "model_id": "meta-llama/Llama-3.1-70B-Instruct",
        "provider_id": "fireworks"
      },
      "type": "metric",
      "metric": "total_tokens",
      "value": 379,
      "unit": "tokens"
    }
  ],
  "completion_message": {
    "role": "assistant",
    "content": "Humans live on the planet Earth, specifically on its landmasses and in its oceans. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica ( temporary residents, mostly scientists and researchers)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near coastlines, rivers, or other bodies of water.\n4. **Rural areas:** Some humans live in rural areas, such as villages, farms, and countryside.\n5. **Islands:** Humans inhabit many islands around the world, including those in the Pacific, Indian, and Atlantic Oceans.\n6. **Mountains and highlands:** Humans live in mountainous regions, such as the Himalayas, the Andes, and the Rocky Mountains.\n7. **Deserts:** Some humans live in desert regions, such as the Sahara, the Mojave, and the Atacama.\n8. **Coastal areas:** Many humans live in coastal areas, such as beaches, ports, and coastal cities.\n9. **Underwater habitats:** A few humans live in underwater habitats, such as research stations and submarines.\n10. **Space:** A small number of humans have lived in space, including astronauts on the International Space Station and those who have visited the Moon.\n\nOverall, humans can be found living in almost every environment on Earth, from the frozen tundra to the hottest deserts, and from the highest mountains to the deepest oceans.",
    "stop_reason": "end_of_turn",
    "tool_calls": []
  },
  "logprobs": null
}

```

Original repro no longer shows any error:
```
LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
python -m examples.agents.e2e_loop_with_client_tools localhost 8321
```

client logs:
https://gist.github.com/dineshyv/047c7e87b18a5792aa660e311ea53166
server logs:
https://gist.github.com/dineshyv/97a2174099619e9916c7c490be26e559
2025-03-12 12:01:03 -07:00
Botao Chen
90ca4d94de
fix: fix passthrough inference provider to make it work for agent (#1577)
## What does this PR do?
We noticed that the passthrough inference provider doesn't work with the
agent due to a type mismatch between client and server. We manually cast
the llama stack client type to the llama stack server type to fix the issue.
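
A minimal sketch of that kind of cast between pydantic models (the type names here are hypothetical stand-ins):

```python
from pydantic import BaseModel


class ClientToolCall(BaseModel):
    tool_name: str
    arguments: dict


class ServerToolCall(BaseModel):
    tool_name: str
    arguments: dict


def to_server_type(client_obj: ClientToolCall) -> ServerToolCall:
    # Re-validate the client payload against the server-side model.
    return ServerToolCall(**client_obj.model_dump())
```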

## test 
run `python -m examples.agents.hello localhost 8321` within
llama-stack-apps

<img width="1073" alt="Screenshot 2025-03-11 at 8 43 44 PM"
src="https://github.com/user-attachments/assets/bd1bdd31-606a-420c-a249-95f6184cc0b1"
/>

fix https://github.com/meta-llama/llama-stack/issues/1560
2025-03-12 11:16:17 -07:00
Botao Chen
0b0be70605
feat: Add open benchmark template codegen (#1579)
## What does this PR do?

As title, add codegen for open-benchmark template

## test 

Checked the newly generated run.yaml file and it's identical before and
after the change.

Also adds a small improvement to the together template so that a missing
TOGETHER_API_KEY won't crash the server, which keeps the user experience
consistent with other remote providers.
2025-03-12 11:12:08 -07:00