Commit graph

43 commits

Author SHA1 Message Date
Rashmi Pawar
1a73f8305b
feat: Add nemo customizer (#1448)
# What does this PR do?

This PR adds support for NVIDIA's NeMo Customizer API to the Llama Stack
post-training module. The integration enables users to fine-tune models
using NVIDIA's cloud-based customization service through a consistent
Llama Stack interface.


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
Yet to be done

Things pending under this PR:

- [x] Integration of fine-tuned model(new checkpoint) for inference with
nvidia llm distribution
- [x] distribution integration of API
- [x] Add test cases for customizer(In Progress)
- [x] Documentation

```

LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/post_training/test_supervised_fine_tuning.py 

============================================================================================================================================================================ test session starts =============================================================================================================================================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0 -- /home/ubuntu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.0', 'Platform': 'Linux-6.8.0-1021-gcp-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'nbval': '0.11.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3'}}
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: nbval-0.11.0, metadata-3.1.1, anyio-4.8.0, html-4.1.1, asyncio-0.25.3
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items                                                                                                                                                                                                                                                                                                                                                            

tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_post_training_provider_registration[txt=8B] PASSED                                                                                                                                                                                                                                                 [ 50%]
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_list_training_jobs[txt=8B] PASSED                                                                                                                                                                                                                                                                  [100%]

======================================================================================================================================================================== 2 passed, 1 warning in 0.10s ========================================================================================================================================================================
```
cc: @mattf @dglogo @sumitb

---------

Co-authored-by: Ubuntu <ubuntu@llama-stack-customizer-dev-inst-2tx95fyisatvlic4we8hidx5tfj.us-central1-a.c.brevdevprod.internal>
2025-03-25 11:01:10 -07:00
Yuan Tang
441016bee8
feat: Support "stop" parameter in remote:vLLM (#1715)
# What does this PR do?

This adds support for "stop" parameter:
https://platform.openai.com/docs/api-reference/completions/create#completions-create-stop

## Test Plan

```
tests/integration/inference/test_text_inference.py::test_text_completion_non_streaming[txt=8B-inference:completion:sanity] PASSED                                  [  5%]
tests/integration/inference/test_text_inference.py::test_text_completion_streaming[txt=8B-inference:completion:sanity] PASSED                                      [ 11%]
tests/integration/inference/test_text_inference.py::test_text_completion_stop_sequence[txt=8B-inference:completion:stop_sequence] PASSED                           [ 16%]
tests/integration/inference/test_text_inference.py::test_text_completion_log_probs_non_streaming[txt=8B-inference:completion:log_probs] PASSED                     [ 22%]
tests/integration/inference/test_text_inference.py::test_text_completion_log_probs_streaming[txt=8B-inference:completion:log_probs] PASSED                         [ 27%]
tests/integration/inference/test_text_inference.py::test_text_completion_structured_output[txt=8B-inference:completion:structured_output] PASSED                   [ 33%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=8B-inference:chat_completion:non_streaming_01] PASSED              [ 38%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_non_streaming[txt=8B-inference:chat_completion:non_streaming_02] PASSED              [ 44%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_first_token_profiling[txt=8B-inference:chat_completion:ttft] ^TPASSED                  [ 50%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_streaming[txt=8B-inference:chat_completion:streaming_01] PASSED                      [ 55%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_streaming[txt=8B-inference:chat_completion:streaming_02] PASSED                      [ 61%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[txt=8B-inference:chat_completion:tool_calling] PASSED [ 66%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[txt=8B-inference:chat_completion:tool_calling] PASSED [ 72%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_required[txt=8B-inference:chat_completion:tool_calling] PASSED      [ 77%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_with_tool_choice_none[txt=8B-inference:chat_completion:tool_calling] PASSED          [ 83%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_structured_output[txt=8B-inference:chat_completion:structured_output] PASSED         [ 88%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-True] PASSED [ 94%]
tests/integration/inference/test_text_inference.py::test_text_chat_completion_tool_calling_tools_not_in_request[txt=8B-inference:chat_completion:tool_calling_tools_absent-False] PASSED [100%]

=============================================================== 18 passed, 3 warnings in 755.79s (0:12:35) ===============================================================
```

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-24 12:42:55 -07:00
Xi Yan
baf68c665c
fix: fix jobs api literal return type (#1757)
# What does this PR do?

- We cannot directly return a literal type

> Note: this is not final jobs API change

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
<img width="837" alt="image"
src="https://github.com/user-attachments/assets/18a17561-35f9-443d-987d-54afdd6ff40c"
/>


[//]: # (## Documentation)
2025-03-21 14:04:21 -07:00
Ihar Hrachyshka
515c16e352
chore: mypy violations cleanup for inline::{telemetry,tool_runtime,vector_io} (#1711)
# What does this PR do?

Clean up mypy violations for inline::{telemetry,tool_runtime,vector_io}.
This also makes API accept a tool call result without any content (like
RAG tool already may produce).

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-20 10:01:10 -07:00
Ihar Hrachyshka
5403582582
fix: Restore discriminator for AlgorithmConfig (#1706) 2025-03-20 07:33:26 -07:00
Charlie Doern
a483a58c6e
chore: deprecate /v1/inspect/providers (#1678)
# What does this PR do?

with the new /v1/providers API, /v1/inspect/providers is duplicative,
deprecate it by removing the route, and add a test for the full
/v1/providers API

resolves #1623 

## Test Plan

`uv run pytest -v tests/integration/providers --stack-config=ollama
--text-model="meta-llama/Llama-3.2-3B-Instruct"
--embedding-model=all-MiniLM-L6-v2`

<img width="1512" alt="Screenshot 2025-03-18 at 9 18 38 AM"
src="https://github.com/user-attachments/assets/2db30f25-3ff6-4374-b39d-0047f093fe36"
/>

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-19 20:27:06 -07:00
Botao Chen
f369871083
feat: [New Eval Benchamark] IfEval (#1708)
# What does this PR do?
In this PR, we added a new eval open benchmark IfEval based on paper
https://arxiv.org/abs/2311.07911 to measure the model capability of
instruction following.


## Test Plan
spin up a llama stack server with open-benchmark template

run `llama-stack-client --endpoint xxx eval run-benchmark
"meta-reference-ifeval" --model-id "meta-llama/Llama-3.3-70B-Instruct"
--output-dir "/home/markchen1015/" --num-examples 20` on client side and
get the eval aggregate results
2025-03-19 16:39:59 -07:00
Hardik Shah
65ca85ba6b
fix: Updating ToolCall.arguments to allow for json strings that can be decoded on client side (#1685)
### What does this PR do?

Currently, `ToolCall.arguments` is a `Dict[str, RecursiveType]`.
However, on the client SDK side -- the `RecursiveType` gets deserialized
into a number ( both int and float get collapsed ) and hence when params
are `int` they get converted to float which might break client side
tools that might be doing type checking.

Closes: https://github.com/meta-llama/llama-stack/issues/1683

### Test Plan
Stainless changes --
https://github.com/meta-llama/llama-stack-client-python/pull/204
```
pytest -s -v --stack-config=fireworks tests/integration/agents/test_agents.py  --text-model meta-llama/Llama-3.1-8B-Instruct
```
2025-03-19 10:36:19 -07:00
ehhuang
113f3a259c
docs: add documentation for RAGDocument (#1693)
# What does this PR do?


## Test Plan
2025-03-19 10:16:00 -07:00
Ihar Hrachyshka
0cbb7f7f21
chore: fix mypy violations in post_training modules (#1548)
# What does this PR do?

Fixes a bunch of violations.

Note: this patch touches all files but post_training.py that will be
significantly changed by #1437, hence leaving it out of the picture for
now.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Testing with https://github.com/meta-llama/llama-stack/pull/1543

Also checked that GPU training works with the change:

```
INFO:     ::1:53316 - "POST /v1/post-training/supervised-fine-tune HTTP/1.1" 200 OK
INFO:     ::1:53316 - "GET /v1/post-training/job/status?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK
INFO:     ::1:53316 - "GET /v1/post-training/job/artifacts?job_uuid=test-jobb5ca2d84-d541-42f8-883b-762828b4c0e7 HTTP/1.1" 200 OK
21:24:01.161 [END] /v1/post-training/supervised-fine-tune [StatusCode.OK] (32526.75ms)
 21:23:28.769 [DEBUG] Setting manual seed to local seed 3918872849. Local seed is seed + rank = 3918872849 + 0
 21:23:28.996 [INFO] Identified model_type = Llama3_2. Ignoring output.weight in checkpoint in favor of the tok_embedding.weight tied weights.
 21:23:29.933 [INFO] Memory stats after model init:
        GPU peak memory allocation: 6.05 GiB
        GPU peak memory reserved: 6.10 GiB
        GPU peak memory active: 6.05 GiB
 21:23:29.934 [INFO] Model is initialized with precision torch.bfloat16.
 21:23:30.115 [INFO] Tokenizer is initialized.
 21:23:30.118 [INFO] Optimizer is initialized.
 21:23:30.119 [INFO] Loss is initialized.
 21:23:30.896 [INFO] Dataset and Sampler are initialized.
 21:23:30.898 [INFO] Learning rate scheduler is initialized.
 21:23:31.618 [INFO] Memory stats after model init:
        GPU peak memory allocation: 6.24 GiB
        GPU peak memory reserved: 6.30 GiB
        GPU peak memory active: 6.24 GiB
 21:23:31.620 [INFO] Starting checkpoint save...
 21:23:59.428 [INFO] Model checkpoint of size 6.43 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/consolidated.00.pth
 21:23:59.445 [INFO] Adapter checkpoint of size 0.00 GB saved to /home/ec2-user/.llama/checkpoints/meta-llama/Llama-3.2-3B-Instruct-sft-0/adapter/adapter.pth

```

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-18 14:58:16 -07:00
Sébastien Han
c029fbcd13
fix: return 4xx for non-existent resources in GET requests (#1635)
# What does this PR do?

- Removed Optional return types for GET methods
- Raised ValueError when requested resource is not found
- Ensures proper 4xx response for missing resources
- Updated the API generator to check for wrong signatures

```
$ uv run --with ".[dev]" ./docs/openapi_generator/run_openapi_generator.sh
Validating API method return types...

API Method Return Type Validation Errors:

Method ScoringFunctions.get_scoring_function returns Optional type
```

Closes: https://github.com/meta-llama/llama-stack/issues/1630

## Test Plan

Run the server then:

```
curl http://127.0.0.1:8321/v1/models/foo     
{"detail":"Invalid value: Model 'foo' not found"}%  
```

Server log:

```
INFO:     127.0.0.1:52307 - "GET /v1/models/foo HTTP/1.1" 400 Bad Request
09:51:42.654 [END] /v1/models/foo [StatusCode.OK] (134.65ms)
 09:51:42.651 [ERROR] Error executing endpoint route='/v1/models/{model_id:path}' method='get'
Traceback (most recent call last):
  File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 193, in endpoint
    return await maybe_await(value)
  File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/server/server.py", line 156, in maybe_await
    return await value
  File "/Users/leseb/Documents/AI/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 102, in async_wrapper
    result = await method(self, *args, **kwargs)
  File "/Users/leseb/Documents/AI/llama-stack/llama_stack/distribution/routers/routing_tables.py", line 217, in get_model
    raise ValueError(f"Model '{model_id}' not found")
ValueError: Model 'foo' not found
```

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-18 14:06:53 -07:00
Xi Yan
5287b437ae
feat(api): (1/n) datasets api clean up (#1573)
## PR Stack
- https://github.com/meta-llama/llama-stack/pull/1573
- https://github.com/meta-llama/llama-stack/pull/1625
- https://github.com/meta-llama/llama-stack/pull/1656
- https://github.com/meta-llama/llama-stack/pull/1657
- https://github.com/meta-llama/llama-stack/pull/1658
- https://github.com/meta-llama/llama-stack/pull/1659
- https://github.com/meta-llama/llama-stack/pull/1660

**Client SDK**
- https://github.com/meta-llama/llama-stack-client-python/pull/203

**CI**
- 1391130488
<img width="1042" alt="image"
src="https://github.com/user-attachments/assets/69636067-376d-436b-9204-896e2dd490ca"
/>
-- the test_rag_agent_with_attachments is flaky and not related to this
PR

## Doc
<img width="789" alt="image"
src="https://github.com/user-attachments/assets/b88390f3-73d6-4483-b09a-a192064e32d9"
/>


## Client Usage
```python
client.datasets.register(
    source={
        "type": "uri",
        "uri": "lsfs://mydata.jsonl",
    },
    schema="jsonl_messages",
    # optional 
    dataset_id="my_first_train_data"
)

# quick prototype debugging
client.datasets.register(
    data_reference={
        "type": "rows",
        "rows": [
                "messages": [...],
        ],
    },
    schema="jsonl_messages",
)
```

## Test Plan
- CI:
1387805545

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/datasets/test_datasets.py
```

```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/integration/scoring/test_scoring.py
```

```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```
2025-03-17 16:55:45 -07:00
Ashwin Bharambe
7b81761a56 fix: update CDN url for stoplight 2025-03-14 15:46:45 -07:00
Xi Yan
33b096cc21
fix: OpenAPI with provider get (#1627)
# What does this PR do?
- https://github.com/meta-llama/llama-stack/pull/1429 introduces
GetProviderResponse in OpenAPI, which is not needed, and not correctly
defined.

cc @cdoern 


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
llama-stack-client providers list
```
<img width="610" alt="image"
src="https://github.com/user-attachments/assets/2f7b62a5-daf2-4bf9-9505-69755c7025fc"
/>


[//]: # (## Documentation)
2025-03-13 19:56:32 -07:00
Charlie Doern
a062723d03
feat: add provider API for listing and inspecting provider info (#1429)
# What does this PR do?

currently the `inspect` API for providers is really a `list` API. Create
a new `providers` API which has a GET `providers/{provider_id}` inspect
API
which returns "user friendly" configuration to the end user. Also add a
GET `/providers` endpoint which returns the list of providers as
`inspect/providers` does today.

This API follows CRUD and is more intuitive/RESTful.

This work is part of the RFC at
https://github.com/meta-llama/llama-stack/pull/1359

sensitive fields are redacted using `redact_sensetive_fields` on the
server side before returning a response:

<img width="456" alt="Screenshot 2025-03-13 at 4 40 21 PM"
src="https://github.com/user-attachments/assets/9465c221-2a26-42f8-a08a-6ac4a9fecce8"
/>


## Test Plan

using https://github.com/meta-llama/llama-stack-client-python/pull/181 a
user is able to to run the following:

`llama stack build --template ollama --image-type venv`
`llama stack run --image-type venv
~/.llama/distributions/ollama/ollama-run.yaml`
`llama-stack-client providers inspect ollama`

<img width="378" alt="Screenshot 2025-03-13 at 4 39 35 PM"
src="https://github.com/user-attachments/assets/8273d05d-8bc3-44c6-9e4b-ef95e48d5466"
/>


also, was able to run the new test_list integration test locally with
ollama:

<img width="1509" alt="Screenshot 2025-03-13 at 11 03 40 AM"
src="https://github.com/user-attachments/assets/9b9db166-f02f-45b0-86a4-306d85149bc8"
/>

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-13 15:07:21 -07:00
ehhuang
0a0d6cb96e
fix: openapi spec gen (#1602)
Summary:

Test Plan:
sh docs/openapi_generator/run_openapi_generator.sh
2025-03-12 21:55:05 -07:00
ehhuang
a505bf45a3
feat(api): remove tool_name from ToolResponseMessage (#1599)
Summary:
This is not used anywhere.

closes #1421 

Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct --record-responses
2025-03-12 19:41:48 -07:00
Dinesh Yeduguru
99bbe0e70b
feat: Add new compact MetricInResponse type (#1593)
# What does this PR do?
This change adds a compact type to include metrics in response as
opposed to the full MetricEvent which is relevant for internal logging
purposes.

## Test Plan
```
LLAMA_STACK_CONFIG=~/.llama/distributions/fireworks/fireworks-run.yaml pytest -s -v agents/test_agents.py --safety-shield meta-llama/Llama-Guard-3-8B --text-model meta-llama/Llama-3.1-8B-Instruct

 llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml

curl --request POST \
  --url http://localhost:8321/v1/inference/chat-completion \
  --header 'content-type: application/json' \
  --data '{
  "model_id": "meta-llama/Llama-3.1-70B-Instruct",
  "messages": [
    {
      "role": "user",
      "content": {
        "type": "text",
        "text": "where do humans live"
      }
    }
  ],
  "stream": false
}'

{
  "metrics": [
    {
      "metric": "prompt_tokens",
      "value": 10,
      "unit": null
    },
    {
      "metric": "completion_tokens",
      "value": 522,
      "unit": null
    },
    {
      "metric": "total_tokens",
      "value": 532,
      "unit": null
    }
  ],
  "completion_message": {
    "role": "assistant",
    "content": "Humans live in various parts of the world...............",
    "stop_reason": "out_of_tokens",
    "tool_calls": []
  },
  "logprobs": null
}
```
2025-03-12 15:45:44 -07:00
ehhuang
b7a9c45477
chore: deprecate ToolResponseMessage in agent.resume API (#1566)
# Summary:
closes #1431 

# Test Plan:
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --text-model
meta-llama/Llama-3.1-8B-Instruct
2025-03-12 12:10:21 -07:00
Sébastien Han
83a2c78615
feat(api): list agents / sessions and get agent (#1410)
# What does this PR do?

Add support for listing agents, describing an agent, and retrieving
session IDs for a given agent. This is only the API definition, the
implementations will come separately.

Closes: https://github.com/meta-llama/llama-stack/issues/1294

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-03-11 10:33:46 -07:00
ehhuang
6cf79437b3
feat: support ClientTool output metadata (#1426)
# Summary:
Client side change in
https://github.com/meta-llama/llama-stack-client-python/pull/180
Changes the resume_turn API to accept `ToolResponse` instead of
`ToolResponseMessage`:
1. `ToolResponse` contains `metadata`
2. `ToolResponseMessage` is a concept for model inputs. Here we are just
submitting the outputs of tool execution.

# Test Plan:
Ran integration tests with newly added test using client tool with
metadata

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/agents/test_agents.py --safety-shield
meta-llama/Llama-Guard-3-8B --record-responses
2025-03-05 14:30:27 -08:00
Xi Yan
3d9331840e
docs: api documentation for agents/eval/scoring/datasets (#1400)
# What does this PR do?

- add some docs to OpenAPI for agents/eval/scoring/datasetio

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
- read

[//]: # (## Documentation)
2025-03-05 09:40:24 -08:00
Xi Yan
e9a37bad63
chore: rename task_config to benchmark_config (#1397)
# What does this PR do?

- This was missed from previous deprecation:
https://github.com/meta-llama/llama-stack/pull/1186
- Part of https://github.com/meta-llama/llama-stack/issues/1396

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
pytest -v -s --nbval-lax ./llama-stack/docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb 
```

[//]: # (## Documentation)
2025-03-04 12:44:04 -08:00
Xi Yan
158b6dc404
chore: deprecate allow_turn_resume (#1377)
# What does this PR do?

- Deprecate allow_turn_resume flag as this is used for staying backward
compat.
- Closes https://github.com/meta-llama/llama-stack/issues/1363

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
LLAMA_STACK_CONFIG=fireworks pytest -v tests/api/agents/test_agents.py --inference-model "meta-llama/Llama-3.3-70B-Instruct" --record-responses
```

<img width="1054" alt="image"
src="https://github.com/user-attachments/assets/d31de2d4-0953-41e1-a71a-7e1579fa351a"
/>


[//]: # (## Documentation)
2025-03-04 12:22:11 -08:00
Ashwin Bharambe
5547ef953c
feat: enhance OpenAPI spec to include Error types (#1320)
# What does this PR do?

An API spec must talk about Error handling. This was a pretty glaring
omission so far. This PR begins to address it by adding a set of
standard error responses we can attach to all our API calls.

At a future point, we can add specific error types where necessary
(although we should not hurry to do that; it is best done very late.)

## Test Plan

Checked that Stainless SDK generation succeeds.
2025-02-28 11:16:12 -08:00
ehhuang
25fddccfd8
feat: tool outputs metadata (#1155)
Summary:

Allows tools to output metadata. This is useful for evaluating tool
outputs, e.g. RAG tool will output document IDs, which can be used to
score recall.

Will need to make a similar change on the client side to support
ClientTool outputting metadata.

Test Plan:

LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/client-sdk/agents/test_agents.py
2025-02-21 13:15:31 -08:00
Xi Yan
0fe071764f
feat(1/n): api: unify agents for handling server & client tools (#1178)
# Problem

Our current Agent framework has discrepancies in definition on how we
handle server side and client side tools.

1. Server Tools: a single Turn is returned including `ToolExecutionStep`
in agenst
2. Client Tools: `create_agent_turn` is called in loop with client agent
lib yielding the agent chunk

ad6ffc63df/src/llama_stack_client/lib/agents/agent.py (L186-L211)

This makes it inconsistent to work with server & client tools. It also
complicates the logs to telemetry to get information about agents turn /
history for observability.

#### Principle
The same `turn_id` should be used to represent the steps required to
complete a user message including client tools.

## Solution

1. `AgentTurnResponseEventType.turn_awaiting_input` status to indicate
that the current turn is not completed, and awaiting tool input
2. `continue_agent_turn` endpoint to update agent turn with client's
tool response.


# What does this PR do?
- Skeleton API as example

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

- Just API update, no functionality change
```
llama stack run + client-sdk test
```

<img width="842" alt="image"
src="https://github.com/user-attachments/assets/7ac56b5f-f424-4632-9476-7e0f57555bc3"
/>


[//]: # (## Documentation)
2025-02-21 11:48:27 -08:00
Ashwin Bharambe
81ce39a607
feat(api): Add options for supporting various embedding models (#1192)
We need to support:
- asymmetric embedding models (#934)
- truncation policies (#933)
- varying dimensional output (#932) 

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```
2025-02-20 22:27:12 -08:00
Ashwin Bharambe
6f9d622340
fix(api): update embeddings signature so inputs and outputs list align (#1161)
See Issue #922 

The change is slightly backwards incompatible but no callsite (in our
client codebases or stack-apps) every passes a depth-2
`List[List[InterleavedContentItem]]` (which is now disallowed.)

## Test Plan

```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

Also ran `tests/client-sdk/inference/test_embeddings.py`
2025-02-20 21:43:13 -08:00
ehhuang
1166afdf76
fix: some telemetry APIs don't currently work (#1188)
Summary:

This bug is surfaced by using the http LS client. The issue is that
non-scalar values in 'GET' method are `body` params in fastAPI, but our
spec generation script doesn't respect that. We fix by just making them
POST method instead.

Test Plan:
Test API call with newly sync'd client
(https://github.com/meta-llama/llama-stack-client-python/pull/149)

<img width="1114" alt="image"
src="https://github.com/user-attachments/assets/7710aca5-d163-4e00-a465-14e6fcaac2b2"
/>
2025-02-20 14:09:25 -08:00
Xi Yan
ea1faae50e
chore!: deprecate eval/tasks (#1186)
# What does this PR do?
- Fully deprecate eval/tasks

[//]: # (If resolving an issue, uncomment and update the line below)
Closes #1088 

NOTE: this will be a breaking change. We have introduced the new API in
0.1.3 .

Notebook has been updated to use the new endpoints.

## Test Plan
```
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb 
```
<img width="611" alt="image"
src="https://github.com/user-attachments/assets/79f6efe1-81ba-494e-bf36-1fc0c2b9bc6f"
/>



cc @SLR722  for awareness

[//]: # (## Documentation)
2025-02-20 14:06:21 -08:00
Vladimir Ivić
f7161611c6
feat: adding endpoints for files and uploads (#1070)
Summary:
Adds spec definitions for file uploads operations.

This API focuses around two high level operations:
* Initiating and managing upload session
* Accessing uploaded file information

Usage examples:

To start a file upload session:
```
curl -X POST https://localhost:8321/v1/files \
-d '{
   "key": "image123.jpg',
   "bucket": "images",
   "mime_type": "image/jpg",
   "size": 12345
}'

# Returns
{
  “id”: <session_id>
  “url”: “https://localhost:8321/v1/files/session:<session_id>”,
  "offset": 0,
  "size": 12345
}

```

To upload file content to an existing session
```
curl -i -X POST "https://localhost:8321/v1/files/session:<session_id> \
  --data-binary @<path_to_local_file>

# Returns
{
  "key": "image123.jpg",
  "bucket": "images",
  "mime_type": "image/jpg",
  "bytes": 12345,
  "created_at": 1737492240
}

# Implementing on server side (Flask example for simplicity):
@app.route('/uploads/{upload_id}', methods=['POST'])
def upload_content_to_session(upload_id):
    try:
        # Get the binary file data from the request body
        file_data = request.data

        # Save the file to disk
        save_path = f"./uploads/{upload_id}"
        with open(save_path, 'wb') as f:
            f.write(file_data)
        return {__uploaded_file_json__}, 200
    except Exception as e:
        return 500

```

To read information about an existing upload session
```
curl -i -X GET "https://localhost:8321/v1/files/session:<session_id>

# Returns
{
  “id”: <session_id>
  “url”: “https://localhost:8321/v1/files/session:<session_id>”,
  "offset": 1024,
  "size": 12345
}
```

To list buckets
```
GET /files

# Returns
{
  "data": [
     {"name": "bucket1"},
     {"name": "bucket2"},
   ]
}
```

To list all files in a bucket
```
GET /files/{bucket}

# Returns
{
  "data": [
    {
      "key": "shiba.jpg",
      "bucket": "dogs",
      "mime_type": "image/jpg",
      "bytes": 82334,
      "created_at": 1737492240,
    },
    {
      "key": "persian_cat.jpg",
      "mime_type": "image/jpg",
      "bucket": "cats",
      "bytes": 39924,
      "created_at": 1727493440,
    },
  ]
}
```

To get specific file info
```
GET /files/{bucket}/{key}

{
  "key": "shiba.jpg",
  "bucket": "dogs",
  "mime_type": "image/jpg",
  "bytes": 82334,
  "created_at": 1737492240,
}

```

To delete specific file
```
DELETE /files/{bucket}/{key}

{
  "key": "shiba.jpg",
  "bucket": "dogs",
  "mime_type": "image/jpg",
  "bytes": 82334,
  "created_at": 1737492240,
}

```
2025-02-20 13:09:00 -08:00
Ashwin Bharambe
034ece0011 Ensure that deprecations for fields follow through to OpenAPI 2025-02-19 13:54:04 -08:00
Ashwin Bharambe
31a5ba5268 Add title to the json schemas 2025-02-19 13:26:39 -08:00
ehhuang
8de7cf103b
feat: support tool_choice = {required, none, <function>} (#1059)
Summary:

titled


Test Plan:

added tests and

LLAMA_STACK_CONFIG=fireworks pytest -s -v tests/client-sdk/
--safety-shield meta-llama/Llama-Guard-3-8B
2025-02-18 23:25:15 -05:00
Yuan Tang
64328bfe62
fix: enable_session_persistence in AgentConfig should be optional (#1012)
# What does this PR do?
This issue was discovered in
https://github.com/meta-llama/llama-stack/pull/1009#discussion_r1947036518.

## Test Plan

This field is no longer required after the change.

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-02-14 09:19:53 -08:00
Xi Yan
da53dc3f5f
fix: openapi for eval-task (#1085)
# What does this PR do?
- as title

## Test Plan
- the deprecated endpoint need to obey what it was before

[//]: # (## Documentation)
2025-02-13 17:10:45 -08:00
Xi Yan
8b655e3cd2
fix!: update eval-tasks -> benchmarks (#1032)
# What does this PR do?

- Update `/eval-tasks` to `/benchmarks`
- ⚠️ Remove differentiation between `app` v.s. `benchmark` eval task
config. Now we only have `BenchmarkConfig`. The overloaded `benchmark`
is confusing and do not add any value. Backward compatibility is being
kept as the "type" is not being used anywhere.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
- This change is backward compatible 
- Run notebook test with

```
pytest -v -s --nbval-lax ./docs/getting_started.ipynb
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```

<img width="846" alt="image"
src="https://github.com/user-attachments/assets/d2fc06a7-593a-444f-bc1f-10ab9b0c843d"
/>



[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Co-authored-by: Ben Browning <ben324@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu <reid201711@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-13 16:40:58 -08:00
Dinesh Yeduguru
d8a20e034b
feat: make telemetry attributes be dict[str,PrimitiveType] (#1055)
# What does this PR do?
Make attributes in telemetry be only primitive types and avoid arbitrary
nesting.

## Test Plan
```
 LLAMA_STACK_DISABLE_VERSION_CHECK=true llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v tests/client-sdk/agents/test_agents.py  -k "test_builtin_tool_web_search"
# Verified that attributes still show up correclty in jaeger
```
2025-02-11 15:10:17 -08:00
Dinesh Yeduguru
ab7f802698
feat: add MetricResponseMixin to chat completion response types (#1050)
# What does this PR do?
Defines a MetricResponseMixin which can be inherited by any response
class. Adds it to chat completion response types.


This is a short term solution to allow inference API to return metrics
The ideal way to do this is to have a way for all response types to
include metrics
and all metric events logged to the telemetry API to be included with
the response
To do this, we will need to augment all response types with a metrics
field.
We have hit a blocker from stainless SDK that prevents us from doing
this.
The blocker is that if we were to augment the response types that have a
data field
in them like so
class ListModelsResponse(BaseModel):
    metrics: Optional[List[MetricEvent]] = None
    data: List[Models]
    ...
The client SDK will need to access the data by using a .data field,
which is not
ergonomic. Stainless SDK does support unwrapping the response type, but
it
requires that the response type to only have a single field.

We will need a way in the client SDK to signal that the metrics are
needed
and if they are needed, the client SDK has to return the full response
type
without unwrapping it.

## Test Plan
sh run_openapi_generator.sh ./
sh stainless_sync.sh dineshyv/dev add-metrics-to-resp-v4

LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/fireworks/fireworks-run.yaml"
pytest -v tests/client-sdk/agents/test_agents.py
2025-02-11 14:58:12 -08:00
ehhuang
3922999118
sys_prompt support in Agent (#938)
# What does this PR do?

The current default system prompt for llama3.2 tends to overindex on
tool calling and doesn't work well when the prompt does not require tool
calling.

This PR adds an option to override the default system prompt, and
organizes tool-related configs into a new config object.

- [ ] Addresses issue (#issue)


## Test Plan


LLAMA_STACK_CONFIG=together pytest
\-\-inference\-model=meta\-llama/Llama\-3\.3\-70B\-Instruct -s -v
tests/client-sdk/agents/test_agents.py::test_override_system_message_behavior


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-02-05 21:11:32 -08:00
Ashwin Bharambe
474c4bdd7a
Make a couple properties optional (#963) 2025-02-04 16:20:24 -08:00
Ashwin Bharambe
d67401c644 Several documentation fixes and fix link to API reference 2025-02-04 14:00:43 -08:00
Renamed from docs/resources/llama-stack-spec.html (Browse further)