Commit graph

644 commits

Author SHA1 Message Date
Abhishek koserwal
5a3d777b20
feat: add llama stack rm command (#2127)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

```
llama stack rm llamastack-test
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
#225 

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
2025-05-21 10:25:51 +02:00
Derek Higgins
3339844fda
feat: Add "instructions" support to responses API (#2205)
# What does this PR do?
Add support for "instructions" to the responses API. Instructions
provide a way to swap out system (or developer) messages in new
responses.


## Test Plan
unit tests added

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-05-20 09:52:10 -07:00
ehhuang
047303e339
feat: introduce APIs for retrieving chat completion requests (#2145)
# What does this PR do?
This PR introduces APIs to retrieve past chat completion requests, which
will be used in the LS UI.

Our current `Telemetry` is ill-suited for this purpose as it's untyped
so we'd need to filter by obscure attribute names, making it brittle.

Since these APIs are 'provided by stack' and don't need to be
implemented by inference providers, we introduce a new InferenceProvider
class, containing the existing inference protocol, which is implemented
by inference providers.

The APIs are OpenAI-compliant, with an additional `input_messages`
field.


## Test Plan
This PR just adds the API and marks them provided_by_stack. S
tart stack server -> doesn't crash
2025-05-18 21:43:19 -07:00
Charlie Doern
f02f7b28c1
feat: add huggingface post_training impl (#2132)
# What does this PR do?


adds an inline HF SFTTrainer provider. Alongside touchtune -- this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc

the provider comes with one recipe `finetune_single_device` which works
both with and without LoRA.

any model that is a valid HF identifier can be given and the model will
be pulled.

this has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box

The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training

if checkpoint_dir is none, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.


## Test Plan

re-enabled post_training integration test suite with a singular test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API

runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.

[//]: # (## Documentation)

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-16 14:41:28 -07:00
Charlie Doern
1ae61e8d5f
fix: replace all instances of --yaml-config with --config (#2196)
# What does this PR do?

start_stack.sh was using --yaml-config which is deprecated.

a bunch of distro docs also mentioned --yaml-config. Replaces all
instances and logic for --yaml-config with --config

resolves #2189

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-16 14:31:12 -07:00
grs
b8f7e1504d
feat: allow the interface on which the server will listen to be configured (#2015)
# What does this PR do?

It may not always be desirable to listen on all interfaces, which is the
default. As an example, by listening instead only on a loopback
interface, the server cannot be reached except from within the host it
is run on. This PR makes this configurable, through a CLI option, an env
var or an entry on the config file.

## Test Plan

I ran a server with and without the added CLI argument to verify that
the argument is used if provided, but the default is as it was before if
not.

Signed-off-by: Gordon Sim <gsim@redhat.com>
2025-05-16 12:59:31 -07:00
Sébastien Han
bb5fca9521
chore: more API validators (#2165)
# What does this PR do?

We added:

* make sure docstrings are present with 'params' and 'returns'
* fail if someone sets 'returns: None'
* fix the failing APIs

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-15 11:22:51 -07:00
Charlie Doern
e46de23be6
feat: refactor external providers dir (#2049)
# What does this PR do?

currently the "default" dir for external providers is
`/etc/llama-stack/providers.d`

This dir is not used anywhere nor created.

Switch to a more friendly `~/.llama/providers.d/`

This allows external providers to actually create this dir and/or
populate it upon installation, `pip` cannot create directories in `etc`.

If a user does not specify a dir, default to this one

see https://github.com/containers/ramalama-stack/issues/36

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-15 20:17:03 +02:00
Yuan Tang
7e25c8df28
fix: ReadTheDocs should display all versions (#2172)
# What does this PR do?

Currently the website only displays the "latest" version. This is
because our config and workflow do not include version information. This
PR adds missing version info.

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-05-15 11:41:15 -04:00
Francisco Arceo
8e7ab146f8
feat: Adding support for customizing chunk context in RAG insertion and querying (#2134)
# What does this PR do?
his PR allows users to customize the template used for chunks when
inserted into the context. Additionally, this enables metadata injection
into the context of an LLM for RAG. This makes a naive and crude
assumption that each chunk should include the metadata, this is
obviously redundant when multiple chunks are returned from the same
document. In order to remove any sort of duplication of chunks, we'd
have to make much more significant changes so this is a reasonable first
step that unblocks users requesting this enhancement in
https://github.com/meta-llama/llama-stack/issues/1767.

In the future, this can be extended to support citations.


List of Changes:
- `llama_stack/apis/tools/rag_tool.py`
    - Added  `chunk_template` field in `RAGQueryConfig`.
- Added `field_validator` to validate the `chunk_template` field in
`RAGQueryConfig`.
- Ensured the `chunk_template` field includes placeholders `{index}` and
`{chunk.content}`.
- Updated the `query` method to use the `chunk_template` for formatting
chunk text content.
- `llama_stack/providers/inline/tool_runtime/rag/memory.py`
- Modified the `insert` method to pass `doc.metadata` for chunk
creation.
- Enhanced the `query` method to format results using `chunk_template`
and exclude unnecessary metadata fields like `token_count`.
- `llama_stack/providers/utils/memory/vector_store.py`
- Updated `make_overlapped_chunks` to include metadata serialization and
token count for both content and metadata.
    - Added error handling for metadata serialization issues.
- `pyproject.toml`
- Added `pydantic.field_validator` as a recognized `classmethod`
decorator in the linting configuration.
- `tests/integration/tool_runtime/test_rag_tool.py`
- Refactored test assertions to separate `assert_valid_chunk_response`
and `assert_valid_text_response`.
- Added integration tests to validate `chunk_template` functionality
with and without metadata inclusion.
- Included a test case to ensure `chunk_template` validation errors are
raised appropriately.
- `tests/unit/rag/test_vector_store.py`
- Added unit tests for `make_overlapped_chunks`, verifying chunk
creation with overlapping tokens and metadata integrity.
- Added tests to handle metadata serialization errors, ensuring proper
exception handling.
- `docs/_static/llama-stack-spec.html`
- Added a new `chunk_template` field of type `string` with a default
template for formatting retrieved chunks in RAGQueryConfig.
    - Updated the `required` fields to include `chunk_template`.
- `docs/_static/llama-stack-spec.yaml`
- Introduced `chunk_template` field with a default value for
RAGQueryConfig.
- Updated the required configuration list to include `chunk_template`.
- `docs/source/building_applications/rag.md`
- Documented the `chunk_template` configuration, explaining how to
customize metadata formatting in RAG queries.
- Added examples demonstrating the usage of the `chunk_template` field
in RAG tool queries.
    - Highlighted default values for `RAG` agent configurations.

# Resolves https://github.com/meta-llama/llama-stack/issues/1767

## Test Plan
Updated both `test_vector_store.py` and `test_rag_tool.py` and tested
end-to-end with a script.

I also tested the quickstart to enable this and specified this metadata:
```python
document = RAGDocument(
    document_id="document_1",
    content=source,
    mime_type="text/html",
    metadata={"author": "Paul Graham", "title": "How to do great work"},
)
```
Which produced the output below: 

![Screenshot 2025-05-13 at 10 53
43 PM](https://github.com/user-attachments/assets/bb199d04-501e-4217-9c44-4699d43d5519)

This highlights the usefulness of the additional metadata. Notice how
the metadata is redundant for different chunks of the same document. I
think we can update that in a subsequent PR.

# Documentation
I've added a brief comment about this in the documentation to outline
this to users and updated the API documentation.

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-05-14 21:56:20 -04:00
Ihar Hrachyshka
1de0dfaab5
docs: Clarify kfp provider is both inline and remote (#2144)
The provider selling point *is* using the same provider for both.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-14 09:37:07 +02:00
Ben Browning
8e316c9b1e
feat: function tools in OpenAI Responses (#2094)
# What does this PR do?

This is a combination of what was previously 3 separate PRs - #2069,
#2075, and #2083. It turns out all 3 of those are needed to land a
working function calling Responses implementation. The web search
builtin tool was already working, but this wires in support for custom
function calling.

I ended up combining all three into one PR because they all had lots of
merge conflicts, both with each other but also with #1806 that just
landed. And, because landing any of them individually would have only
left a partially working implementation merged.

The new things added here are:
* Storing of input items from previous responses and restoring of those
input items when adding previous responses to the conversation state
* Handling of multiple input item messages roles, not just "user"
messages.
* Support for custom tools passed into the Responses API to enable
function calling outside of just the builtin websearch tool.

Closes #2074
Closes #2080

## Test Plan

### Unit Tests

Several new unit tests were added, and they all pass. Ran via:

```
python -m pytest -s -v tests/unit/providers/agents/meta_reference/test_openai_responses.py
```

### Responses API Verification Tests

I ran our verification run.yaml against multiple providers to ensure we
were getting a decent pass rate. Specifically, I ensured the new custom
tool verification test passed across multiple providers and that the
multi-turn examples passed across at least some of the providers (some
providers struggle with the multi-turn workflows still).

Running the stack setup for verification testing:

```
llama stack run --image-type venv tests/verifications/openai-api-verification-run.yaml
```

Together, passing 100% as an example:

```
pytest -s -v 'tests/verifications/openai_api/test_responses.py' --provider=together-llama-stack
```

## Documentation

We will need to start documenting the OpenAI APIs, but for now the
Responses stuff is still rapidly evolving so delaying that.

---------

Signed-off-by: Derek Higgins <derekh@redhat.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-05-13 11:29:15 -07:00
Nathan Weinberg
e0d10dd0b1
docs: revamp testing documentation (#2155)
# What does this PR do?
reduces duplication and centralizes information to be easier to find for
contributors

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-05-13 11:28:29 -07:00
Divya
c985ea6326
fix: Adding Embedding model to watsonx inference (#2118)
# What does this PR do?
Issue Link : https://github.com/meta-llama/llama-stack/issues/2117

## Test Plan
Once added, User will be able to use Sentence Transformer model
`all-MiniLM-L6-v2`
2025-05-12 10:58:22 -07:00
Sébastien Han
53b7f50828
chore: force ellipsis in API webmethods (#2141)
# What does this PR do?

This new check will fail if some webmethods are missing the ellipsis:

```
API Method Return Type Validation Errors:

Method Api.eval.job_result does not contain ellipsis (...) in its implementation
Method Api.agents.create_agent_turn does not contain ellipsis (...) in its implementation
Method Api.agents.create_openai_response does not contain ellipsis (...) in its implementation
Method Api.eval.evaluate_rows does not contain ellipsis (...) in its implementation
Method Api.eval.run_eval does not contain ellipsis (...) in its implementation
```

Unless not implemented.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-12 10:55:39 -07:00
Sébastien Han
43e623eea6
chore: remove last instances of code-interpreter provider (#2143)
Was removed in https://github.com/meta-llama/llama-stack/pull/2087

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-12 10:54:43 -07:00
Ashwin Bharambe
473a07f624
fix: revert "feat(provider): adding llama4 support in together inference provider (#2123)" (#2124)
This reverts commit 0f878ad87a.

The llama4 models already existed for Together.

cc @yogishbaliga @bbrowning
2025-05-08 15:18:16 -07:00
Yogish Baliga
0f878ad87a
feat(provider): adding llama4 support in together inference provider (#2123)
# What does this PR do?
Adding Llama4 model support in TogetherAI provider
2025-05-08 14:27:56 -07:00
Dinesh Yeduguru
fe5f5e530c
feat: add metrics query API (#1394)
# What does this PR do?
Adds the API to query metrics from telemetry.

## Test Plan
llama stack run ~/.llama/distributions/fireworks/fireworks-run.yaml

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-05-07 10:11:26 -07:00
Sébastien Han
c91e3552a3
feat: implementation for agent/session list and describe (#1606)
Create a new agent:

```
curl --request POST \
  --url http://localhost:8321/v1/agents \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
  "agent_config": {
    "sampling_params": {
      "strategy": {
        "type": "greedy"
      },
      "max_tokens": 0,
      "repetition_penalty": 1
    },
    "input_shields": [
      "string"
    ],
    "output_shields": [
      "string"
    ],
    "toolgroups": [
      "string"
    ],
    "client_tools": [
      {
        "name": "string",
        "description": "string",
        "parameters": [
          {
            "name": "string",
            "parameter_type": "string",
            "description": "string",
            "required": true,
            "default": null
          }
        ],
        "metadata": {
          "property1": null,
          "property2": null
        }
      }
    ],
    "tool_choice": "auto",
    "tool_prompt_format": "json",
    "tool_config": {
      "tool_choice": "auto",
      "tool_prompt_format": "json",
      "system_message_behavior": "append"
    },
    "max_infer_iters": 10,
    "model": "string",
    "instructions": "string",
    "enable_session_persistence": false,
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "property1": null,
        "property2": null
      }
    }
  }
}'
```

Get agent:

```
curl http://127.0.0.1:8321/v1/agents/9abad4ab-2c77-45f9-9d16-46b79d2bea1f
{"agent_id":"9abad4ab-2c77-45f9-9d16-46b79d2bea1f","agent_config":{"sampling_params":{"strategy":{"type":"greedy"},"max_tokens":0,"repetition_penalty":1.0},"input_shields":["string"],"output_shields":["string"],"toolgroups":["string"],"client_tools":[{"name":"string","description":"string","parameters":[{"name":"string","parameter_type":"string","description":"string","required":true,"default":null}],"metadata":{"property1":null,"property2":null}}],"tool_choice":"auto","tool_prompt_format":"json","tool_config":{"tool_choice":"auto","tool_prompt_format":"json","system_message_behavior":"append"},"max_infer_iters":10,"model":"string","instructions":"string","enable_session_persistence":false,"response_format":{"type":"json_schema","json_schema":{"property1":null,"property2":null}}},"created_at":"2025-03-12T16:18:28.369144Z"}%
```

List agents:

```
curl http://127.0.0.1:8321/v1/agents|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1680  100  1680    0     0   498k      0 --:--:-- --:--:-- --:--:--  546k
{
  "data": [
    {
      "agent_id": "9abad4ab-2c77-45f9-9d16-46b79d2bea1f",
      "agent_config": {
        "sampling_params": {
          "strategy": {
            "type": "greedy"
          },
          "max_tokens": 0,
          "repetition_penalty": 1.0
        },
        "input_shields": [
          "string"
        ],
        "output_shields": [
          "string"
        ],
        "toolgroups": [
          "string"
        ],
        "client_tools": [
          {
            "name": "string",
            "description": "string",
            "parameters": [
              {
                "name": "string",
                "parameter_type": "string",
                "description": "string",
                "required": true,
                "default": null
              }
            ],
            "metadata": {
              "property1": null,
              "property2": null
            }
          }
        ],
        "tool_choice": "auto",
        "tool_prompt_format": "json",
        "tool_config": {
          "tool_choice": "auto",
          "tool_prompt_format": "json",
          "system_message_behavior": "append"
        },
        "max_infer_iters": 10,
        "model": "string",
        "instructions": "string",
        "enable_session_persistence": false,
        "response_format": {
          "type": "json_schema",
          "json_schema": {
            "property1": null,
            "property2": null
          }
        }
      },
      "created_at": "2025-03-12T16:18:28.369144Z"
    },
    {
      "agent_id": "a6643aaa-96dd-46db-a405-333dc504b168",
      "agent_config": {
        "sampling_params": {
          "strategy": {
            "type": "greedy"
          },
          "max_tokens": 0,
          "repetition_penalty": 1.0
        },
        "input_shields": [
          "string"
        ],
        "output_shields": [
          "string"
        ],
        "toolgroups": [
          "string"
        ],
        "client_tools": [
          {
            "name": "string",
            "description": "string",
            "parameters": [
              {
                "name": "string",
                "parameter_type": "string",
                "description": "string",
                "required": true,
                "default": null
              }
            ],
            "metadata": {
              "property1": null,
              "property2": null
            }
          }
        ],
        "tool_choice": "auto",
        "tool_prompt_format": "json",
        "tool_config": {
          "tool_choice": "auto",
          "tool_prompt_format": "json",
          "system_message_behavior": "append"
        },
        "max_infer_iters": 10,
        "model": "string",
        "instructions": "string",
        "enable_session_persistence": false,
        "response_format": {
          "type": "json_schema",
          "json_schema": {
            "property1": null,
            "property2": null
          }
        }
      },
      "created_at": "2025-03-12T16:17:12.811273Z"
    }
  ]
}
```

Create sessions:

```
curl --request POST \
  --url http://localhost:8321/v1/agents/{agent_id}/session \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --data '{
  "session_name": "string"
}'
```

List sessions:

```
 curl http://127.0.0.1:8321/v1/agents/9abad4ab-2c77-45f9-9d16-46b79d2bea1f/sessions|jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   263  100   263    0     0  90099      0 --:--:-- --:--:-- --:--:--  128k
[
  {
    "session_id": "2b15c4fc-e348-46c1-ae32-f6d424441ac1",
    "session_name": "string",
    "turns": [],
    "started_at": "2025-03-12T17:19:17.784328"
  },
  {
    "session_id": "9432472d-d483-4b73-b682-7b1d35d64111",
    "session_name": "string",
    "turns": [],
    "started_at": "2025-03-12T17:19:19.885834"
  }
]
```

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-07 14:49:23 +02:00
Jorge Piedrahita Ortiz
b2b00a216b
feat(providers): sambanova updated to use LiteLLM openai-compat (#1596)
# What does this PR do?

switch sambanova inference adaptor to LiteLLM usage to simplify
integration and solve issues with current adaptor when streaming and
tool calling, models and templates updated

## Test Plan
pytest -s -v tests/integration/inference/test_text_inference.py
--stack-config=sambanova
--text-model=sambanova/Meta-Llama-3.3-70B-Instruct

pytest -s -v tests/integration/inference/test_vision_inference.py
--stack-config=sambanova
--vision-model=sambanova/Llama-3.2-11B-Vision-Instruct
2025-05-06 16:50:22 -07:00
Sébastien Han
1a529705da
chore: more mypy fixes (#2029)
# What does this PR do?

Mainly tried to cover the entire llama_stack/apis directory, we only
have one left. Some excludes were just noop.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-06 09:52:31 -07:00
Christian Zaccaria
feb9eb8b0d
docs: Remove datasets.rst and fix llama-stack build commands (#2061)
# Issue
Closes #2073 

# What does this PR do?
- Removes the `datasets.rst` from the list of document urls as it no
longer exists in torchtune. Referenced PR:
https://github.com/pytorch/torchtune/pull/1781

- Added a step to run `uv sync`. Previously, I would get the following
error:

```
➜  llama-stack git:(remove-deprecated-rst) uv venv --python 3.10
source .venv/bin/activate
Using CPython 3.10.13 interpreter at: /usr/bin/python3.10
Creating virtual environment at: .venv
Activate with: source .venv/bin/activate
(llama-stack) ➜  llama-stack git:(remove-deprecated-rst) INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
zsh: llama: command not found...

```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

To test: Run through `rag_agent` example in the `detailed_tutorial.md`
file.

[//]: # (## Documentation)
2025-05-06 09:51:20 -07:00
Divya
3022f7b642
feat: Adding TLS support for Remote::Milvus vector_io (#2011)
# What does this PR do?
For the Issue :-
#[2010](https://github.com/meta-llama/llama-stack/issues/2010)
Currently, if we try to connect the Llama stack server to a remote
Milvus instance that has TLS enabled, the connection fails because TLS
support is not implemented in the Llama stack codebase. As a result,
users are unable to use secured Milvus deployments out of the box.

After adding this , the user will be able to connect to remote::Milvus
which is TLS enabled .
if TLS enabled :-
```
vector_io:
  - provider_id: milvus
    provider_type: remote::milvus
    config:
      uri: "http://<host>:<port>"
      token: "<user>:<password>"
      secure: True
      server_pem_path: "path/to/server.pem"
```
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
I have already tested it by connecting to a Milvus instance which is TLS
enabled and i was able to start llama stack server .
2025-05-06 14:15:34 +02:00
Christina Xu
65cc971877
docs: Add TrustyAI LM-Eval to list of known external providers (#2020)
# What does this PR do?
Adds documentation for the remote [TrustyAI LM-Eval Eval
Provider](https://github.com/trustyai-explainability/llama-stack-provider-lmeval).
LM-Eval is a service for large language model evaluation based on the
open source project
[lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
and is integrated into the [TrustyAI Kubernetes
Operator](https://trustyai-explainability.github.io/trustyai-site/main/trustyai-operator.html).
2025-05-06 14:11:55 +02:00
Sébastien Han
a5d151e912
docs: fix typo mivus.md -> milvus.md (#2102)
Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-05 09:48:38 -07:00
Ihar Hrachyshka
16e163da0e
docs: List external kubeflow pipelines provider prototype (#2100)
# What does this PR do?

Lists another external provider example (kfp).

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-05 10:24:52 +02:00
Christian Zaccaria
9f27578929
fix: improve Mermaid diagram visibility in dark mode (#2092)
# What does this PR do?
Closes #2078 

Previously, the Agent Execution Loop diagram was barely visible in dark
mode:


![image](https://github.com/user-attachments/assets/78567334-c57f-4cd0-ba93-290b20ed3aba)

I experimented with styling individual classes, but ultimately found
that adding an off-white background provides the best visibility in both
dark and light modes:


![image](https://github.com/user-attachments/assets/419d153a-d870-410b-b635-02b95da67a3d)

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

The documentation can be built locally by following the docs:
https://llama-stack.readthedocs.io/en/latest/contributing/index.html#building-the-documentation

[//]: # (## Documentation)
2025-05-02 13:09:45 -07:00
Ashwin Bharambe
272d3359ee
fix: remove code interpeter implementation (#2087)
# What does this PR do?

The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.

This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.

## Test Plan

Existing tests.
2025-05-01 14:35:08 -07:00
Ihar Hrachyshka
9e6561a1ec
chore: enable pyupgrade fixes (#1806)
# What does this PR do?

The goal of this PR is code base modernization.

Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)

Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth of note can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py` where reflection code was updated
to deal with "newer" types.

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-05-01 14:23:50 -07:00
Derek Higgins
64829947d0
feat: Add temperature support to responses API (#2065)
# What does this PR do?
Add support for the temperature to the responses API 


## Test Plan
Manually tested simple case
unit tests added for simple case and tool calls

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-05-01 11:47:58 -07:00
Sébastien Han
dc94433072
feat(pre-commit): enhance pre-commit hooks with additional checks (#2014)
# What does this PR do?

Add several new pre-commit hooks to improve code quality and security:

- no-commit-to-branch: prevent direct commits to protected branches like
`main`
- check-yaml: validate YAML files
- detect-private-key: prevent accidental commit of private keys
- requirements-txt-fixer: maintain consistent requirements.txt format
and sorting
- mixed-line-ending: enforce LF line endings to avoid mixed line endings
- check-executables-have-shebangs: ensure executable scripts have
shebangs
- check-json: validate JSON files
- check-shebang-scripts-are-executable: verify shebang scripts are
executable
- check-symlinks: validate symlinks and report broken ones
- check-toml: validate TOML files mainly for pyproject.toml

The respective fixes have been included.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-30 11:35:49 -07:00
Nathan Weinberg
d897313e0b
feat: add additional logging to llama stack build (#1689)
# What does this PR do?
Partial revert of fa68ded07c

this commit ensures users know where their new templates are generated
and how to run the newly built distro locally

discussion on Discord:
1351652390

## Test Plan
Did a local run - let me know if we want any unit testing covering this

![Screenshot from 2025-03-18
22-38-18](https://github.com/user-attachments/assets/6d5dac52-edad-4a84-992f-a3c23cda10c8)

## Documentation
Updated "Zero to Hero" guide with new output

---------

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-04-30 11:06:24 -07:00
Ashwin Bharambe
4d0bfbf984
feat: add api.llama provider, llama-guard-4 model (#2058)
This PR adds a llama-stack inference provider for `api.llama.com`, as
well as adds entries for Llama-Guard-4 and updated Prompt-Guard models.
2025-04-29 10:07:41 -07:00
Ben Browning
8dfce2f596
feat: OpenAI Responses API (#1989)
# What does this PR do?

This provides an initial [OpenAI Responses
API](https://platform.openai.com/docs/api-reference/responses)
implementation. The API is not yet complete, and this is more a
proof-of-concept to show how we can store responses in our key-value
stores and use them to support the Responses API concepts like
`previous_response_id`.

## Test Plan

I've added a new
`tests/integration/openai_responses/test_openai_responses.py` as part of
a test-driven development for this new API. I'm only testing this
locally with the remote-vllm provider for now, but it should work with
any of our inference providers since the only API it requires out of the
inference provider is the `openai_chat_completion` endpoint.

```
VLLM_URL="http://localhost:8000/v1" \
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack build --template remote-vllm --image-type venv --run
```

```
LLAMA_STACK_CONFIG="http://localhost:8321" \
python -m pytest -v \
  tests/integration/openai_responses/test_openai_responses.py \
  --text-model "meta-llama/Llama-3.2-3B-Instruct"
 ```

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-04-28 14:06:00 -07:00
Sébastien Han
79851d93aa
feat: Add Kubernetes authentication (#1778)
# What does this PR do?

This commit adds a new authentication system to the Llama Stack server
with support for Kubernetes and custom authentication providers. Key
changes include:

- Implemented KubernetesAuthProvider for validating Kubernetes service
account tokens
- Implemented CustomAuthProvider for validating tokens against external
endpoints - this is the same code that was already present.
- Added test for Kubernetes
- Updated server configuration to support authentication settings
- Added documentation for authentication configuration and usage

The authentication system supports:
- Bearer token validation
- Kubernetes service account token validation
- Custom authentication endpoints

## Test Plan

Setup a Kube cluster using Kind or Minikube.

Run a server with:

```
server:
  port: 8321
  auth:
    provider_type: kubernetes
    config:
      api_server_url: http://url
      ca_cert_path: path/to/cert (optional)
```

Run:

```
curl -s -L -H "Authorization: Bearer $(kubectl create token my-user)" http://127.0.0.1:8321/v1/providers
```

Or replace "my-user" with your service account.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-28 22:24:58 +02:00
Rashmi Pawar
e6bbf8d20b
feat: Add NVIDIA NeMo datastore (#1852)
# What does this PR do?
Implemetation of NeMO Datastore register, unregister API.

Open Issues: 
- provider_id gets set to `localfs` in client.datasets.register() as it
is specified in routing_tables.py: DatasetsRoutingTable
see: #1860

Currently I have passed `"provider_id":"nvidia"` in metadata and have
parsed that in `DatasetsRoutingTable`
(Not the best approach, but just a quick workaround to make it work for
now.)

## Test Plan
- Unit test cases: `pytest
tests/unit/providers/nvidia/test_datastore.py`
```bash
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0, asyncio-0.26.0, nbval-0.11.0, metadata-3.1.1, html-4.1.1, cov-6.1.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items                                                                                                                        

tests/unit/providers/nvidia/test_datastore.py ..                                                                                   [100%]

============================================================ warnings summary ============================================================

====================================================== 2 passed, 1 warning in 0.84s ======================================================
```

cc: @dglogo, @mattf, @yanxi0830
2025-04-28 09:41:59 -07:00
Derek Higgins
0e4307de0f
docs: Fix missing --gpu all flag in Docker run commands (#2026)
adding the --gpu all flag to Docker run commands
for meta-reference-gpu distributions ensures models are loaded into GPU
instead of CPU.

Remove docs for meta-reference-quantized-gpu
The distribution was removed in #1887
but these files were left behind.


Fixes: #1798

# What does this PR do?
Fixes doc to add --gpu all command to docker run

[//]: # (If resolving an issue, uncomment and update the line below)
Closes #1798

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

verified in docker documentation but untested

---------

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-04-25 12:17:31 -07:00
Sajikumar JS
1bb1d9b2ba
feat: Add watsonx inference adapter (#1895)
# What does this PR do?
IBM watsonx ai added as the inference [#1741
](https://github.com/meta-llama/llama-stack/issues/1741)

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

---------

Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
2025-04-25 11:29:21 -07:00
Kevin Postlethwait
d9e00fca66
fix: specify nbformat version in nb (#2023)
# What does this PR do?
Adding nbformat version fixes this issue. Not sure exactly why this
needs to be done, but this version was rewritten to the bottom of a nb
file when I changed its name trying to get to the bottom of this. When I
opened it on GH the issue was no longer present
 Closes #1837 

## Test Plan
N/A
2025-04-25 10:10:37 +02:00
Rashmi Pawar
ace82836c1
feat: NVIDIA allow non-llama model registration (#1859)
# What does this PR do?
Adds custom model registration functionality to NVIDIAInferenceAdapter
which let's the inference happen on:
- post-training model
- non-llama models in API Catalogue(behind
https://integrate.api.nvidia.com and endpoints compatible with
AyncOpenAI)

## Example Usage:
```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()

client.models.register(
        model_id=model_name,
        model_type=ModelType.llm,
        provider_id="nvidia"
)

response = client.inference.chat_completion(
    model_id=model_name,
    messages=[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
)
```

## Test Plan
```bash
pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py 
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items                                                                                                                        

tests/unit/providers/nvidia/test_supervised_fine_tuning.py ......                                                                  [100%]

============================================================ warnings summary ============================================================
../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076
  /home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
    warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 6 passed, 1 warning in 1.51s ======================================================
```

[//]: # (## Documentation)
Updated Readme.md

cc: @dglogo, @sumitb, @mattf
2025-04-24 17:13:33 -07:00
Jash Gulabrai
cc77f79f55
feat: Add NVIDIA Eval integration (#1890)
# What does this PR do?
This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack
eval module. The integration enables users to evaluate models via the
Llama Stack interface.

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
1. Added unit tests and successfully ran from root of project:
`./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py`
```
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED
```
2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd)
llama stack build --template nvidia --image-type venv`

Documentation added to
`llama_stack/providers/remote/eval/nvidia/README.md`

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-24 17:12:42 -07:00
Charlie Doern
a673697858
chore: rename ramalama provider (#2008)
# What does this PR do?

the ramalama team has decided to rename their external provider
`ramalama-stack` (more catchy!). Update docs accordingly

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-04-24 09:34:15 +02:00
Nathan Weinberg
6a44e7ba20
docs: add API to external providers table (#2006)
Also does a minor reorg of the columns

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-04-23 15:58:10 +02:00
Kevin Postlethwait
e0fa67c81c
docs: add examples for how to define RAG docs (#1981)
# What does this PR do?
Add examples for how to define RAGDocuments. Not sure if this is the
best place for these docs. @raghotham Please advise

## Test Plan
None, documentation

[//]: # (## Documentation)

Signed-off-by: Kevin <kpostlet@redhat.com>
2025-04-23 15:39:18 +02:00
Nathan Weinberg
d6e88e0bc6
docs: add RamaLama to list of known external providers (#2004)
The RamaLama project now has an external provider offering for Llama
Stack: https://github.com/containers/llama-stack-provider-ramalama

See also: https://github.com/meta-llama/llama-stack/pull/1676

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-04-23 09:44:18 +02:00
Jash Gulabrai
0d06c654d0
feat: Update NVIDIA to GA docs; remove notebook reference until ready (#1999)
# What does this PR do?
- Update NVIDIA documentation links to GA docs
- Remove reference to notebooks until merged

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-18 19:13:18 -04:00
Sébastien Han
94f83382eb
feat: allow building distro with external providers (#1967)
# What does this PR do?

We can now build a distribution that includes external providers.
Closes: https://github.com/meta-llama/llama-stack/issues/1948

## Test Plan

Build a distro with an external provider following the doc instructions.

[//]: # (## Documentation)

Added.

Rendered:


![Screenshot 2025-04-18 at 11 26
39](https://github.com/user-attachments/assets/afcf3d50-8d30-48c3-8d24-06a4b3662881)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-18 17:18:28 +02:00
Yuan Tang
c4570bcb48
docs: Add tips for debugging remote vLLM provider (#1992)
# What does this PR do?

This is helpful when debugging issues with vLLM + Llama Stack after this
PR https://github.com/vllm-project/vllm/pull/15593

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-04-18 14:47:47 +02:00
Yuan Tang
4c6b7005fa
fix: Fix docs lint issues (#1993)
# What does this PR do?

This was not caught as part of the CI build:
dd62a2388c.
[This PR](https://github.com/meta-llama/llama-stack/pull/1354) was too
old and didn't include the additional CI builds yet.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-04-18 02:33:13 -04:00