Commit graph

135 commits

Author SHA1 Message Date
Derek Higgins
0e4307de0f
docs: Fix missing --gpu all flag in Docker run commands (#2026)
adding the --gpu all flag to Docker run commands
for meta-reference-gpu distributions ensures models are loaded into GPU
instead of CPU.

Remove docs for meta-reference-quantized-gpu
The distribution was removed in #1887
but these files were left behind.


Fixes: #1798

# What does this PR do?
Fixes doc to add --gpu all command to docker run

[//]: # (If resolving an issue, uncomment and update the line below)
Closes #1798

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

verified in docker documentation but untested

---------

Signed-off-by: Derek Higgins <derekh@redhat.com>
2025-04-25 12:17:31 -07:00
Sajikumar JS
1bb1d9b2ba
feat: Add watsonx inference adapter (#1895)
# What does this PR do?
IBM watsonx ai added as the inference [#1741
](https://github.com/meta-llama/llama-stack/issues/1741)

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

---------

Co-authored-by: Sajikumar JS <sajikumar.js@ibm.com>
2025-04-25 11:29:21 -07:00
Rashmi Pawar
ace82836c1
feat: NVIDIA allow non-llama model registration (#1859)
# What does this PR do?
Adds custom model registration functionality to NVIDIAInferenceAdapter
which let's the inference happen on:
- post-training model
- non-llama models in API Catalogue(behind
https://integrate.api.nvidia.com and endpoints compatible with
AyncOpenAI)

## Example Usage:
```python
from llama_stack.apis.models import Model, ModelType
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
client = LlamaStackAsLibraryClient("nvidia")
_ = client.initialize()

client.models.register(
        model_id=model_name,
        model_type=ModelType.llm,
        provider_id="nvidia"
)

response = client.inference.chat_completion(
    model_id=model_name,
    messages=[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
)
```

## Test Plan
```bash
pytest tests/unit/providers/nvidia/test_supervised_fine_tuning.py 
========================================================== test session starts ===========================================================
platform linux -- Python 3.10.0, pytest-8.3.5, pluggy-1.5.0
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items                                                                                                                        

tests/unit/providers/nvidia/test_supervised_fine_tuning.py ......                                                                  [100%]

============================================================ warnings summary ============================================================
../miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076
  /home/ubuntu/miniconda/envs/nvidia-1/lib/python3.10/site-packages/pydantic/fields.py:1076: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.11/migration/
    warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================== 6 passed, 1 warning in 1.51s ======================================================
```

[//]: # (## Documentation)
Updated Readme.md

cc: @dglogo, @sumitb, @mattf
2025-04-24 17:13:33 -07:00
Jash Gulabrai
cc77f79f55
feat: Add NVIDIA Eval integration (#1890)
# What does this PR do?
This PR adds support for NVIDIA's NeMo Evaluator API to the Llama Stack
eval module. The integration enables users to evaluate models via the
Llama Stack interface.

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
1. Added unit tests and successfully ran from root of project:
`./scripts/unit-tests.sh tests/unit/providers/nvidia/test_eval.py`
```
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_cancel PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_result PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_job_status PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_register_benchmark PASSED
tests/unit/providers/nvidia/test_eval.py::TestNVIDIAEvalImpl::test_run_eval PASSED
```
2. Verified I could build the Llama Stack image: `LLAMA_STACK_DIR=$(pwd)
llama stack build --template nvidia --image-type venv`

Documentation added to
`llama_stack/providers/remote/eval/nvidia/README.md`

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-24 17:12:42 -07:00
Jash Gulabrai
0d06c654d0
feat: Update NVIDIA to GA docs; remove notebook reference until ready (#1999)
# What does this PR do?
- Update NVIDIA documentation links to GA docs
- Remove reference to notebooks until merged

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-18 19:13:18 -04:00
Sébastien Han
94f83382eb
feat: allow building distro with external providers (#1967)
# What does this PR do?

We can now build a distribution that includes external providers.
Closes: https://github.com/meta-llama/llama-stack/issues/1948

## Test Plan

Build a distro with an external provider following the doc instructions.

[//]: # (## Documentation)

Added.

Rendered:


![Screenshot 2025-04-18 at 11 26
39](https://github.com/user-attachments/assets/afcf3d50-8d30-48c3-8d24-06a4b3662881)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-18 17:18:28 +02:00
Yuan Tang
c4570bcb48
docs: Add tips for debugging remote vLLM provider (#1992)
# What does this PR do?

This is helpful when debugging issues with vLLM + Llama Stack after this
PR https://github.com/vllm-project/vllm/pull/15593

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-04-18 14:47:47 +02:00
Matthew Farrellee
4205376653
chore: add meta/llama-3.3-70b-instruct as supported nvidia inference provider model (#1985)
see https://build.nvidia.com/meta/llama-3_3-70b-instruct
2025-04-17 06:50:40 -07:00
Jash Gulabrai
2ae1d7f4e6
docs: Add NVIDIA platform distro docs (#1971)
# What does this PR do?
Add NVIDIA platform docs that serve as a starting point for Llama Stack
users and explains all supported microservices.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Co-authored-by: Jash Gulabrai <jgulabrai@nvidia.com>
2025-04-17 05:54:30 -07:00
Chirag Modi
fb8ff77ff2
docs: 0.2.2 doc updates (#1961)
Add updates to android site readme for 0.2.2
2025-04-15 13:26:17 -07:00
Dmitry Rogozhkin
71ed47ea76
docs: add example for intel gpu in vllm remote (#1952)
# What does this PR do?

PR adds instructions to setup vLLM remote endpoint for vllm-remote llama
stack distribution.

## Test Plan

* Verified with manual tests of the configured vllm-remote against vllm
endpoint running on the system with Intel GPU
* Also verified with ci pytests (see cmdline below). Test passes in the
same capacity as it does on the A10 Nvidia setup (some tests do fail
which seems to be known issues with vllm remote llama stack
distribution)

```
pytest -s -v tests/integration/inference/test_text_inference.py \
   --stack-config=http://localhost:5001 \
   --text-model=meta-llama/Llama-3.2-3B-Instruct
```

CC: @ashwinb

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2025-04-15 07:56:23 -07:00
Ben Browning
7641a5cd0b
fix: 100% OpenAI API verification for together and fireworks (#1946)
# What does this PR do?

TLDR: Changes needed to get 100% passing tests for OpenAI API
verification tests when run against Llama Stack with the `together`,
`fireworks`, and `openai` providers. And `groq` is better than before,
at 88% passing.

This cleans up the OpenAI API support for image message types
(specifically `image_url` types) and handling of the `response_format`
chat completion parameter. Both of these required a few more Pydantic
model definitions in our Inference API, just to move from the
not-quite-right stubs I had in place to something fleshed out to match
the actual OpenAI API specs.

As part of testing this, I also found and fixed a bug in the litellm
implementation of openai_completion and openai_chat_completion, so the
providers based on those should actually be working now.

The method `prepare_openai_completion_params` in
`llama_stack/providers/utils/inference/openai_compat.py` was improved to
actually recursively clean up input parameters, including handling of
lists, dicts, and dumping of Pydantic models to dicts. These changes
were required to get to 100% passing tests on the OpenAI API
verification against the `openai` provider.

With the above, the together.ai provider was passing as well as it is
without Llama Stack. But, since we have Llama Stack in the middle, I
took the opportunity to clean up the together.ai provider so that it now
also passes the OpenAI API spec tests we have at 100%. That means
together.ai is now passing our verification test better when using an
OpenAI client talking to Llama Stack than it is when hitting together.ai
directly, without Llama Stack in the middle.

And, another round of work for Fireworks to improve translation of
incoming OpenAI chat completion requests to Llama Stack chat completion
requests gets the fireworks provider passing at 100%. The server-side
fireworks.ai tool calling support with OpenAI chat completions and Llama
4 models isn't great yet, but by pointing the OpenAI clients at Llama
Stack's API we can clean things up and get everything working as
expected for Llama 4 models.

## Test Plan

### OpenAI API Verification Tests

I ran the OpenAI API verification tests as below and 100% of the tests
passed.

First, start a Llama Stack server that runs the `openai` provider with
the `gpt-4o` and `gpt-4o-mini` models deployed. There's not a template
setup to do this out of the box, so I added a
`tests/verifications/openai-api-verification-run.yaml` to do this.

First, ensure you have the necessary API key environment variables set:

```
export TOGETHER_API_KEY="..."
export FIREWORKS_API_KEY="..."
export OPENAI_API_KEY="..."
```

Then, run a Llama Stack server that serves up all these providers:

```
llama stack run \
      --image-type venv \
      tests/verifications/openai-api-verification-run.yaml
```

Finally, generate a new verification report against all these providers,
both with and without the Llama Stack server in the middle.

```
python tests/verifications/generate_report.py \
      --run-tests \
      --provider \
        together \
        fireworks \
        groq \
        openai \
        together-llama-stack \
        fireworks-llama-stack \
        groq-llama-stack \
        openai-llama-stack
```

You'll see that most of the configurations with Llama Stack in the
middle now pass at 100%, even though some of them do not pass at 100%
when hitting the backend provider's API directly with an OpenAI client.

### OpenAI Completion Integration Tests with vLLM:

I also ran the smaller `test_openai_completion.py` test suite (that's
not yet merged with the verification tests) on multiple of the
providers, since I had to adjust the method signature of
openai_chat_completion a bit and thus had to touch lots of these
providers to match. Here's the tests I ran there, all passing:

```
VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct"
```

### OpenAI Completion Integration Tests with ollama

```
INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0"
```

### OpenAI Completion Integration Tests with together.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" llama stack build --template together --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct-Turbo" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct-Turbo"
```

### OpenAI Completion Integration Tests with fireworks.ai

```
INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" llama stack build --template fireworks --image-type venv --run
```

in another terminal

```
LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.1-8B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.1-8B-Instruct"

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-04-14 08:56:29 -07:00
Sébastien Han
68eeacec0e
docs: resync missing nvidia doc (#1947)
# What does this PR do?

Resync doc.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-14 15:09:16 +02:00
Nathan Weinberg
854c2ad264
fix: misleading help text for 'llama stack build' and 'llama stack run' (#1910)
# What does this PR do?
current text for 'llama stack build' and 'llama stack run' says that if
no argument is passed to '--image-name' that the active Conda
environment will be used

in reality, the active enviroment is used whether it is from conda,
virtualenv, etc.

## Test Plan
N/A

## Documentation
N/A

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-04-12 01:19:11 -07:00
raghotham
ed58a94b30
docs: fixes to quick start (#1943)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Co-authored-by: Francisco Arceo <farceo@redhat.com>
2025-04-11 13:41:23 -07:00
Mark Campbell
6aa459b00c
docs: fix errors in kubernetes deployment guide (#1914)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
Fixes a couple of errors in PVC/Secret setup and adds context for
expected Hugging Face token
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)
2025-04-11 13:04:13 +02:00
Francisco Arceo
49955a06b1
docs: Update quickstart page to structure things a little more for the novices (#1873)
# What does this PR do?
Another doc enhancement for
https://github.com/meta-llama/llama-stack/issues/1818

Summary of changes:
- `docs/source/distributions/configuration.md`
   - Updated dropdown title to include a more user-friendly description.

- `docs/_static/css/my_theme.css`
   - Added styling for `<h3>` elements to set a normal font weight.

- `docs/source/distributions/starting_llama_stack_server.md`
- Changed section headers from bold text to proper markdown headers
(e.g., `##`).
- Improved descriptions for starting Llama Stack server using different
methods (library, container, conda, Kubernetes).
- Enhanced clarity and structure by converting instructions into
markdown headers and improved formatting.

- `docs/source/getting_started/index.md`
   - Major restructuring of the "Quick Start" guide:
- Added new introductory section for Llama Stack and its capabilities.
- Reorganized steps into clearer subsections with proper markdown
headers.
- Replaced dropdowns with tabbed content for OS-specific instructions.
- Added detailed steps for setting up and running the Llama Stack server
and client.
- Introduced new sections for running basic inference and building
agents.
- Enhanced readability and visual structure with emojis, admonitions,
and examples.

- `docs/source/providers/index.md`
   - Updated the list of LLM inference providers to include "Ollama."
   - Expanded the list of vector databases to include "SQLite-Vec."

Let me know if you need further details!

## Test Plan
Renders locally, included screenshot.

# Documentation

For https://github.com/meta-llama/llama-stack/issues/1818

<img width="1332" alt="Screenshot 2025-04-09 at 11 07 12 AM"
src="https://github.com/user-attachments/assets/c106efb9-076c-4059-a4e0-a30fa738585b"
/>

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-04-10 14:09:00 -07:00
Yuan Tang
1be66d754e
docs: Redirect instructions for additional hardware accelerators for remote vLLM provider (#1923)
# What does this PR do?

vLLM website just added a [new index page for installing for different
hardware
accelerators](https://docs.vllm.ai/en/latest/getting_started/installation.html).
This PR adds a link to that page with additional edits to make sure
readers are aware that the use of GPUs on this page are for
demonstration purposes only.

This closes https://github.com/meta-llama/llama-stack/issues/1813.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-04-10 10:04:17 +02:00
Yuan Tang
712c6758c6
docs: Avoid bash script syntax highlighting for dark mode (#1918)
See
https://github.com/meta-llama/llama-stack/pull/1913#issuecomment-2790153778

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-04-09 15:43:43 -07:00
AlexHe99
983f6feeb8
docs: Update remote-vllm.md with AMD GPU vLLM server supported. (#1858)
Add the content to use AMD GPU as the vLLM server. Split the original
part to two sub chapters,
1. AMD vLLM server
2. NVIDIA vLLM server (orignal)

# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Signed-off-by: Alex He <alehe@amd.com>
2025-04-08 21:35:32 -07:00
ehhuang
7b4eb0967e
test: verification on provider's OAI endpoints (#1893)
# What does this PR do?


## Test Plan
export MODEL=accounts/fireworks/models/llama4-scout-instruct-basic;
LLAMA_STACK_CONFIG=verification pytest -s -v tests/integration/inference
--vision-model $MODEL --text-model $MODEL
2025-04-07 23:06:28 -07:00
Matthew Farrellee
c52ccc4bbd
docs: update importing_as_library.md (#1863)
LlamaStackAsLibraryClient.initialize is not async, cannot be await'd
2025-04-07 12:31:04 +02:00
Francisco Arceo
d495922949
docs: Updated documentation and Sphinx configuration (#1845)
# What does this PR do?

The goal of this PR is to make the pages easier to navigate by surfacing
the child pages on the navbar, updating some of the copy, moving some of
the files around.

Some changes:
1. Clarifying Titles
2. Restructuring "Distributions" more formally in its own page to be
consistent with Providers and adding some clarity to the child pages to
surface them and make them easier to navigate
3. Updated sphinx config to not collapse navigation by default
4. Updated copyright year to be calculated dynamically 
5. Moved `docs/source/distributions/index.md` ->
`docs/source/distributions/starting_llama_stack_server.md`

Another for https://github.com/meta-llama/llama-stack/issues/1815

## Test Plan
Tested locally and pages build (screen shots for example).

## Documentation
###  Before:
![Screenshot 2025-03-31 at 1 09
21 PM](https://github.com/user-attachments/assets/98e34f76-f0d9-4055-8e2c-441b1e7d8f6a)

### After:
![Screenshot 2025-03-31 at 1 08
52 PM](https://github.com/user-attachments/assets/dfb6b8ad-3a1d-46b6-8f54-0c553664093f)

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-03-31 13:08:05 -07:00
Ihar Hrachyshka
18bac27d4e
fix: Use CONDA_DEFAULT_ENV presence as a flag to use conda mode (#1555)
# What does this PR do?

This is the second attempt to switch to system packages by default. Now
with a hack to detect conda environment - in which case conda image-type
is used.

Note: Conda will only be used when --image-name is unset *and*
CONDA_DEFAULT_ENV is set. This means that users without conda will
correctly fall back to using system packages when no --image-* arguments
are passed at all.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Uses virtualenv:

```
$ llama stack build --template ollama --image-type venv
$ llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using virtual environment: /home/ec2-user/src/llama-stack/schedule/.local
[...]
```

Uses system packages (virtualenv already initialized):

```
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
INFO     2025-03-27 20:46:22,882 llama_stack.cli.stack.run:142 server: No image type or image name provided. Assuming environment packages.
[...]
```

Attempt to run from environment packages without necessary packages
installed:
```
$ python -m venv barebones
$ . ./barebones/bin/activate
$ pip install -e . # to install llama command
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
ModuleNotFoundError: No module named 'fastapi'
```

^ failed as expected because the environment doesn't have necessary
packages installed.

Now install some packages in the new environment:

```
$ pip install fastapi opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp aiosqlite ollama openai datasets faiss-cpu mcp autoevals
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```

Now see if setting CONDA_DEFAULT_ENV will change what happens by
default:

```
$ export CONDA_DEFAULT_ENV=base
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using conda environment: base
Conda environment base does not exist.
[...]
```

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-03-27 17:13:22 -04:00
Xi Yan
b5c27f77ad
chore: clean up distro doc (#1804)
# What does this PR do?
- hide distro doc (docker needs to be thoroughly tested). 

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
- docs

[//]: # (## Documentation)
2025-03-27 12:12:14 -07:00
Dmitry Rogozhkin
935e706b15
docs: fix remote-vllm instructions (#1805)
# What does this PR do?

* Fix location of `run.yaml` relative to the cloned llama stack
repository
* Drop `-it` from `docker run` commands as its not needed running
services

## Test Plan

* Verified running the llama stack following updated instruction

CC: @ashwinb

Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
2025-03-27 10:19:51 -04:00
Rashmi Pawar
1a73f8305b
feat: Add nemo customizer (#1448)
# What does this PR do?

This PR adds support for NVIDIA's NeMo Customizer API to the Llama Stack
post-training module. The integration enables users to fine-tune models
using NVIDIA's cloud-based customization service through a consistent
Llama Stack interface.


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
Yet to be done

Things pending under this PR:

- [x] Integration of fine-tuned model(new checkpoint) for inference with
nvidia llm distribution
- [x] distribution integration of API
- [x] Add test cases for customizer(In Progress)
- [x] Documentation

```

LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/post_training/test_supervised_fine_tuning.py 

============================================================================================================================================================================ test session starts =============================================================================================================================================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0 -- /home/ubuntu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.0', 'Platform': 'Linux-6.8.0-1021-gcp-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'nbval': '0.11.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3'}}
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: nbval-0.11.0, metadata-3.1.1, anyio-4.8.0, html-4.1.1, asyncio-0.25.3
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items                                                                                                                                                                                                                                                                                                                                                            

tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_post_training_provider_registration[txt=8B] PASSED                                                                                                                                                                                                                                                 [ 50%]
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_list_training_jobs[txt=8B] PASSED                                                                                                                                                                                                                                                                  [100%]

======================================================================================================================================================================== 2 passed, 1 warning in 0.10s ========================================================================================================================================================================
```
cc: @mattf @dglogo @sumitb

---------

Co-authored-by: Ubuntu <ubuntu@llama-stack-customizer-dev-inst-2tx95fyisatvlic4we8hidx5tfj.us-central1-a.c.brevdevprod.internal>
2025-03-25 11:01:10 -07:00
Yuan Tang
9ff82036f7
docs: Simplify vLLM deployment in K8s deployment guide (#1655)
# What does this PR do?

* Removes the use of `huggingface-cli` 
* Simplifies HF cache mount path
* Simplifies vLLM server startup command
* Separates PVC/secret creation from deployment/service
* Fixes a typo: "pod" should be "deployment"

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-24 09:08:50 -07:00
Hardik Shah
127bac6869
fix: Default to port 8321 everywhere (#1734)
As titled, moved all instances of 5001 to 8321
2025-03-20 15:50:41 -07:00
Hardik Shah
581e8ae562
fix: docker run with --pull always to fetch the latest image (#1733)
As titled
2025-03-20 15:35:48 -07:00
Yuan Tang
f5a5c5d459
docs: Add instruction on enabling tool calling for remote vLLM (#1719)
# What does this PR do?

This PR adds a link to tool calling instructions in vLLM. Users have
asked about this many times, e.g.
https://github.com/meta-llama/llama-stack/issues/1648#issuecomment-2740642077

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-03-20 15:18:17 -07:00
Nathan Weinberg
1261bc93bf
docs: fixed broken tip in distro build docs (#1673)
# What does this PR do?
fixed broken tip in distro build docs

## Test Plan
Local docs build

Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
2025-03-17 17:22:26 -07:00
cdgamarose-nv
252a487085
feat: added nvidia as safety provider (#1248)
# What does this PR do?
Adds nvidia as a safety provider by interfacing with the nemo guardrails
microservice.
This enables checking user’s input or the LLM’s output against input and
output guardrails by using the `/v1/guardrails/checks` endpoint of the[
guardrails
API.](https://developer.nvidia.com/docs/nemo-microservices/guardrails/source/guides/checks-guide.html)

## Test Plan
Deploy nemo guardrails service following the documentation:
https://developer.nvidia.com/docs/nemo-microservices/guardrails/source/getting-started/deploy-docker.html

### Standalone:
```bash
(venv) local-cdgamarose@a1u1g-rome-0153:~/llama-stack$ pytest -v -s llama_stack/providers/tests/safety/test_safety.py --providers inference=nvidia,safety=nvidia --safety-shield meta/llama-3.1-8b-instruct

=================================================================================== test session starts ===================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /localhome/local-cdgamarose/llama-stack/venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.10.12', 'Platform': 'Linux-5.15.0-122-generic-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0', 'html': '4.1.1'}}
rootdir: /localhome/local-cdgamarose/llama-stack
configfile: pyproject.toml
plugins: metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0, html-4.1.1
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items

llama_stack/providers/tests/safety/test_safety.py::TestSafety::test_shield_list[--inference=nvidia:safety=nvidia] Initializing NVIDIASafetyAdapter(http://0.0.0.0:7331)...
PASSED
llama_stack/providers/tests/safety/test_safety.py::TestSafety::test_run_shield[--inference=nvidia:safety=nvidia] PASSED

============================================================================== 2 passed, 2 warnings in 4.78s ==============================================================================

```
### Distribution:
```
llama stack run llama_stack/templates/nvidia/run-with-safety.yaml
curl -v -X 'POST' "http://localhost:8321/v1/safety/run-shield" -H 'accept: application/json' -H 'Content-Type: application/json' -d '{"shield_id": "meta/llama-3.1-8b-instruct", "messages":[{"role": "user", "content": "you are stupid"}]}'
{"violation":{"violation_level":"error","user_message":"Sorry I cannot do this.","metadata":{"self check input":{"status":"blocked"}}}}
```

[//]: # (## Documentation)

---------

Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-03-17 14:39:23 -07:00
Kelly Brown
60ae7455f6
docs: Fix trailing whitespace error (#1669)
Description: Fixes the trailing whitespace error thats coming up on main
2025-03-17 08:53:30 -07:00
Chirag Modi
b56b06037c
Web updates to point to latest releases for Mobile SDK (#1650)
# What does this PR do?
Web updates to point to latest releases for Mobile SDK

- point to `latest-release` branch for mobile sdk repos to minimize the
number of change points on the site.
- updates to some instructions
2025-03-14 17:06:07 -07:00
Xi Yan
9617468d13
fix: passthrough provider template + fix (#1612)
# What does this PR do?

- Fix issue w/ passthrough provider


[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
llama stack run

[//]: # (## Documentation)
2025-03-13 09:44:26 -07:00
Dinesh Yeduguru
85501ed875
fix: remove Llama-3.2-1B-Instruct for fireworks (#1558)
# What does this PR do?
remove Llama-3.2-1B-Instruct for fireworks as its no longer appears to
be hosted on website.


## Test Plan

python distro_codegen.py
2025-03-11 11:19:29 -07:00
Charlie Doern
b647ecd9ed
feat: add support for LLAMA_STACK_LOG_FILE (#1450)
# What does this PR do?

setting $LLAMA_STACK_LOG_FILE will pipe the logs to a file as well as
stdout. this is done by using a logging FileHandler

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-11 11:09:31 -07:00
Ashwin Bharambe
dc84bc755a
fix: revert to using faiss for ollama distro (#1530)
This is unfortunate because `sqlite-vec` seems promising. But its PIP
package is not quite complete. It does not have binary for arm64 (I
think, or maybe it even lacks 64 bit builds?) which results in the arm64
container resulting in
```
File "/usr/local/lib/python3.10/site-packages/sqlite_vec/init.py", line 17, in load
    conn.load_extension(loadable_path())
sqlite3.OperationalError: /usr/local/lib/python3.10/site-packages/sqlite_vec/vec0.so: wrong ELF class: ELFCLASS32
```

To get around I tried to install from source via `uv pip install
sqlite-vec --no-binary=sqlite-vec` however it even lacks a source
distribution which makes that impossible.

## Test Plan

Build the container locally using: 

```bash
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type container
```

Run the container as: 

```
podman run --privileged -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
    --env INFERENCE_MODEL=$INFERENCE_MODEL \
    --env OLLAMA_URL=http://host.containers.internal:11434 \
    -v ~/local/llama-stack:/app/llama-stack-source 
    localhost/distribution-ollama:dev --port $LLAMA_STACK_PORT
```

Verify the container starts up correctly. Without this patch, it would
encounter the ELFCLASS32 error.
2025-03-10 16:15:17 -07:00
Reid
0b8cb830b9
docs: update ollama doc url (#1508)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

It should changed in this pr
https://github.com/meta-llama/llama-stack/pull/1190/files#diff-53e3f35ced54ee5e57dc8b0d3b04770ed84f2f6434c6f492f42569b3c2810ecd

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-10 13:04:59 -07:00
Charlie Doern
1097912054
refactor: display defaults in help text (#1480)
# What does this PR do?

using `formatter_class=argparse.ArgumentDefaultsHelpFormatter` displays
(default: DEFAULT_VALUE) for each flag. add this formatter class to
build and run to show users some default values like `conda`, `8321`,
etc

## Test Plan

ran locally with following output: 

before: 
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
                       [--image-type {conda,container,venv}]
                       config

Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.

positional arguments:
  config                Path to config file to use for the run

options:
  -h, --help            show this help message and exit
  --port PORT           Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. Defaults to 8321
  --image-name IMAGE_NAME
                        Name of the image to run. Defaults to the current conda environment
  --disable-ipv6        Disable IPv6 support
  --env KEY=VALUE       Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times.
  --tls-keyfile TLS_KEYFILE
                        Path to TLS key file for HTTPS
  --tls-certfile TLS_CERTFILE
                        Path to TLS certificate file for HTTPS
  --image-type {conda,container,venv}
                        Image Type used during the build. This can be either conda or container or venv.
```

after:
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
                       [--image-type {conda,container,venv}]
                       config

Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.

positional arguments:
  config                Path to config file to use for the run

options:
  -h, --help            show this help message and exit
  --port PORT           Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. (default: 8321)
  --image-name IMAGE_NAME
                        Name of the image to run. Defaults to the current conda environment (default: None)
  --disable-ipv6        Disable IPv6 support (default: False)
  --env KEY=VALUE       Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: [])
  --tls-keyfile TLS_KEYFILE
                        Path to TLS key file for HTTPS (default: None)
  --tls-certfile TLS_CERTFILE
                        Path to TLS certificate file for HTTPS (default: None)
  --image-type {conda,container,venv}
                        Image Type used during the build. This can be either conda or container or venv. (default: conda)
```

[//]: # (## Documentation)

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-07 11:05:58 -08:00
Reid
40cd48fa09
chore: remove the incorrect output (#1472)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

Based on the client output changed, so the output is incorrect:

458e20702b/src/llama_stack_client/lib/cli/models/models.py (L52)
and
https://github.com/meta-llama/llama-stack/pull/1348#pullrequestreview-2654971315
previous discussion that no need to maintain the output, so remove it.

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-07 10:39:33 -08:00
Charlie Doern
8d86137ab2
docs: add information on how to set log level before running (#1430)
# What does this PR do?

currently logcat is not documented for build && run. Add documentation
in building_distro.md

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-03-06 10:54:14 -08:00
Ashwin Bharambe
abfbaf3c1b
refactor(test): move tools, evals, datasetio, scoring and post training tests (#1401)
All of the tests from `llama_stack/providers/tests/` are now moved to
`tests/integration`.

I converted the `tools`, `scoring` and `datasetio` tests to use API.
However, `eval` and `post_training` proved to be a bit challenging to
leaving those. I think `post_training` should be relatively
straightforward also.

As part of this, I noticed that `wolfram_alpha` tool wasn't added to
some of our commonly used distros so I added it. I am going to remove a
lot of code duplication from distros next so while this looks like a
one-off right now, it will go away and be there uniformly for all
distros.
2025-03-04 14:53:47 -08:00
Reid
cb085d56c6
docs: fix typo (#1390)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

---------

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-04 09:02:55 -08:00
Reid
5c9d12a206
chore: improve --port help text (#1346)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

It would be better to tell user env var usage in help text.
```
before:
$ llama stack run --help
  --port PORT           Port to run the server on. Defaults to 8321

after
$ llama stack run --help
  --port PORT           Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. Defaults to 8321
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-03 16:49:03 -08:00
Reid
dc069025f5
chore: fix typo (#1343)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]


21ec67356c/distributions

It should missed the `s`.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-01 10:36:04 -08:00
Reid
66cd128ab5
docs: update the downloaded list doc (#1266)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

Since released the `--downloaded` option, so update the related
documents.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-28 10:10:12 -08:00
Reid
5366dab31e
docs: update build doc (#1262)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]


55eb257459/llama_stack/cli/stack/run.py (L22)


55eb257459/llama_stack/cli/stack/_build.py (L103)

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-28 10:03:45 -08:00
Reid
c2d2a80b0a
docs: update the output of llama-stack-client models list (#1271)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-27 16:46:38 -08:00