Commit graph

68 commits

Author SHA1 Message Date
Sébastien Han
43c1f39bd6
refactor(env)!: enhanced environment variable substitution (#2490)
# What does this PR do?

This commit significantly improves the environment variable substitution
functionality in Llama Stack configuration files:
* The version field in configuration files has been changed from string
to integer type for better type consistency across build and run
configurations.

* The environment variable substitution system for ${env.FOO:} was fixed
and properly returns an error

* The environment variable substitution system for ${env.FOO+} returns
None instead of an empty strings, it better matches type annotations in
config fields

* The system includes automatic type conversion for boolean, integer,
and float values.

* The error messages have been enhanced to provide clearer guidance when
environment variables are missing, including suggestions for using
default values or conditional syntax.

* Comprehensive documentation has been added to the configuration guide
explaining all supported syntax patterns, best practices, and runtime
override capabilities.

* Multiple provider configurations have been updated to use the new
conditional syntax for optional API keys, making the system more
flexible for different deployment scenarios. The telemetry configuration
has been improved to properly handle optional endpoints with appropriate
validation, ensuring that required endpoints are specified when their
corresponding sinks are enabled.

* There were many instances of ${env.NVIDIA_API_KEY:} that should have
caused the code to fail. However, due to a bug, the distro server was
still being started, and early validation wasn’t triggered. As a result,
failures were likely being handled downstream by the providers. I’ve
maintained similar behavior by using ${env.NVIDIA_API_KEY:+}, though I
believe this is incorrect for many configurations. I’ll leave it to each
provider to correct it as needed.

* Environment variable substitution now uses the same syntax as Bash
parameter expansion.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-26 08:20:08 +05:30
Sébastien Han
ac5fd57387
chore: remove nested imports (#2515)
# What does this PR do?

* Given that our API packages use "import *" in `__init.py__` we don't
need to do `from llama_stack.apis.models.models` but simply from
llama_stack.apis.models. The decision to use `import *` is debatable and
should probably be revisited at one point.

* Remove unneeded Ruff F401 rule
* Consolidate Ruff F403 rule in the pyprojectfrom
llama_stack.apis.models.models

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-26 08:01:05 +05:30
Ben Browning
fa0b0c13d4
fix: Ollama should be optional in starter distro (#2482)
Some checks failed
Integration Tests / test-matrix (http, 3.13, vector_io) (push) Failing after 14s
Integration Tests / test-matrix (http, 3.13, scoring) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.12, scoring) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 9s
Integration Tests / test-matrix (http, 3.12, tool_runtime) (push) Failing after 18s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 7s
Integration Tests / test-matrix (http, 3.13, inspect) (push) Failing after 16s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 12s
Integration Tests / test-matrix (http, 3.13, tool_runtime) (push) Failing after 14s
Integration Tests / test-matrix (library, 3.13, inspect) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.13, agents) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.13, inference) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.13, datasets) (push) Failing after 10s
Test Llama Stack Build / generate-matrix (push) Successful in 7s
Python Package Build Test / build (3.12) (push) Failing after 4s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 5s
Integration Tests / test-matrix (library, 3.13, post_training) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.13, tool_runtime) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.13, providers) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.13, scoring) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.13, vector_io) (push) Failing after 8s
Update ReadTheDocs / update-readthedocs (push) Failing after 4s
Unit Tests / unit-tests (3.12) (push) Failing after 6s
Unit Tests / unit-tests (3.13) (push) Failing after 5s
Test Llama Stack Build / build (push) Failing after 6s
Test Llama Stack Build / build-single-provider (push) Failing after 1m10s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 1m8s
Python Package Build Test / build (3.13) (push) Failing after 1m6s
Test External Providers / test-external-providers (venv) (push) Failing after 1m4s
Pre-commit / pre-commit (push) Successful in 2m33s
# What does this PR do?

Our starter distro required Ollama to be running (and a large list of
models available in that Ollama) to successfully start. This adjusts
things so that Ollama does not have to be running to use the starter
template / distro.

To accomplish this, a few changes were needed:

* The Ollama provider is now configurable whether it raises an Exception
or just logs a warning when it cannot reach the Ollama server on
startup. The default is to raise an exception (same as previous
behavior), but in the starter template we adjust this to just log a
warning so that we can bring the stack up without needing a running
Ollama server.

* The starter template no longer specifies a default list of models for
Ollama, as any models specified there need to actually be pulled and
available in Ollama. Instead, it adds a new
`OLLAMA_INFERENCE_MODEL` environment variable where users can provide an
optional model to register with the Ollama provider on startup.
Additional models can also be registered via the typical
`models.register(...)` at runtime.

* The vLLM template was adjusted to also allow an optional
`VLLM_INFERENCE_MODEL` specified on startup, so that the behavior
between vLLM and Ollama was consistent here to make it easy to get up
and running quickly.

* The default vector store was changed from sqlite-vec to faiss.
sqlite-vec can enabled via setting the `ENABLE_SQLITE_VEC` environment
variable, like we do for chromadb and pgvector. This is due to
sqlite-vec not shipping proper arm64 binaries, like we previously fixed
in #1530 for the ollama distribution.

## Test Plan

With this change, the following scenarios now work with the starter
template that did not before:

* no Ollama running
* Ollama running but not all of the Llama models pulled locally
* Ollama running with a custom model registered on startup
* vLLM running with a custom model registered on startup
* running the starter template on linux/arm64, like when running
containers on Mac without rosetta emulation

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-06-25 15:54:00 +02:00
Ben Browning
941f505eb0
feat: File search tool for Responses API (#2426)
# What does this PR do?

This is an initial working prototype of wiring up the `file_search`
builtin tool for the Responses API to our existing rag knowledge search
tool.

This is me seeing what I could pull together on top of the bits we
already have merged. This may not be the ideal way to implement this,
and things like how I shuffle the vector store ids from the original
response API tool request to the actual tool execution feel a bit hacky
(grep for `tool_kwargs["vector_db_ids"]` in `_execute_tool_call` to see
what I mean).

## Test Plan

I stubbed in some new tests to exercise this using text and pdf
documents.

Note that this is currently under tests/verification only because it
sometimes flakes with tool calling of the small Llama-3.2-3B model we
run in CI (and that I use as an example below). We'd want to make the
test a bit more robust in some way if we moved this over to
tests/integration and ran it in CI.

### OpenAI SaaS (to verify test correctness)

```
pytest -sv tests/verifications/openai_api/test_responses.py \
  -k 'file_search' \
  --base-url=https://api.openai.com/v1 \
  --model=gpt-4o
```

### Fireworks with faiss vector store

```
llama stack run llama_stack/templates/fireworks/run.yaml

pytest -sv tests/verifications/openai_api/test_responses.py \
  -k 'file_search' \
  --base-url=http://localhost:8321/v1/openai/v1 \
  --model=meta-llama/Llama-3.3-70B-Instruct
```

### Ollama with faiss vector store

This sometimes flakes on Ollama because the quantized small model
doesn't always choose to call the tool to answer the user's question.
But, it often works.

```
ollama run llama3.2:3b

INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run ./llama_stack/templates/ollama/run.yaml \
  --image-type venv \
  --env OLLAMA_URL="http://0.0.0.0:11434"

pytest -sv tests/verifications/openai_api/test_responses.py \
  -k'file_search' \
  --base-url=http://localhost:8321/v1/openai/v1 \
  --model=meta-llama/Llama-3.2-3B-Instruct
```

### OpenAI provider with sqlite-vec vector store

```
llama stack run ./llama_stack/templates/starter/run.yaml --image-type venv

 pytest -sv tests/verifications/openai_api/test_responses.py \
  -k 'file_search' \
  --base-url=http://localhost:8321/v1/openai/v1 \
  --model=openai/gpt-4o-mini
```

### Ensure existing vector store integration tests still pass

```
ollama run llama3.2:3b

INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run ./llama_stack/templates/ollama/run.yaml \
  --image-type venv \
  --env OLLAMA_URL="http://0.0.0.0:11434"

LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io \
  --text-model "meta-llama/Llama-3.2-3B-Instruct" \
  --embedding-model=all-MiniLM-L6-v2
```

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
2025-06-13 14:32:48 -04:00
Sébastien Han
6bb174bb05
revert: "chore: Remove zero-width space characters from OTEL service" (#2331)
# What does this PR do?

Revert #2060 and fix PLE2515.

---------

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-06-02 14:21:35 -07:00
ehhuang
2603f10f95
feat: support postgresql inference store (#2310)
Some checks failed
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 3s
Integration Tests / test-matrix (http, post_training) (push) Failing after 11s
Integration Tests / test-matrix (library, inference) (push) Failing after 13s
Integration Tests / test-matrix (http, providers) (push) Failing after 15s
Integration Tests / test-matrix (http, tool_runtime) (push) Failing after 16s
Integration Tests / test-matrix (http, datasets) (push) Failing after 18s
Integration Tests / test-matrix (http, scoring) (push) Failing after 16s
Integration Tests / test-matrix (http, agents) (push) Failing after 19s
Integration Tests / test-matrix (library, datasets) (push) Failing after 16s
Integration Tests / test-matrix (http, inspect) (push) Failing after 18s
Integration Tests / test-matrix (library, agents) (push) Failing after 18s
Integration Tests / test-matrix (http, inference) (push) Failing after 20s
Integration Tests / test-matrix (library, inspect) (push) Failing after 9s
Integration Tests / test-matrix (library, post_training) (push) Failing after 10s
Integration Tests / test-matrix (library, tool_runtime) (push) Failing after 8s
Test External Providers / test-external-providers (venv) (push) Failing after 8s
Integration Tests / test-matrix (library, scoring) (push) Failing after 9s
Integration Tests / test-matrix (library, providers) (push) Failing after 11s
Unit Tests / unit-tests (3.11) (push) Failing after 8s
Unit Tests / unit-tests (3.10) (push) Failing after 8s
Unit Tests / unit-tests (3.12) (push) Failing after 8s
Unit Tests / unit-tests (3.13) (push) Failing after 8s
Pre-commit / pre-commit (push) Successful in 57s
# What does this PR do?
* Added support postgresql inference store
* Added 'oracle' template that demos how to config postgresql stores
(except for telemetry, which is not supported currently)


## Test Plan

llama stack build --template oracle --image-type conda --run
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s -v tests/integration/
--text-model accounts/fireworks/models/llama-v3p3-70b-instruct -k
'inference_store'
2025-05-29 14:33:09 -07:00
ehhuang
5844c2da68
feat: add list responses API (#2233)
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in in place of the kv store.

## Test Plan
Added integration/unit tests.
2025-05-23 13:16:48 -07:00
ehhuang
549812f51e
feat: implement get chat completions APIs (#2200)
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.

## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>


Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
2025-05-21 22:21:52 -07:00
Charlie Doern
f02f7b28c1
feat: add huggingface post_training impl (#2132)
# What does this PR do?


adds an inline HF SFTTrainer provider. Alongside touchtune -- this is a
super popular option for running training jobs. The config allows a user
to specify some key fields such as a model, chat_template, device, etc

the provider comes with one recipe `finetune_single_device` which works
both with and without LoRA.

any model that is a valid HF identifier can be given and the model will
be pulled.

this has been tested so far with CPU and MPS device types, but should be
compatible with CUDA out of the box

The provider processes the given dataset into the proper format,
establishes the various steps per epoch, steps per save, steps per eval,
sets a sane SFTConfig, and runs n_epochs of training

if checkpoint_dir is none, no model is saved. If there is a checkpoint
dir, a model is saved every `save_steps` and at the end of training.


## Test Plan

re-enabled post_training integration test suite with a singular test
that loads the simpleqa dataset:
https://huggingface.co/datasets/llamastack/simpleqa and a tiny granite
model: https://huggingface.co/ibm-granite/granite-3.3-2b-instruct. The
test now uses the llama stack client and the proper post_training API

runs one step with a batch_size of 1. This test runs on CPU on the
Ubuntu runner so it needs to be a small batch and a single step.

[//]: # (## Documentation)

---------

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-16 14:41:28 -07:00
Charlie Doern
1ae61e8d5f
fix: replace all instances of --yaml-config with --config (#2196)
# What does this PR do?

start_stack.sh was using --yaml-config which is deprecated.

a bunch of distro docs also mentioned --yaml-config. Replaces all
instances and logic for --yaml-config with --config

resolves #2189

Signed-off-by: Charlie Doern <cdoern@redhat.com>
2025-05-16 14:31:12 -07:00
grs
b8f7e1504d
feat: allow the interface on which the server will listen to be configured (#2015)
# What does this PR do?

It may not always be desirable to listen on all interfaces, which is the
default. As an example, by listening instead only on a loopback
interface, the server cannot be reached except from within the host it
is run on. This PR makes this configurable, through a CLI option, an env
var or an entry on the config file.

## Test Plan

I ran a server with and without the added CLI argument to verify that
the argument is used if provided, but the default is as it was before if
not.

Signed-off-by: Gordon Sim <gsim@redhat.com>
2025-05-16 12:59:31 -07:00
Sébastien Han
26dffff92a
chore: remove pytest reports (#2156)
# What does this PR do?

Cleanup old test code too.

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-13 22:40:15 -07:00
Sébastien Han
6371bb1b33
chore(refact)!: simplify config management (#1105)
# What does this PR do?

We are dropping configuration via CLI flag almost entirely. If any
server configuration has to be tweak it must be done through the server
section in the run.yaml.

This is unfortunately a breaking change for whover was using:

* `--tls-*`
* `--disable_ipv6`

`--port` stays around and get a special treatment since we believe, it's
common for user dev to change port for quick experimentations.

Closes: https://github.com/meta-llama/llama-stack/issues/1076

## Test Plan

Simply do `llama stack run <config>` nothing should break :)

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-05-07 09:18:12 -07:00
Ashwin Bharambe
272d3359ee
fix: remove code interpeter implementation (#2087)
# What does this PR do?

The builtin implementation of code interpreter is not robust and has a
really weak sandboxing shell (the `bubblewrap` container). Given the
availability of better MCP code interpreter servers coming up, we should
use them instead of baking an implementation into the Stack and
expanding the vulnerability surface to the rest of the Stack.

This PR only does the removal. We will add examples with how to
integrate with MCPs in subsequent ones.

## Test Plan

Existing tests.
2025-05-01 14:35:08 -07:00
Sébastien Han
4412694018
chore: Remove zero-width space characters from OTEL service name env var defaults (#2060)
# What does this PR do?

Replaced `${env.OTEL_SERVICE_NAME:\u200B}` and similar variants with
properly formatted `${env.OTEL_SERVICE_NAME:}` across all YAML templates
and TelemetryConfig. This prevents silent parsing issues and ensures
consistent environment variable resolution.
Slipped in https://github.com/meta-llama/llama-stack/pull/2058

Signed-off-by: Sébastien Han <seb@redhat.com>
2025-04-30 17:56:46 +02:00
Roland Huß
5a2bfd6ad5
refactor: Replace SQLITE_DB_PATH by SQLITE_STORE_DIR env in templates (#2055)
# What does this PR do?

The telemetry provider configs is the only one who leverages the env var
`SQLITE_DB_PATH` for pointing to persistent data in the respective
templates, whereas usually `SQLITE_STORE_DIR` is used.

This PR modifies the `sqlite_db_path` in various telemetry configuration
files to use the environment variable `SQLITE_STORE_DIR` instead of
`SQLITE_DB_PATH`. This change ensures that _only_ the SQLITE_STORE_DIR
needs to be set to point to a different persistence location for
providers.

All references to `SQLITE_DB_PATH` have been removed.

Another improvement could be to move `sqlite_db_path` to `db_path` in
the telemetry provider config, to align with the other provider
configurations. That could be done by another PR (if wanted).
2025-04-29 15:28:10 -07:00
ehhuang
2f38851751
chore: Revert "chore(telemetry): remove service_name entirely" (#1785)
Reverts meta-llama/llama-stack#1755 closes #1781
2025-03-25 14:42:05 -07:00
ehhuang
b9fbfed216
chore(telemetry): remove service_name entirely (#1755)
# What does this PR do?


## Test Plan

LLAMA_STACK_CONFIG=dev pytest -s -v
tests/integration/agents/test_agents.py::test_custom_tool
--safety-shield meta-llama/Llama-Guard-3-8B --text-model
accounts/fireworks/models/llama-v3p1-8b-instruct

and verify trace in jaeger UI
https://llama-stack.readthedocs.io/en/latest/building_applications/telemetry.html#
2025-03-21 15:11:56 -07:00
ehhuang
34f89bfbd6
feat(telemetry): use zero-width space to avoid clutter (#1754)
# What does this PR do?
Before 
<img width="858" alt="image"
src="https://github.com/user-attachments/assets/6cefb1ae-5603-4818-85ea-a0c337b986bc"
/>

Note the redundant 'llama-stack' in front of every span

## Test Plan
<img width="1171" alt="image"
src="https://github.com/user-attachments/assets/bdc5fd5b-ff1f-4f10-8b40-cff2ea93dd1f"
/>
2025-03-21 12:02:10 -07:00
Hardik Shah
127bac6869
fix: Default to port 8321 everywhere (#1734)
As titled, moved all instances of 5001 to 8321
2025-03-20 15:50:41 -07:00
Hardik Shah
581e8ae562
fix: docker run with --pull always to fetch the latest image (#1733)
As titled
2025-03-20 15:35:48 -07:00
Ashwin Bharambe
d072b5fa0c
test: add unit test to ensure all config types are instantiable (#1601) 2025-03-12 22:29:58 -07:00
Ashwin Bharambe
dc84bc755a
fix: revert to using faiss for ollama distro (#1530)
This is unfortunate because `sqlite-vec` seems promising. But its PIP
package is not quite complete. It does not have binary for arm64 (I
think, or maybe it even lacks 64 bit builds?) which results in the arm64
container resulting in
```
File "/usr/local/lib/python3.10/site-packages/sqlite_vec/init.py", line 17, in load
    conn.load_extension(loadable_path())
sqlite3.OperationalError: /usr/local/lib/python3.10/site-packages/sqlite_vec/vec0.so: wrong ELF class: ELFCLASS32
```

To get around I tried to install from source via `uv pip install
sqlite-vec --no-binary=sqlite-vec` however it even lacks a source
distribution which makes that impossible.

## Test Plan

Build the container locally using: 

```bash
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type container
```

Run the container as: 

```
podman run --privileged -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
   -v ~/.llama:/root/.llama \
    --env INFERENCE_MODEL=$INFERENCE_MODEL \
    --env OLLAMA_URL=http://host.containers.internal:11434 \
    -v ~/local/llama-stack:/app/llama-stack-source 
    localhost/distribution-ollama:dev --port $LLAMA_STACK_PORT
```

Verify the container starts up correctly. Without this patch, it would
encounter the ELFCLASS32 error.
2025-03-10 16:15:17 -07:00
Reid
0b8cb830b9
docs: update ollama doc url (#1508)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

It should changed in this pr
https://github.com/meta-llama/llama-stack/pull/1190/files#diff-53e3f35ced54ee5e57dc8b0d3b04770ed84f2f6434c6f492f42569b3c2810ecd

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-03-10 13:04:59 -07:00
Ashwin Bharambe
abfbaf3c1b
refactor(test): move tools, evals, datasetio, scoring and post training tests (#1401)
All of the tests from `llama_stack/providers/tests/` are now moved to
`tests/integration`.

I converted the `tools`, `scoring` and `datasetio` tests to use API.
However, `eval` and `post_training` proved to be a bit challenging to
leaving those. I think `post_training` should be relatively
straightforward also.

As part of this, I noticed that `wolfram_alpha` tool wasn't added to
some of our commonly used distros so I added it. I am going to remove a
lot of code duplication from distros next so while this looks like a
one-off right now, it will go away and be there uniformly for all
distros.
2025-03-04 14:53:47 -08:00
Ashwin Bharambe
dd0db8038b
refactor(test): unify vector_io tests and make them configurable (#1398)
## Test Plan


`LLAMA_STACK_CONFIG=inference=sentence-transformers,vector_io=sqlite-vec
pytest -s -v test_vector_io.py --embedding-model all-miniLM-L6-V2
--inference-model='' --vision-inference-model=''`

```
test_vector_io.py::test_vector_db_retrieve[txt=:vis=:emb=all-miniLM-L6-V2] PASSED
test_vector_io.py::test_vector_db_register[txt=:vis=:emb=all-miniLM-L6-V2] PASSED
test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case0] PASSED
test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case1] PASSED
test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case2] PASSED
test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case3] PASSED
test_vector_io.py::test_insert_chunks[txt=:vis=:emb=all-miniLM-L6-V2-test_case4] PASSED
```

Same thing with:
- LLAMA_STACK_CONFIG=inference=sentence-transformers,vector_io=faiss
- LLAMA_STACK_CONFIG=fireworks

(Note that ergonomics will soon be improved re: cmd-line options and env
variables)
2025-03-04 13:37:45 -08:00
Ashwin Bharambe
4c8a0fa8dc fix: ensure ollama embedding model is registered properly in the template 2025-02-27 22:49:06 -08:00
Reid
c2d2a80b0a
docs: update the output of llama-stack-client models list (#1271)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

[//]: # (## Documentation)

Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
2025-02-27 16:46:38 -08:00
Ashwin Bharambe
04de2f84e9
fix: register provider model name and HF alias in run.yaml (#1304)
Each model known to the system has two identifiers: 

- the `provider_resource_id` (what the provider calls it) -- e.g.,
`accounts/fireworks/models/llama-v3p1-8b-instruct`
- the `identifier` (`model_id`) under which it is registered and gets
routed to the appropriate provider.

We have so far used the HuggingFace repo alias as the standardized
identifier you can use to refer to the model. So in the above example,
we'd use `meta-llama/Llama-3.1-8B-Instruct` as the name under which it
gets registered. This makes it convenient for users to refer to these
models across providers.

However, we forgot to register the _actual_ provider model ID also. You
should be able to route via `provider_resource_id` also, of course.

This change fixes this (somewhat grave) omission.

*Note*: this change is additive -- more aliases work now compared to
before.

## Test Plan

Run the following for distro=(ollama fireworks together)
```
LLAMA_STACK_CONFIG=$distro \
   pytest -s -v tests/client-sdk/inference/test_text_inference.py \
   --inference-model=meta-llama/Llama-3.1-8B-Instruct --vision-inference-model=""
```
2025-02-27 16:39:23 -08:00
Shrey
30ef1c3680
feat: Add model context protocol tools with ollama provider (#1283)
# What does this PR do?
Model context protocol (MCP) allows for remote tools to be connected
with Agents. The current Ollama provider does not support it. This PR
adds necessary code changes to ensure that the integration between
Ollama backend and MCP works.

This PR is an extension of #816 for Ollama. 

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]

1. Run llama-stack server with the command:
```
llama stack build --template ollama --image-type conda
llama stack run ./templates/ollama/run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://localhost:11434
```

2. Run the sample client agent with MCP tool:
```
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.types.shared_params.url import URL
from llama_stack_client import LlamaStackClient
from termcolor import cprint

## Start the local MCP server
# git clone https://github.com/modelcontextprotocol/python-sdk
# Follow instructions to get the env ready
# cd examples/servers/simple-tool
# uv run mcp-simple-tool --transport sse --port 8000

# Connect to the llama stack server
base_url="http://localhost:8321"
model_id="meta-llama/Llama-3.2-3B-Instruct"
client = LlamaStackClient(base_url=base_url)


# Register MCP tools
client.toolgroups.register(
    toolgroup_id="mcp::filesystem",
    provider_id="model-context-protocol",
    mcp_endpoint=URL(uri="http://localhost:8000/sse"))

# Define an agent with MCP toolgroup 
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    toolgroups=["mcp::filesystem"],
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Fetch content from https://www.google.com and print the response"
]

# Run a session with the agent
session_id = agent.create_session("test-session")
for prompt in user_prompts:
    cprint(f"User> {prompt}", "green")
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()
```
# Documentation
The file docs/source/distributions/self_hosted_distro/ollama.md is
updated to indicate the MCP tool runtime availability.

Signed-off-by: Shreyanand <shanand@redhat.com>
2025-02-26 15:38:18 -08:00
Ashwin Bharambe
992f865b2e
chore: move embedding deps to RAG tool where they are needed (#1210)
`EMBEDDING_DEPS` were wrongly associated with `vector_io` providers.
They are needed by
https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/utils/memory/vector_store.py#L142
and related code and is used by the RAG tool and as such should only be
needed by the `inline::rag-runtime` provider.
2025-02-21 11:33:41 -08:00
Ashwin Bharambe
11697f85c5
fix: pull ollama embedding model if necessary (#1209)
Embedding models are tiny and can be pulled on-demand. Let's do that so
the user doesn't have to do "yet another thing" to get themselves set
up.

Thanks @hardikjshah for the suggestion.

Also fixed a build dependency miss (TODO: distro_codegen needs to
actually check that the build template contains all providers mentioned
for the run.yaml file)

## Test Plan 

First run `ollama rm all-minilm:latest`. 

Run `llama stack build --template ollama && llama stack run ollama --env
INFERENCE_MODEL=llama3.2:3b-instruct-fp16`. See that it outputs a
"Pulling embedding model `all-minilm:latest`" output and the stack
starts up correctly. Verify that `ollama list` shows the model is
correctly downloaded.
2025-02-21 10:35:56 -08:00
Ashwin Bharambe
35ae0e16a1 Fix sqlite_vec config defaults 2025-02-20 17:50:33 -08:00
Ashwin Bharambe
9436dd570d
feat: register embedding models for ollama, together, fireworks (#1190)
# What does this PR do?

We have support for embeddings in our Inference providers, but so far we
haven't done the final step of actually registering the known embedding
models and making sure they are extremely easy to use. This is one step
towards that.

## Test Plan

Run existing inference tests.

```bash

$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
   --inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$  pytest -s -v -k together test_embeddings.py \
   --inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
   --inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```

The value of the EMBEDDING_DIMENSION isn't actually used in these tests,
it is merely used by the test fixtures to check if the model is an LLM
or Embedding.
2025-02-20 15:39:08 -08:00
Ashwin Bharambe
07ccf908f7 ModelAlias -> ProviderModelEntry 2025-02-20 14:02:36 -08:00
Ben Browning
e9b8259cf9
fix: Get distro_codegen.py working with default deps and enabled in pre-commit hooks (#1123)
# What does this PR do?

Before this change, `distro_codegen.py` would only work if the user
manually installed multiple provider-specific dependencies (see #1122).
Now, users can run `distro_codegen.py` without any provider-specific
dependencies because we avoid importing the entire provider
implementations just to get the config needed to build the provider
template.

Concretely, this mostly means moving the
MODEL_ALIASES (and related variants) definitions to a new models.py
class within the provider implementation for those providers that
require additional dependencies. It also meant moving a couple of
imports from top-level imports to inside `get_adapter_impl` for some
providers, which follows the pattern used by multiple existing
providers.

To ensure we don't regress and accidentally add new imports that cause
distro_codegen.py to fail, the stubbed-in pre-commit hook for
distro_codegen.py was uncommented and slightly tweaked to run via `uv
run python ...` to ensure it runs with only the project's default
dependencies and to run automatically instead of manually.

Lastly, this updates distro_codegen.py itself to keep track of paths it
might have changed and to only `git diff` those specific paths when
checking for changed files instead of doing a diff on the entire working
tree. The latter was overly broad and would require a user have no other
unstaged changes in their working tree, even if those unstaged changes
were unrelated to generated code. Now it only flags uncommitted changes
for paths distro_codegen.py actually writes to.

Our generated code was also out-of-date, presumably because of these
issues, so this commit also has some updates to the generated code
purely because it was out of sync, and the pre-commit hook now enforces
things to be updated.

(Closes #1122)

## Test Plan

I manually tested distro_codegen.py and the pre-commit hook to verify
those work as expected, flagging any uncommited changes and catching any
imports that attempt to pull in provider-specific dependencies.

However, I do not have valid api keys to the impacted provider
implementations, and am unable to easily run the inference tests against
each changed provider. There are no functional changes to the provider
implementations here, but I'd appreciate a second set of eyes on the
changed import statements and moving of MODEL_ALIASES type code to a
separate models.py to ensure I didn't make any obvious errors.

---------

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
2025-02-19 18:39:20 -08:00
Xi Yan
8b655e3cd2
fix!: update eval-tasks -> benchmarks (#1032)
# What does this PR do?

- Update `/eval-tasks` to `/benchmarks`
- ⚠️ Remove differentiation between `app` v.s. `benchmark` eval task
config. Now we only have `BenchmarkConfig`. The overloaded `benchmark`
is confusing and do not add any value. Backward compatibility is being
kept as the "type" is not being used anywhere.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
- This change is backward compatible 
- Run notebook test with

```
pytest -v -s --nbval-lax ./docs/getting_started.ipynb
pytest -v -s --nbval-lax ./docs/notebooks/Llama_Stack_Benchmark_Evals.ipynb
```

<img width="846" alt="image"
src="https://github.com/user-attachments/assets/d2fc06a7-593a-444f-bc1f-10ab9b0c843d"
/>



[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

---------

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Co-authored-by: Ben Browning <ben324@gmail.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu <reid201711@gmail.com>
Co-authored-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-13 16:40:58 -08:00
Yuan Tang
8ff27b58fa
chore: Consistent naming for VectorIO providers (#1023)
# What does this PR do?

This changes all VectorIO providers classes to follow the pattern
`<ProviderName>VectorIOConfig` and `<ProviderName>VectorIOAdapter`. All
API endpoints for VectorIOs are currently consistent with `/vector-io`.

Note that API endpoint for VectorDB stay unchanged as `/vector-dbs`. 

## Test Plan

I don't have a way to test all providers. This is a simple renaming so
things should work as expected.

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-13 13:15:49 -05:00
Francisco Arceo
119fe8742a
feat: Adding sqlite-vec as a vectordb (#1040)
# What does this PR do?
This PR adds `sqlite_vec` as an additional inline vectordb.

Tested with `ollama` by adding the `vector_io` object in
`./llama_stack/templates/ollama/run.yaml` :

```yaml
  vector_io:
  - provider_id: sqlite_vec
    provider_type: inline::sqlite_vec
    config:
      kvstore:
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/sqlite_vec.db
      db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/sqlite_vec.db
```
I also updated the `./tests/client-sdk/vector_io/test_vector_io.py` test
file with:
```python
INLINE_VECTOR_DB_PROVIDERS = ["faiss", "sqlite_vec"]
```
And parameterized the relevant tests. 

[//]: # (If resolving an issue, uncomment and update the line below)
# Closes 
https://github.com/meta-llama/llama-stack/issues/1005

## Test Plan
I ran the tests with:
```bash
INFERENCE_MODEL=llama3.2:3b-instruct-fp16 LLAMA_STACK_CONFIG=ollama pytest -s -v tests/client-sdk/vector_io/test_vector_io.py
```
Which outputs:
```python
...
PASSED
tests/client-sdk/vector_io/test_vector_io.py::test_vector_db_retrieve[all-MiniLM-L6-v2-sqlite_vec] PASSED
tests/client-sdk/vector_io/test_vector_io.py::test_vector_db_list PASSED
tests/client-sdk/vector_io/test_vector_io.py::test_vector_db_register[all-MiniLM-L6-v2-faiss] PASSED
tests/client-sdk/vector_io/test_vector_io.py::test_vector_db_register[all-MiniLM-L6-v2-sqlite_vec] PASSED
tests/client-sdk/vector_io/test_vector_io.py::test_vector_db_unregister[faiss] PASSED
tests/client-sdk/vector_io/test_vector_io.py::test_vector_db_unregister[sqlite_vec] PASSED
```

In addition, I ran the `rag_with_vector_db.py`
[example](https://github.com/meta-llama/llama-stack-apps/blob/main/examples/agents/rag_with_vector_db.py)
using the script below with `uv run rag_example.py`.
<details>
<summary>CLICK TO SHOW SCRIPT 👋  </summary>

```python
#!/usr/bin/env python3
import os
import uuid
from termcolor import cprint

# Set environment variables
os.environ['INFERENCE_MODEL'] = 'llama3.2:3b-instruct-fp16'
os.environ['LLAMA_STACK_CONFIG'] = 'ollama'

# Import libraries after setting environment variables
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.types import Document


def main():
    # Initialize the client
    client = LlamaStackAsLibraryClient("ollama")
    vector_db_id = f"test-vector-db-{uuid.uuid4().hex}"

    _ = client.initialize()

    model_id = 'llama3.2:3b-instruct-fp16'

    # Define the list of document URLs and create Document objects
    urls = [
        "chat.rst",
        "llama3.rst",
        "memory_optimizations.rst",
        "lora_finetune.rst",
    ]
    documents = [
        Document(
            document_id=f"num-{i}",
            content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
            mime_type="text/plain",
            metadata={},
        )
        for i, url in enumerate(urls)
    ]
    # (Optional) Use the documents as needed with your client here

    client.vector_dbs.register(
        provider_id='sqlite_vec',
        vector_db_id=vector_db_id,
        embedding_model="all-MiniLM-L6-v2",
        embedding_dimension=384,
    )

    client.tool_runtime.rag_tool.insert(
        documents=documents,
        vector_db_id=vector_db_id,
        chunk_size_in_tokens=512,
    )
    # Create agent configuration
    agent_config = AgentConfig(
        model=model_id,
        instructions="You are a helpful assistant",
        enable_session_persistence=False,
        toolgroups=[
            {
                "name": "builtin::rag",
                "args": {
                    "vector_db_ids": [vector_db_id],
                }
            }
        ],
    )

    # Instantiate the Agent
    agent = Agent(client, agent_config)

    # List of user prompts
    user_prompts = [
        "What are the top 5 topics that were explained in the documentation? Only list succinct bullet points.",
        "Was anything related to 'Llama3' discussed, if so what?",
        "Tell me how to use LoRA",
        "What about Quantization?",
    ]

    # Create a session for the agent
    session_id = agent.create_session("test-session")

    # Process each prompt and display the output
    for prompt in user_prompts:
        cprint(f"User> {prompt}", "green")
        response = agent.create_turn(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            session_id=session_id,
        )
        # Log and print events from the response
        for log in EventLogger().log(response):
            log.print()


if __name__ == "__main__":
    main()
```
</details>

Which outputs a large summary of RAG generation.

# Documentation

Will handle documentation updates in follow-up PR.

# (- [ ] Added a Changelog entry if the change is significant)

---------

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
2025-02-12 10:50:03 -08:00
Ellis Tarn
36d35406a7
fix: a bad newline in ollama docs (#1036)
# What does this PR do?
Catches a bug in the previous codegen which was removing newlines.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
```
python llama_stack/scripts/distro_codegen.py
```

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)
2025-02-10 14:27:17 -08:00
Ellis Tarn
ab9516c789
fix: Gaps in doc codegen (#1035)
# What does this PR do?
Catches docs up to source with:
```
python llama_stack/scripts/distro_codegen.py
```

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
Manually checked
```
sphinx-autobuild docs/source build/html
```

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)
2025-02-10 13:24:15 -08:00
Ashwin Bharambe
5b1e69e58e
Use uv pip install instead of pip install (#921)
## What does this PR do? 

See issue: #747 -- `uv` is just plain better. This PR does the bare
minimum of replacing `pip install` by `uv pip install` and ensuring `uv`
exists in the environment.

## Test Plan 

First: create new conda, `uv pip install -e .` on `llama-stack` -- all
is good.
Next: run `llama stack build --template together` followed by `llama
stack run together` -- all good
Next: run `llama stack build --template together --image-name yoyo`
followed by `llama stack run together --image-name yoyo` -- all good
Next: fresh conda and `uv pip install -e .` and `llama stack build
--template together --image-type venv` -- all good.

Docker: `llama stack build --template together --image-type container`
works!
2025-01-31 22:29:41 -08:00
Hardik Shah
2cebb24d3a
Update doc templates for running safety on self-hosted templates (#874) 2025-01-24 11:28:20 -08:00
Ashwin Bharambe
4d7c8c797f Kill colons 2025-01-22 22:54:13 -08:00
Ashwin Bharambe
f3d8864c36 Rename builtin::memory -> builtin::rag 2025-01-22 20:22:51 -08:00
Sixian Yi
597869a2aa
add distro report (#847)
# What does this PR do?

Generate distro reports to cover inference, agents, and vector_io. 


## Test Plan

Report generated through `/opt/miniconda3/envs/stack/bin/pytest -s -v
tests/client-sdk/ --report`


## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-01-22 19:20:49 -08:00
Ashwin Bharambe
c9e5578151
[memory refactor][5/n] Migrate all vector_io providers (#835)
See https://github.com/meta-llama/llama-stack/issues/827 for the broader
design.

This PR finishes off all the stragglers and migrates everything to the
new naming.
2025-01-22 10:17:59 -08:00
Yuan Tang
6da3053c0e
More generic image type for OCI-compliant container technologies (#802)
It's a more generic term and applicable to alternatives of Docker, such
as Podman or other OCI-compliant technologies.

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-01-17 16:37:42 -08:00
Dinesh Yeduguru
73215460ba
add default toolgroups to all providers (#795)
# What does this PR do?

Add toolgroup defs to all the distribution templates
2025-01-16 16:54:59 -08:00
Xi Yan
d1f3b032c9
cerebras template update for memory (#792)
# What does this PR do?

- we no longer have meta-reference as memory provider, update cerebras
template


## Test Plan

```
python llama_stack/scripts/distro_codegen.py
```

## Sources

Please link relevant resources if necessary.


## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
      Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
2025-01-16 16:07:53 -08:00