# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds support for the new Streamable HTTP transport for MCP, as
well as falling back to the SSE protocol if the Streamable HTTP
connection fails.
<!-- If resolving an issue, uncomment and update the line below -->
Closes#2542
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Calum Murray <cmurray@redhat.com>
# What does this PR do?
currently `print` is being used with custom formatting to achieve
telemetry output in the console_span_processor
This causes telemetry not to show up in log files when using
`LLAMA_STACK_LOG_FILE`. During testing it looks like telemetry is not
being captured when it is
switch to using Rich formatting with the logger and then strip the
formatting off when a log file is being used so the formatting looks
normal
## Test Plan
before:
console:
<img width="967" height="127" alt="Screenshot 2025-07-21 at 4 02 15 PM"
src="https://github.com/user-attachments/assets/b09518cc-9d38-4970-9877-70e2c41fcbb5"
/>
log file (no telemetry):
```
2025-07-21 16:01:32,481 llama_stack.providers.remote.inference.ollama.ollama:117 inference: checking connectivity to Ollama at `http://localhost:11434`...
2025-07-21 16:01:34,779 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
2025-07-21 16:01:35,083 __main__:587 server: Listening on ['::', '0.0.0.0']:8321
2025-07-21 16:01:35,091 uvicorn.error:84 uncategorized: Started server process [68679]
2025-07-21 16:01:35,091 uvicorn.error:48 uncategorized: Waiting for application startup.
2025-07-21 16:01:35,092 __main__:163 server: Starting up
2025-07-21 16:01:35,092 uvicorn.error:62 uncategorized: Application startup complete.
2025-07-21 16:01:35,092 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
2025-07-21 16:01:37,167 uvicorn.access:473 uncategorized: 127.0.0.1:53145 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
```
after:
console:
<img width="797" height="165" alt="Screenshot 2025-07-22 at 3 28 44 PM"
src="https://github.com/user-attachments/assets/44d40e3b-6502-439d-9ea5-38058b289962"
/>
log file:
```
2025-07-21 15:59:51,481 llama_stack.providers.remote.inference.ollama.ollama:117 inference: checking connectivity to Ollama at `http://localhost:11434`...
2025-07-21 15:59:53,801 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
2025-07-21 15:59:54,059 __main__:587 server: Listening on ['::', '0.0.0.0']:8321
2025-07-21 15:59:54,066 uvicorn.error:84 uncategorized: Started server process [68578]
2025-07-21 15:59:54,067 uvicorn.error:48 uncategorized: Waiting for application startup.
2025-07-21 15:59:54,067 __main__:163 server: Starting up
2025-07-21 15:59:54,067 uvicorn.error:62 uncategorized: Application startup complete.
2025-07-21 15:59:54,068 uvicorn.error:216 uncategorized: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
2025-07-21 15:59:55,381 [TELEMETRY] 19:59:55.381 /v1/openai/v1/chat/completions
2025-07-21 15:59:55,619 uvicorn.access:473 uncategorized: 127.0.0.1:53102 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
2025-07-21 15:59:55,621 [TELEMETRY] 19:59:55.621 /v1/openai/v1/chat/completions [StatusCode.OK] (240.07ms)
2025-07-21 15:59:55,622 [TELEMETRY] 19:59:55.620 127.0.0.1:53102 - "POST /v1/openai/v1/chat/completions HTTP/1.1" 200
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
This flips #2823 and #2805 by making the Stack periodically query the
providers for models rather than the providers going behind the back and
calling "register" on to the registry themselves. This also adds support
for model listing for all other providers via `ModelRegistryHelper`.
Once this is done, we do not need to manually list or register models
via `run.yaml` and it will remove both noise and annoyance (setting
`INFERENCE_MODEL` environment variables, for example) from the new user
experience.
In addition, it adds a configuration variable `allowed_models` which can
be used to optionally restrict the set of models exposed from a
provider.
# What does this PR do?
openai/models.py has backward compat entries for litellm model names.
the starter template includes these in the list of registered models.
the inclusion results in duplicate model registrations.
the backward compat is no longer necessary.
## Test Plan
ci
# What does this PR do?
This PR implements the openai compatible endpoints for chromadb
Closes#2462
## Test Plan
Ran ollama llama stack server and ran the command
`pytest -sv --stack-config=http://localhost:8321
tests/integration/vector_io/test_openai_vector_stores.py
--embedding-model all-MiniLM-L6-v2`
8 failed, 27 passed, 8 skipped, 1 xfailed
The failed ones are regarding files api
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: sarthakdeshpande <sarthak.deshpande@engati.com>
Co-authored-by: Francisco Javier Arceo <farceo@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Moving vector store and vector store files helper methods to
`openai_vector_store_mixin.py`
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
The tests are already supported in the CI and tests the inline providers
and current integration tests.
Note that the `vector_index` fixture will be test `milvus_vec_adapter`,
`faiss_vec_adapter`, and `sqlite_vec_adapter` in
`tests/unit/providers/vector_io/test_vector_io_openai_vector_stores.py`.
Additionally, the integration tests in `integration-vector-io-tests.yml`
runs `tests/integration/vector_io` tests for the following providers:
```python
vector-io-provider: ["inline::faiss", "inline::sqlite-vec", "inline::milvus", "remote::chromadb", "remote::pgvector"]
```
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
add an `OpenAIMixin` for use by inference providers who remote endpoints
support an OpenAI compatible API.
use is demonstrated by refactoring
- OpenAIInferenceAdapter
- NVIDIAInferenceAdapter (adds embedding support)
- LlamaCompatInferenceAdapter
## Test Plan
existing unit and integration tests
This PR updates model registration and lookup behavior to be slightly
more general / flexible. See
https://github.com/meta-llama/llama-stack/issues/2843 for more details.
Note that this change is backwards compatible given the design of the
`lookup_model()` method.
## Test Plan
Added unit tests
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR fixes flaky telemetry tests
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
See https://github.com/meta-llama/llama-stack/pull/2814
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
chore: Making name optional in openai_create_vector_store
# Closes https://github.com/meta-llama/llama-stack/issues/2706
## Test Plan
CI and unit tests
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Ensures that session turns retrieved from the agent persistence layer
are sorted by their `started_at` timestamp, as the key-value store does
not guarantee order.
Closes#2852
## Test Plan
- [ ] Add unit tests
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
minor update of the pgvector doc, changing 'faiss' to 'pgvector'
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Just like #2805 but for vLLM.
We also make VLLM_URL env variable optional (not required) -- if not
specified, the provider silently sits idle and yells eventually if
someone tries to call a completion on it. This is done so as to allow
this provider to be present in the `starter` distribution.
## Test Plan
Set up vLLM, copy the starter template and set `{ refresh_models: true,
refresh_models_interval: 10 }` for the vllm provider and then run:
```
ENABLE_VLLM=vllm VLLM_URL=http://localhost:8000/v1 \
uv run llama stack run --image-type venv /tmp/starter.yaml
```
Verify that `llama-stack-client models list` brings up the model
correctly from vLLM.
Inline _inference_ providers haven't proved to be very useful -- they
are rarely used. And for good reason -- it is almost never a good idea
to include a complex (distributed) inference engine bundled into a
distributed stateful front-end server serving many other things.
Responsibility should be split properly.
See Discord discussion:
1395849853
For self-hosted providers like Ollama (or vLLM), the backing server is
running a set of models. That server should be treated as the source of
truth and the Stack registry should just be a cache for those models. Of
course, in production environments, you may not want this (because you
know what model you are running statically) hence there's a config
boolean to control this behavior.
_This is part of a series of PRs aimed at removing the requirement of
needing to set `INFERENCE_MODEL` env variables for running Llama Stack
server._
## Test Plan
Copy and modify the starter.yaml template / config and enable
`refresh_models: true, refresh_models_interval: 10` for the ollama
provider. Then, run:
```
LLAMA_STACK_LOGGING=all=debug \
ENABLE_OLLAMA=ollama uv run llama stack run --image-type venv /tmp/starter.yaml
```
See a gargantuan amount of logs, but verify that the provider is
periodically refreshing models. Stop and prune a model from ollama
server, restart the server. Verify that the model goes away when I call
`uv run llama-stack-client models list`
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
let's users register models available at
https://integrate.api.nvidia.com/v1/models that isn't already in
llama_stack/providers/remote/inference/nvidia/models.py
## Test Plan
1. run the nvidia distro
2. register a model from https://integrate.api.nvidia.com/v1/models that
isn't already know, as of this writing
nvidia/llama-3.1-nemotron-ultra-253b-v1 is a good example
3. perform inference w/ the model
# What does this PR do?
Resolves https://github.com/meta-llama/llama-stack/issues/2770. It
replaces characters in SQLite table names that are not alphanumeric or
underscores with underscores and quotes the table names with square
brackets in SQL statements.
Closes #[2770]
## Test Plan
I added a ".123" suffix to the bank_id on the following line
```
index = await SQLiteVecIndex.create(dimension=embedding_dimension, db_path=db_path, bank_id="test_bank.123")
```
in tests/unit/providers/vector_io/test_sqlite_vec.py, which, without the
fix in place, demonstrates the issue.
# What does this PR do?
this was causing an unnessessary logger warning
## Test Plan
Run `LLAMA_STACK_DIR=. ENABLE_OLLAMA=ollama
OLLAMA_INFERENCE_MODEL=llama3.2:3b llama stack build --template starter
--image-type venv --run` and then `Crtl-C` to shutdown
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
The vision models are now available at the standard URL, so the
workaround code has been removed. This also simplifies the codebase by
eliminating the need for per-model client caching.
- Remove special URL handling for meta/llama-3.2-11b/90b-vision-instruct
models
- Convert _get_client method to _client property for cleaner API
- Remove unnecessary lru_cache decorator and functools import
- Simplify client creation logic to use single base URL for all models
# What does this PR do?
Adding OpenAI Vector Stores Files API compatibility for PGVector
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Updated CI to include PGVector
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Resolves https://github.com/meta-llama/llama-stack/issues/2735
Currently, if you test against OpenAI's Vector Stores API the
`client.vector_stores.search` call fails with an invalid vector_db
during routing (see the script referenced in the clickable item under
the Test Plan section).
This PR ensures that `client.vector_stores.search()` is compatible with
OpenAI's Vector Stores API.
Two biggest changes:
1. The `name`, which was previously used as the `vector_db_id`, has been
changed to be consistent with OpenAI's `vs_{uuid}` format.
2. The vector store ID has to be referenced by the ID, the name is not
reliable as every `client.vector_stores.create` results in a new vector
store.
NOTE: I believe this is a breaking change for end users as they'll need
to update their VectorDB identifiers.
## Test Plan
Unit tests:
```bash
./scripts/unit-tests.sh tests/unit/providers/vector_io/ -v
```
Integration tests:
```bash
ENABLE_MILVUS=milvus llama stack run /Users/farceo/dev/llama-stack/llama_stack/templates/starter/run.yaml --image-type venv
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/vector_io/test_openai_vector_stores.py --embedding-model=all-MiniLM-L6-v2 -vv
```
Unit tests and test script below 👇
<details>
<summary>Click here for script used to test OpenAI and Llama Stack
Vector Store implementation</summary>
```python
import json
import argparse
from openai import OpenAI, pagination
import logging
from colorama import Fore, Style, init
import traceback
import os
# Initialize colorama for color support in terminal
init(autoreset=True)
# Setup basic logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
DEMO_VECTOR_STORE_NAME = "Support FAQ FJA"
global DEMO_VECTOR_STORE_ID
global DEMO_VECTOR_STORE_ID2
def colored_print(color, text):
"""Prints text to the console with the specified color."""
print(f"{color}{text}{Style.RESET_ALL}")
def log_and_print(color, message, level=logging.INFO):
"""Logs a message and prints it to the console with the specified color."""
logging.log(level, message)
colored_print(color, message)
def run_tests(client, prefix="openai"):
"""
Runs all tests using the provided OpenAI client and saves the output
to JSON files with the given prefix.
"""
# Create the directory if it doesn't exist
os.makedirs('openai_testing', exist_ok=True)
# Default values in case tests fail
global DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2
DEMO_VECTOR_STORE_ID = None
DEMO_VECTOR_STORE_ID2 = None
def test_idempotent_vector_store_creation():
"""
Test that creating a vector store with the same name is idempotent.
"""
log_and_print(Fore.BLUE, "Starting vector store creation test...")
try:
vector_store = client.vector_stores.create(
name=DEMO_VECTOR_STORE_NAME,
)
# Attempt to create the same vector store again
vector_store2 = client.vector_stores.create(
name=DEMO_VECTOR_STORE_NAME,
)
# Check instead of assert
if vector_store2.id != vector_store.id:
log_and_print(Fore.YELLOW, f"FAILED IDEMPOTENCY: the same VectorStore name for {prefix.upper()} does not return the same ID",
level=logging.WARNING)
else:
log_and_print(Fore.GREEN, f"PASSED IDEMPOTENCY: f{vector_store2.id} == {vector_store.id} the same VectorStore name for {prefix.upper()} returns the same ID")
vector_store_data = vector_store.to_dict()
log_and_print(Fore.WHITE, f"vector_stores.create = {json.dumps(vector_store_data, indent=2)}")
with open(f'openai_testing/{prefix}_vector_store_create.json', 'w') as f:
json.dump(vector_store_data, f, indent=2)
global DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2
DEMO_VECTOR_STORE_ID = vector_store.id
DEMO_VECTOR_STORE_ID2 = vector_store2.id
return DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2
except Exception as e:
log_and_print(Fore.RED, f"Idempotent vector store creation test failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
# Create a fallback vector store ID if needed
if 'vector_store' in locals() and vector_store:
DEMO_VECTOR_STORE_ID = vector_store.id
return DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2
def test_vector_store_list():
"""
Test listing vector stores.
"""
log_and_print(Fore.BLUE, "Starting vector store list test...")
try:
vector_stores = client.vector_stores.list()
# Check instead of assert
if not isinstance(vector_stores, pagination.SyncCursorPage):
log_and_print(Fore.YELLOW, f"FAILED: Expected a list of vector stores, got {type(vector_stores)}",
level=logging.WARNING)
else:
log_and_print(Fore.GREEN, "Vector store list test passed!")
vector_stores_data = vector_stores.to_dict()
log_and_print(Fore.WHITE, f"vector_stores.list = {json.dumps(vector_stores_data, indent=2)}")
with open(f'openai_testing/{prefix}_vector_store_list.json', 'w') as f:
json.dump(vector_stores_data, f, indent=2)
except Exception as e:
log_and_print(Fore.RED, f"Vector store list test failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
def test_retrieve_vector_store():
"""
Test retrieving a specific vector store.
"""
log_and_print(Fore.BLUE, "Starting retrieve vector store test...")
if not DEMO_VECTOR_STORE_ID:
log_and_print(Fore.YELLOW, "Skipping retrieve vector store test - no vector store ID available",
level=logging.WARNING)
return
try:
vector_store = client.vector_stores.retrieve(
vector_store_id=DEMO_VECTOR_STORE_ID,
)
# Check instead of assert
if vector_store.id != DEMO_VECTOR_STORE_ID:
log_and_print(Fore.YELLOW, "FAILED: Retrieved vector store ID does not match", level=logging.WARNING)
else:
log_and_print(Fore.GREEN, "Retrieve vector store test passed!")
vector_store_data = vector_store.to_dict()
log_and_print(Fore.WHITE, f"vector_stores.retrieve = {json.dumps(vector_store_data, indent=2)}")
with open(f'openai_testing/{prefix}_vector_store_retrieve.json', 'w') as f:
json.dump(vector_store_data, f, indent=2)
except Exception as e:
log_and_print(Fore.RED, f"Retrieve vector store test failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
def test_modify_vector_store():
"""
Test modifying a vector store.
"""
log_and_print(Fore.BLUE, "Starting modify vector store test...")
if not DEMO_VECTOR_STORE_ID:
log_and_print(Fore.YELLOW, "Skipping modify vector store test - no vector store ID available",
level=logging.WARNING)
return
try:
updated_vector_store = client.vector_stores.update(
vector_store_id=DEMO_VECTOR_STORE_ID,
name="Updated Support FAQ FJA",
)
# Check instead of assert
if updated_vector_store.name != "Updated Support FAQ FJA":
log_and_print(Fore.YELLOW, "FAILED: Vector store name was not updated correctly", level=logging.WARNING)
else:
log_and_print(Fore.GREEN, "Modify vector store test passed!")
updated_vector_store_data = updated_vector_store.to_dict()
log_and_print(Fore.WHITE, f"vector_stores.modify = {json.dumps(updated_vector_store_data, indent=2)}")
with open(f'openai_testing/{prefix}_vector_store_modify.json', 'w') as f:
json.dump(updated_vector_store_data, f, indent=2)
except Exception as e:
log_and_print(Fore.RED, f"Modify vector store test failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
def test_delete_vector_store():
"""
Test deleting a vector store.
"""
log_and_print(Fore.BLUE, "Starting delete vector store test...")
if not DEMO_VECTOR_STORE_ID2:
log_and_print(Fore.YELLOW, "Skipping delete vector store test - no second vector store ID available",
level=logging.WARNING)
return
try:
response = client.vector_stores.delete(
vector_store_id=DEMO_VECTOR_STORE_ID2,
)
log_and_print(Fore.GREEN, "Delete vector store test passed!")
response_data = response.to_dict()
log_and_print(Fore.WHITE, f"Vector store delete response = {json.dumps(response_data, indent=2)}")
with open(f'openai_testing/{prefix}_vector_store_delete.json', 'w') as f:
json.dump(response_data, f, indent=2)
except Exception as e:
log_and_print(Fore.RED, f"Delete vector store test failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
def test_create_vector_store_file():
log_and_print(Fore.BLUE, "Starting create vector store file test...")
if not DEMO_VECTOR_STORE_ID:
log_and_print(Fore.YELLOW, "Skipping create vector store file test - no vector store ID available",
level=logging.WARNING)
return
try:
# create jsonl of files as an example
with open("mydata.jsonl", "w") as f:
f.write('{"text": "What is the return policy?", "metadata": {"category": "support"}}\n')
f.write('{"text": "How do I reset my password?", "metadata": {"category": "support"}}\n')
f.write('{"text": "Where can I find my order history?", "metadata": {"category": "support"}}\n')
f.write('{"text": "What are the shipping options?", "metadata": {"category": "support"}}\n')
f.write('{"text": "What is your favorite banana?", "metadata": {"category": "support"}}\n')
# Create a simple text file if my_data_small.txt doesn't exist
if not os.path.exists("my_data_small.txt"):
with open("my_data_small.txt", "w") as f:
f.write("This is a test file for vector store testing.\n")
created_file = client.files.create(
file=open("my_data_small.txt", "rb"),
purpose="assistants",
)
created_file_data = created_file.to_dict()
log_and_print(Fore.WHITE, f"Created file {json.dumps(created_file_data, indent=2)}")
with open(f'openai_testing/{prefix}_file_create.json', 'w') as f:
json.dump(created_file_data, f, indent=2)
retrieved_files = client.files.retrieve(created_file.id)
retrieved_files_data = retrieved_files.to_dict()
log_and_print(Fore.WHITE, f"Retrieved file {json.dumps(retrieved_files_data, indent=2)}")
with open(f'openai_testing/{prefix}_file_retrieve.json', 'w') as f:
json.dump(retrieved_files_data, f, indent=2)
vector_store_file = client.vector_stores.files.create(
vector_store_id=DEMO_VECTOR_STORE_ID,
file_id=created_file.id,
)
log_and_print(Fore.GREEN, "Create vector store file test passed!")
except Exception as e:
log_and_print(Fore.RED, f"Create vector store file test failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
def test_search_vector_store():
"""
Test searching a vector store.
"""
log_and_print(Fore.BLUE, "Starting search vector store test...")
if not DEMO_VECTOR_STORE_ID:
log_and_print(Fore.YELLOW, "Skipping search vector store test - no vector store ID available",
level=logging.WARNING)
return
try:
query = "What is the banana policy?"
search_results = client.vector_stores.search(
vector_store_id=DEMO_VECTOR_STORE_ID,
query=query,
max_num_results=10,
ranking_options={
'ranker': 'default-2024-11-15',
'score_threshold': 0.0,
},
rewrite_query=False,
)
# Check instead of assert
if not isinstance(search_results, pagination.SyncPage):
log_and_print(Fore.YELLOW, f"FAILED: Expected a list of search results, got {type(search_results)}",
level=logging.WARNING)
else:
log_and_print(Fore.GREEN, "Search vector store test passed!")
search_results_dict = search_results.to_dict()
log_and_print(Fore.WHITE, f"Search results = {search_results_dict}")
with open(f'openai_testing/{prefix}_vector_store_search.json', 'w') as f:
json.dump(search_results_dict, f, indent=2)
log_and_print(Fore.WHITE, f"vector_stores.search = {search_results.to_json()}")
except Exception as e:
log_and_print(Fore.RED, f"Search vector store test failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
# Run all tests in sequence, even if some fail
test_results = []
try:
result = test_idempotent_vector_store_creation()
if result and len(result) == 2:
DEMO_VECTOR_STORE_ID, DEMO_VECTOR_STORE_ID2 = result
test_results.append(True)
except Exception as e:
log_and_print(Fore.RED, f"Vector store creation test failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
test_results.append(False)
for test_func in [
test_vector_store_list,
test_retrieve_vector_store,
test_modify_vector_store,
test_delete_vector_store,
test_create_vector_store_file,
test_search_vector_store
]:
try:
test_func()
test_results.append(True)
except Exception as e:
log_and_print(Fore.RED, f"{test_func.__name__} failed: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
test_results.append(False)
if all(test_results):
log_and_print(Fore.GREEN, f"All {prefix} tests completed successfully!")
else:
failed_count = test_results.count(False)
log_and_print(Fore.YELLOW, f"{failed_count} {prefix} test(s) failed, but script completed.")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run OpenAI and/or LlamaStack tests.")
parser.add_argument(
"--provider",
type=str,
default="llama",
choices=["openai", "llama", "both"],
help="Specify which environment to test: openai, llama, or both. Default is both.",
)
args = parser.parse_args()
try:
if args.provider in ("openai", "both"):
openai_client = OpenAI()
run_tests(openai_client, prefix="openai")
if args.provider in ("llama", "both"):
llama_client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")
run_tests(llama_client, prefix="llama")
log_and_print(Fore.GREEN, "All tests completed!")
except Exception as e:
log_and_print(Fore.RED, f"Tests failed to complete: {e}", level=logging.ERROR)
logging.error(traceback.format_exc())
```
</details>
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This PR adds the keyword search implementation for Milvus. Along with
the implementation for remote Milvus, the tests require us to start a
Milvus containers locally.
In order to verify the implementation, run:
```
pytest tests/unit/providers/vector_io/remote/test_milvus.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
You can also test the changes using the below script:
```
#!/usr/bin/env python3
import asyncio
import os
import uuid
from typing import List
from llama_stack_client import (
Agent,
AgentEventLogger,
LlamaStackClient,
RAGDocument
)
class MilvusRAGDemo:
def __init__(self, base_url: str = "http://localhost:8321/"):
self.client = LlamaStackClient(base_url=base_url)
self.vector_db_id = f"milvus_rag_demo_{uuid.uuid4().hex[:8]}"
self.model_id = None
self.embedding_model_id = None
self.embedding_dimension = None
def setup_models(self):
"""Get available models and select appropriate ones for LLM and embeddings."""
models = self.client.models.list()
# Select embedding model
embedding_models = [m for m in models if m.model_type == "embedding"]
if not embedding_models:
raise ValueError("No embedding models found")
self.embedding_model_id = embedding_models[0].identifier
self.embedding_dimension = embedding_models[0].metadata["embedding_dimension"]
def register_vector_db(self):
print(f"Registering Milvus vector database: {self.vector_db_id}")
response = self.client.vector_dbs.register(
vector_db_id=self.vector_db_id,
embedding_model=self.embedding_model_id,
embedding_dimension=self.embedding_dimension,
provider_id="milvus-remote", # Use remote Milvus
)
print(f"Vector database registered successfully")
return response
def insert_documents(self):
"""Insert sample documents into the vector database."""
print("\nInserting sample documents...")
# Sample documents about different topics
documents = [
RAGDocument(
document_id="ai_ml_basics",
content="""
Artificial Intelligence (AI) and Machine Learning (ML) are transforming the world.
AI refers to the simulation of human intelligence in machines, while ML is a subset
of AI that enables computers to learn and improve from experience without being
explicitly programmed. Deep learning, a subset of ML, uses neural networks with
multiple layers to process complex patterns in data.
Key concepts in AI/ML include:
- Supervised Learning: Training with labeled data
- Unsupervised Learning: Finding patterns in unlabeled data
- Reinforcement Learning: Learning through trial and error
- Neural Networks: Computing systems inspired by biological brains
""",
mime_type="text/plain",
metadata={"topic": "technology", "category": "ai_ml"},
),
]
# Insert documents with chunking
self.client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=self.vector_db_id,
chunk_size_in_tokens=200, # Smaller chunks for better granularity
)
print(f"Inserted {len(documents)} documents with chunking")
def test_keyword_search(self):
"""Test keyword-based search using BM25."""
queries = [
"neural networks",
"Python frameworks",
"data cleaning",
]
for query in queries:
response = self.client.vector_io.query(
vector_db_id=self.vector_db_id,
query=query,
params={
"mode": "keyword", # Keyword search
"max_chunks": 3,
"score_threshold": 0.0,
}
)
for i, (chunk, score) in enumerate(zip(response.chunks, response.scores)):
print(f" {i+1}. Score: {score:.4f}")
print(f" Content: {chunk.content[:100]}...")
print(f" Metadata: {chunk.metadata}")
def run_demo(self):
try:
self.setup_models()
self.register_vector_db()
self.insert_documents()
self.test_keyword_search()
except Exception as e:
print(f"Error during demo: {e}")
raise
def main():
"""Main function to run the demo."""
# Check if Llama Stack server is running
demo = MilvusRAGDemo()
try:
demo.run_demo()
except Exception as e:
print(f"Demo failed: {e}")
if __name__ == "__main__":
main()
```
[//]: # (## Documentation)
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
- fireworks, together do not support Llama-guard 3 8b model anymore
- Need to default to ollama
- current safety shields logic was not correct since the shield_id was
the provider ( which had duplicates )
- Followed similar logic to models
Note: Seems a bit over-engineered but this can now be extended to other
providers and fits in the overall mechanism of how env_vars are used to
manage starter.
### How to test
```
ENABLE_OLLAMA=ollama ENABLE_FIREWORKS=fireworks SAFETY_MODEL=llama-guard3:1b pytest -s -v tests/integration/ --stack-config starter -k 'not(supervised_fine_tune or builtin_tool_code or safety_with_image or code_interpreter_for or rag_and_code or truncation or register_and_unregister)' --text-model fireworks/meta-llama/Llama-3.3-70B-Instruct --vision-model fireworks/meta-llama/Llama-4-Scout-17B-16E-Instruct --safety-shield llama-guard3:1b --embedding-model all-MiniLM-L6-v2
```
### Related but not obvious in this PR
In the llama-stack-ops repo, we run tests before publishing packages and
docker containers.
The actions in that repo were using the fireworks / together distros (
which are non-existent )
So need to update that to run with `starter` and use `ollama`
specifically for safety.
# What does this PR do?
inference providers each have a static list of supported / known models.
some also have access to a dynamic list of currently available models.
this change gives prodivers using the ModelRegistryHelper the ability to
combine their static and dynamic lists.
for instance, OpenAIInferenceAdapter can implement
```
def query_available_models(self) -> list[str]:
return [entry.model for entry in self.openai_client.models.list()]
```
to augment its static list w/ a current list from openai.
## Test Plan
scripts/unit-test.sh
Remove both the metadata and content from the kvstore when a file is
being removed from the vector store.
Closes: #2685
Also add faiss provider to openai_vector_stores test suite
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: raghotham <rsm@meta.com>
# What does this PR do?
Some of our inference providers support passthrough authentication via
`x-llamastack-provider-data` header values. This fixes the providers
that support passthrough auth to not cache their clients to the backend
providers (mostly OpenAI client instances) so that the client connecting
to Llama Stack has to provide those auth values on each and every
request.
## Test Plan
I added some unit tests to ensure we're not caching clients across
requests for all the fixed providers in this PR.
```
uv run pytest -sv tests/unit/providers/inference/test_inference_client_caching.py
```
I also ran some of our OpenAI compatible API integration tests for each
of the changed providers, just to ensure they still work. Note that
these providers don't actually pass all these tests (for unrelated
reasons due to quirks of the Groq and Together SaaS services), but
enough of the tests passed to confirm the clients are still working as
intended.
### Together
```
ENABLE_TOGETHER="together" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "together/meta-llama/Llama-3.1-8B-Instruct"
```
### OpenAI
```
ENABLE_OPENAI="openai" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "openai/gpt-4o-mini"
```
### Groq
```
ENABLE_GROQ="groq" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "groq/meta-llama/Llama-3.1-8B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Update the shield register validation of Sambanova not to raise, but
only warn when a model is not available in the base url endpoint used,
also added warnings when model is not available in the base url endpoint
used
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
run starter distro with Sambanova enabled
# What does this PR do?
The current authorized sql store implementation does not respect
user.principal (only checks attributes). This PR addresses that.
## Test Plan
Added test cases to integration tests.
# What does this PR do?
This PR refactors and the VectorIO backend logic for `sqlite-vec` and
adds unit tests and fixtures to make it easy to test both `sqlite-vec`
and `milvus`.
Key changes:
- `sqlite-vec` migrated to `kvstore` registry
- added in-memory cache for sqlite-vec to be consistent with `milvus`
- default fixtures moved to `conftest.py`
- removed redundant tests from sqlite`-vec`
- made `test_vector_io_openai_vector_stores.py` more easily extensible
## Test Plan
Unit tests added testing inline providers.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
While investigating the `uv.lock` changes made in
https://github.com/meta-llama/llama-stack/pull/2695 I noticed several of
the pre-commit hook versions were out of date
This PR updates them and fixes some new `ruff` errors
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
We are now testing the safety capability with the starter image. This
includes a few changes:
* Enable the safety integration test
* Relax the shield model requirements from llama-guard to make it work
with llama-guard3:8b coming from Ollama
* Expose a shield for each inference provider in the starter distro. The
shield will only be registered if the provider is enabled.
Closes: https://github.com/meta-llama/llama-stack/issues/2528
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
- Fix constructor call missing files_api parameter
- Add kvstore field to MilvusVectorIOConfig
- Resolves#2626
# What does this PR do?
[https://github.com/meta-llama/llama-stack/issues/2626]
## Problem
The `MilvusVectorIOAdapter` fails to initialize due to two missing
configuration issues:
1. Missing `files_api` parameter in the constructor call
2. Missing `kvstore` field in the `MilvusVectorIOConfig` class
## Root Cause
1. The adapter constructor expects 3 parameters `(config, inference_api,
files_api)` but the `get_adapter_impl` function only passes 2 parameters
2. The `MilvusVectorIOConfig` class lacks the `kvstore` field that the
adapter's `initialize()` method expects for metadata persistence
## Solution
- Added `files_api = deps.get(Api.files, None)` to safely retrieve files
API from dependencies
- Pass the files_api parameter to MilvusVectorIOAdapter constructor
- Added `kvstore: KVStoreConfig | None = None` field to
MilvusVectorIOConfig
- Maintains backward compatibility since both files_api and kvstore can
be None
Closes#2626
## Test Plan
- [x] Tested with Milvus configuration - server starts successfully
```yaml
vector_io:
- provider_id: milvus
provider_type: remote::milvus
config:
uri: http://localhost:19530
token: root:Milvus
kvstore:
type: sqlite
namespace: null
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/remote-vllm}/milvus_store.db
```
- [x] Vector operations work as expected
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types.shared_params.document import Document as RAGDocument
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger as AgentEventLogger
import os
endpoint = os.getenv("LLAMA_STACK_ENDPOINT")
model = os.getenv("INFERENCE_MODEL")
# Initialize the client
client = LlamaStackClient(base_url=endpoint)
vector_db_id = "my_documents"
response = client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model="all-MiniLM-L6-v2",
embedding_dimension=384,
provider_id="milvus",
)
urls = ["getting_started/Red_Hat_AI_Inference_Server-3.0-Getting_started-en-US.pdf", "vllm_server_arguments/Red_Hat_AI_Inference_Server-3.0-vLLM_server_arguments-en-US.pdf"]
documents = [
RAGDocument(
document_id=f"num-{i}",
content=f"https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.0/pdf/{url}",
mime_type="application/pdf",
metadata={},
)
for i, url in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=512,
)
rag_agent = Agent(
client,
model=model,
# Define instructions for the agent (system prompt)
instructions="You are a helpful assistant",
enable_session_persistence=False,
# Define tools available to the agent
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {
"vector_db_ids": [vector_db_id],
},
}
],
)
session_id = rag_agent.create_session("test-session")
user_prompts = [
"How to start the AI Inference Server container image? use the knowledge_search tool to get information.",
]
for prompt in user_prompts:
print(f"User> {prompt}")
response = rag_agent.create_turn(
messages=[{"role": "user", "content": prompt}],
session_id=session_id,
)
for log in AgentEventLogger().log(response):
log.print()
```
server logs:
```
INFO 2025-07-04 22:18:30,385 __main__:577 server: Listening on ['::', '0.0.0.0']:5000
INFO: Started server process [769725]
INFO: Waiting for application startup.
INFO 2025-07-04 22:18:30,390 __main__:158 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
INFO 2025-07-04 22:18:52,193 llama_stack.distribution.routing_tables.common:200 core: Setting owner for vector_db 'my_documents' to
20:18:52.194 [START] /v1/vector-dbs
INFO: 192.168.1.249:64170 - "POST /v1/vector-dbs HTTP/1.1" 200 OK
20:18:52.216 [END] /v1/vector-dbs [StatusCode.OK] (21.89ms)
20:18:52.222 [START] /v1/tool-runtime/rag-tool/insert
INFO 2025-07-04 22:18:56,265 llama_stack.providers.utils.inference.embedding_mixin:102 uncategorized: Loading sentence transformer for
all-MiniLM-L6-v2...
WARNING 2025-07-04 22:18:59,214 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
INFO 2025-07-04 22:18:59,339 sentence_transformers.SentenceTransformer:219 uncategorized: Use pytorch device_name: cuda:0
INFO 2025-07-04 22:18:59,340 sentence_transformers.SentenceTransformer:227 uncategorized: Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO: 192.168.1.249:64170 - "POST /v1/tool-runtime/rag-tool/insert HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "POST /v1/agents HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "GET /v1/tools?toolgroup_id=builtin%3A%3Arag%2Fknowledge_search HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "POST /v1/agents/b1f6f063-1691-4780-8d9e-facd81708b91/session HTTP/1.1" 200 OK
20:19:01.834 [END] /v1/tool-runtime/rag-tool/insert [StatusCode.OK] (9612.06ms)
20:19:01.839 [START] /v1/agents
INFO: 192.168.1.249:64170 - "POST /v1/agents/b1f6f063-1691-4780-8d9e-facd81708b91/session/d2706302-bb54-421d-a890-5e25df9cb47f/turn HTTP/1.1" 200 OK
20:19:01.839 [END] /v1/agents [StatusCode.OK] (0.18ms)
20:19:01.844 [START] /v1/tools
INFO 2025-07-04 22:19:01,853 llama_stack.providers.remote.inference.vllm.vllm:330 uncategorized: Initializing vLLM client with
base_url=http://192.168.1.183:8080/v1
20:19:01.858 [END] /v1/tools [StatusCode.OK] (14.92ms)
20:19:01.868 [START] /v1/agents/{agent_id}/session
20:19:01.868 [END] /v1/agents/{agent_id}/session [StatusCode.OK] (0.37ms)
20:19:01.873 [START] /v1/agents/{agent_id}/session/{session_id}/turn
20:19:01.885 [START] inference
20:19:05.506 [END] inference [StatusCode.OK] (3621.19ms)
INFO 2025-07-04 22:19:05,537 llama_stack.providers.inline.agents.meta_reference.agent_instance:890 agents: executing tool call: knowledge_search
with args: {'query': 'How to start the AI Inference Server container image'}
20:19:05.538 [START] tool_execution
20:19:05.928 [END] tool_execution [StatusCode.OK] (390.08ms)
20:19:05.538 [INFO] executing tool call: knowledge_search with args: {'query': 'How to start the AI Inference Server container image'}
20:19:05.935 [START] inference
20:19:17.539 [END] inference [StatusCode.OK] (11603.76ms)
20:19:17.560 [END] /v1/agents/{agent_id}/session/{session_id}/turn [StatusCode.OK] (15686.62ms)
```
- [x] No regressions in functionality
- [x] Configuration properly accepts kvstore settings
---------
Co-authored-by: Peter Gustafsson <peter.gustafsson6@gmail.com>
Co-authored-by: raghotham <rsm@meta.com>
Co-authored-by: Francisco Arceo <farceo@redhat.com>
# What does this PR do?
- Enabling Unit tests for Milvus to start to test OpenAI compatibility
and fixing a few bugs.
- Also fixed an inconsistency in the Milvus config between remote and
inline.
- Added pymilvus to extras for testing in CI
I'm going to refactor this later to include the other inline providers
so that we can catch issues sooner.
I have another PR where I've been testing to find other bugs in the
implementation (and required changes drafted here:
https://github.com/meta-llama/llama-stack/pull/2617).
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- we are using `all-minilm:l6-v2` but the model we download from ollama
is `all-minilm:latest`
latest: https://ollama.com/library/all-minilm:latest 1b226e2802db
l6-v2: https://ollama.com/library/all-minilm:l6-v2 pin 1b226e2802db
- even currently they are exactly the same model but if
[all-minilm:l12-v2](https://ollama.com/library/all-minilm:l12-v2) is
updated, "latest" might not be the same for l6-v2.
- the only change in this PR is pin the model id in ollama
- also update detailed_tutorial with "starter" to replace deprecated
"ollama".
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
```
>INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
>llama stack build --run --template ollama --image-type venv
...
Build Successful!
You can find the newly-built template here: /home/wenzhou/zdtsw-forking/lls/llama-stack/llama_stack/templates/ollama/run.yaml
....
- metadata:
embedding_dimension: 384
model_id: all-MiniLM-L6-v2
model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
- embedding
provider_id: ollama
provider_model_id: all-minilm:l6-v2
...
```
test
```
>llama-stack-client inference chat-completion --message "Write me a 2-sentence poem about the moon"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
OpenAIChatCompletion(
id='chatcmpl-04f99071-3da2-44ba-a19f-03b5b7fc70b7',
choices=[
OpenAIChatCompletionChoice(
finish_reason='stop',
index=0,
message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(
role='assistant',
content="Here is a 2-sentence poem about the moon:\n\nSilver crescent in the midnight sky,\nLuna's gentle face, a beauty to the eye.",
name=None,
tool_calls=None,
refusal=None,
annotations=None,
audio=None,
function_call=None
),
logprobs=None
)
],
created=1751644429,
model='llama3.2:3b-instruct-fp16',
object='chat.completion',
service_tier=None,
system_fingerprint='fp_ollama',
usage={'completion_tokens': 33, 'prompt_tokens': 36, 'total_tokens': 69, 'completion_tokens_details': None, 'prompt_tokens_details': None}
)
```
---------
Signed-off-by: Wen Zhou <wenzhou@redhat.com>