# What does this PR do?
Move to use vector_stores.search for file search tool in Responses,
which supports filters.
closes#2435
## Test Plan
Added e2e test with fitlers.
myenv ❯ llama stack run llama_stack/templates/fireworks/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search and filters' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.3-70B-Instruct
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
To add health check for faiss inline vector_io provider.
I tried adding `async def health(self) -> HealthResponse:` like in
inference provider, but it didn't worked for `inline->vector_io->faiss`
provider. And via debug logs, I understood the critical issue, that the
health responses are being stored with the API name as the key, not as a
nested dictionary with provider IDs. This means that all providers of
the same API type (e.g., "vector_io") will share the same health
response, and only the last one processed will be visible in the API
response.
I've created a patch file that fixes this issue by:
- Storing the original get_providers_health method
- Creating a patched version that correctly maps health responses to
providers
- Applying the patch to the `ProviderImpl` class
Not an expert, so please let me know, if there can be any other
workaround using which I can get the health status updated directly from
`faiss.py`.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Added unit tests to test the provider patch implementation in the PR.
Adding a screenshot with the FAISS inline vector_io health status as
"OK"

…path
# What does this PR do?
Closes#1847
Changes:
- llama_stack/apis/common/responses.py: adds optional `url` field to
PaginatedResponse
- llama_stack/distribution/server/server.py: automatically populate the
URL field with route path
## Test Plan
- Built and ran llama stack server using the following cmds:
```bash
export INFERENCE_MODEL=llama3.1:8b
llama stack build --run --template ollama --image-type container
llama stack run llama_stack/templates/ollama/run.yaml
```
- Ran `curl` to test if we are seeing the `url` param in response:
```bash
curl -X 'GET' \
'http://localhost:8321/v1/agents' \
-H 'accept: application/json'
```
- Expected and Received Output:
`{"data":[],"has_more":false,"url":"/v1/agents"}`
---------
Co-authored-by: Rohan Awhad <rawhad@redhat.com>
For code completion apps need "fill in the middle" capabilities.
Added option of `suffix` to `openai_completion` to enable this.
Updated ollama provider to showcase the same.
### Test Plan
```
pytest -sv --stack-config="inference=ollama" tests/integration/inference/test_openai_completion.py --text-model qwen2.5-coder:1.5b -k test_openai_completion_non_streaming_suffix
```
### OpenAI Sample script
```
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1")
response = client.completions.create(
model="qwen2.5-coder:1.5b",
prompt="The capital of ",
suffix="is Paris.",
max_tokens=10,
)
print(response.choices[0].text)
```
### Output
```
France is ____.
To answer this question, we
```
# What does this PR do?
Add support for hybrid search mode in SQLite-vec provider, which
combines
keyword and vector search for better results. The implementation:
- Adds hybrid search mode as a new option alongside vector and keyword
search
- Implements query_hybrid method in SQLiteVecIndex that:
- First performs keyword search to get candidate matches
- Then applies vector similarity search on those candidates
- Updates documentation to reflect the new search mode
This change improves search quality by leveraging both semantic
similarity
and keyword matching, while maintaining backward compatibility with
existing
vector and keyword search modes.
## Test Plan
```
pytest tests/unit/providers/vector_io/test_sqlite_vec.py -v -s --tb=short
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:217: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=============================================================================================== test session starts ===============================================================================================
platform darwin -- Python 3.10.16, pytest-8.3.5, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.6-arm64-arm-64bit', 'Packages': {'pytest': '8.3.5', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'json-report': '1.5.0', 'timeout': '2.4.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'asyncio': '0.26.0', 'nbval': '0.11.0', 'cov': '6.1.1'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, json-report-1.5.0, timeout-2.4.0, metadata-3.1.1, anyio-4.8.0, asyncio-0.26.0, nbval-0.11.0, cov-6.1.1
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 10 items
tests/unit/providers/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_full_text_search PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_full_text_search_k_greater_than_results PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_no_keyword_matches PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_score_threshold PASSED
tests/unit/providers/vector_io/test_sqlite_vec.py::test_query_chunks_hybrid_different_embedding PASSED
```
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
This is an initial working prototype of wiring up the `file_search`
builtin tool for the Responses API to our existing rag knowledge search
tool.
This is me seeing what I could pull together on top of the bits we
already have merged. This may not be the ideal way to implement this,
and things like how I shuffle the vector store ids from the original
response API tool request to the actual tool execution feel a bit hacky
(grep for `tool_kwargs["vector_db_ids"]` in `_execute_tool_call` to see
what I mean).
## Test Plan
I stubbed in some new tests to exercise this using text and pdf
documents.
Note that this is currently under tests/verification only because it
sometimes flakes with tool calling of the small Llama-3.2-3B model we
run in CI (and that I use as an example below). We'd want to make the
test a bit more robust in some way if we moved this over to
tests/integration and ran it in CI.
### OpenAI SaaS (to verify test correctness)
```
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=https://api.openai.com/v1 \
--model=gpt-4o
```
### Fireworks with faiss vector store
```
llama stack run llama_stack/templates/fireworks/run.yaml
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.3-70B-Instruct
```
### Ollama with faiss vector store
This sometimes flakes on Ollama because the quantized small model
doesn't always choose to call the tool to answer the user's question.
But, it often works.
```
ollama run llama3.2:3b
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run ./llama_stack/templates/ollama/run.yaml \
--image-type venv \
--env OLLAMA_URL="http://0.0.0.0:11434"
pytest -sv tests/verifications/openai_api/test_responses.py \
-k'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=meta-llama/Llama-3.2-3B-Instruct
```
### OpenAI provider with sqlite-vec vector store
```
llama stack run ./llama_stack/templates/starter/run.yaml --image-type venv
pytest -sv tests/verifications/openai_api/test_responses.py \
-k 'file_search' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model=openai/gpt-4o-mini
```
### Ensure existing vector store integration tests still pass
```
ollama run llama3.2:3b
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run ./llama_stack/templates/ollama/run.yaml \
--image-type venv \
--env OLLAMA_URL="http://0.0.0.0:11434"
LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io \
--text-model "meta-llama/Llama-3.2-3B-Instruct" \
--embedding-model=all-MiniLM-L6-v2
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
This PR adds OpenAI compatibility for Ollama embeddings. Closes
https://github.com/meta-llama/llama-stack/issues/2428
Summary of changes:
- `llama_stack/providers/remote/inference/ollama/ollama.py`
- Implements the OpenAI embeddings endpoint for Ollama, replacing the
NotImplementedError with a full function that validates the model,
prepares parameters, calls the client, encodes embedding data
(optionally in base64), and returns a correctly structured response.
- Updates import statements to include the new embedding response
utilities.
- `llama_stack/providers/utils/inference/litellm_openai_mixin.py`
- Refactors the embedding data encoding logic to use a new shared
utility (`b64_encode_openai_embeddings_response`) instead of inline
base64 encoding and packing logic.
- Cleans up imports accordingly.
- `llama_stack/providers/utils/inference/openai_compat.py`
- Adds `b64_encode_openai_embeddings_response` to handle encoding OpenAI
embedding outputs (including base64 support) in a reusable way.
- Adds `prepare_openai_embeddings_params` utility for standardizing
embedding parameter preparation.
- Updates imports to include the new embedding data class.
- `tests/integration/inference/test_openai_embeddings.py`
- Removes `"remote::ollama"` from the list of providers that skip OpenAI
embeddings tests, since support is now implemented.
## Note
There was one minor issue, which required me to override the
`OpenAIEmbeddingsResponse.model` name with
`self._get_model(model).identifier` name, which is very unsatisfying.
## Test Plan
Unit Tests and integration tests
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Though the jwks endpoint does not usually require authentication, it
does in a kubernetes cluster. While the cluster can be configured to
allow anonymous access to that endpoint, this avoids the need to do so.
https://github.com/meta-llama/llama-stack-client-python/pull/238 updated
llama-stack-client to also support Open AI endpoints for embeddings,
files, vector-stores. This updates the test to test all configs --
openai sdk, llama stack sdk and library-as-client.
Updated the `search` functionality return response to match openai.
## Test Plan
```
pytest -sv --stack-config=http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2
```
# What does this PR do?
Fixes provider weaviate `query_vector` function for when the distance
between the query embedding and an embedding within the vector db is 0
(identical vectors). Catches `ZeroDivisionError` and then sets `score`
to infinity, which represent maximum similarity.
<!-- If resolving an issue, uncomment and update the line below -->
Closes [#2381]
## Test Plan
Checkout this PR
Execute this code and there will no longer be a `ZeroDivisionError`
exception
```
from llama_stack_client import LlamaStackClient
base_url = "http://localhost:8321"
client = LlamaStackClient(base_url=base_url)
models = client.models.list()
embedding_model = (
em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = 384
_ = client.vector_dbs.register(
vector_db_id="foo_db",
embedding_model=embedding_model,
embedding_dimension=embedding_dimension,
provider_id="weaviate",
)
chunk = {
"content": "foo",
"mime_type": "text/plain",
"metadata": {
"document_id": "foo-id"
}
}
client.vector_io.insert(vector_db_id="foo_db", chunks=[chunk])
client.vector_io.query(vector_db_id="foo_db", query="foo")
```
Adding OpenAI compat `/v1/vector-store` apis.
This PR implements the `faiss` provider with followup PRs coming up for
other providers.
Added routes to create, update, delete, list vector stores.
Also added route to search a vector store
Inserting into vector stores is missing and will be a follow up diff.
### Test Plan
- Added new integration test for testing the faiss provider
```
pytest -sv --stack-config http://localhost:8321 tests/integration/vector_io/test_openai_vector_stores.py --embedding-model all-MiniLM-L6-v2
```
# What does this PR do?
This loosens up the tool call function name and arguments checks in
`tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls`
because the small models we use in CI cannot reliably get the tool call
function name or arguments exactly right.
Closes#2345
## Test Plan
I ran this flaking test in a loop, let it run many dozens of times, and
didn't observe any flakes after the changes. Previously it flaked quite
regularly.
```
while uv run pytest -s -v \
'tests/integration/inference/test_openai_completion.py::test_inference_store_tool_calls[llama_stack_client-txt=3B-False]' \
--stack-config=http://localhost:8321 \
--text-model="meta-llama/Llama-3.2-3B-Instruct" \
--embedding-model=all-MiniLM-L6-v2; do; sleep 0.1; done
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Adds try-catch to faiss `query_vector` function for when the distance
between the query embedding and an embedding within the vector db is 0
(identical vectors). Catches `ZeroDivisionError` and then appends `(1.0
/ sys.float_info.min)` to `scores` to represent maximum similarity.
<!-- If resolving an issue, uncomment and update the line below -->
Closes [#2381]
## Test Plan
Checkout this PR
Execute this code and there will no longer be a `ZeroDivisionError`
exception
```
from llama_stack_client import LlamaStackClient
base_url = "http://localhost:8321"
client = LlamaStackClient(base_url=base_url)
models = client.models.list()
embedding_model = (
em := next(m for m in models if m.model_type == "embedding")
).identifier
embedding_dimension = 384
_ = client.vector_dbs.register(
vector_db_id="foo_db",
embedding_model=embedding_model,
embedding_dimension=embedding_dimension,
provider_id="faiss",
)
chunk = {
"content": "foo",
"mime_type": "text/plain",
"metadata": {
"document_id": "foo-id"
}
}
client.vector_io.insert(vector_db_id="foo_db", chunks=[chunk])
client.vector_io.query(vector_db_id="foo_db", query="foo")
```
### Running unit tests
`uv run pytest tests/unit/rag/test_rag_query.py -v`
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
To add health status check for remote VLLM
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
PR includes the unit test to test the added health check implementation
feature.
# What does this PR do?
Instead of downloading the models each time we now have a single Ollama
container that is baked with the models pulled and ready to use.
This will remove the CI flakiness on model pulling.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
## Test Plan
LLAMA_STACK_CONFIG=fireworks pytest -s -v
tests/integration/files/test_files.py::test_openai_client_basic_operations
The non-streaming version is just a small layer on top of the streaming
version - just pluck off the final `response.completed` event and return
that as the response!
This PR also includes a couple other changes which I ended up making
while working on it on a flight:
- changes to `ollama` so it does not pull embedding models
unconditionally
- a small fix to library client to make the stream and non-stream cases
a bit more symmetric
This allows a set of rules to be defined for determining access to
resources. The rules are (loosely) based on the cedar policy format.
A rule defines a list of action either to permit or to forbid. It may
specify a principal or a resource that must match for the rule to take
effect. It may also specify a condition, either a 'when' or an 'unless',
with additional constraints as to where the rule applies.
A list of rules is held for each type to be protected and tried in order
to find a match. If a match is found, the request is permitted or
forbidden depening on the type of rule. If no match is found, the
request is denied. If no rules are specified for a given type, a rule
that allows any action as long as the resource attributes match the user
attributes is added (i.e. the previous behaviour is the default.
Some examples in yaml:
```
model:
- permit:
principal: user-1
actions: [create, read, delete]
comment: user-1 has full access to all models
- permit:
principal: user-2
actions: [read]
resource: model-1
comment: user-2 has read access to model-1 only
- permit:
actions: [read]
when:
user_in: resource.namespaces
comment: any user has read access to models with matching attributes
vector_db:
- forbid:
actions: [create, read, delete]
unless:
user_in: role::admin
comment: only user with admin role can use vector_db resources
```
---------
Signed-off-by: Gordon Sim <gsim@redhat.com>
# What does this PR do?
This adds the missing `text` parameter to the Responses API that is how
users control structured outputs. All we do with that parameter is map
it to the corresponding chat completion response_format.
## Test Plan
The new unit tests exercise the various permutations allowed for this
property, while a couple of new verification tests actually use it for
real to verify the model outputs are following the format as expected.
Unit tests:
`python -m pytest -s -v
tests/unit/providers/agents/meta_reference/test_openai_responses.py`
Verification tests:
```
llama stack run llama_stack/templates/together/run.yaml
pytest -s -vv 'tests/verifications/openai_api/test_responses.py' \
--base-url=http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Note that the verification tests can only be run with a real Llama Stack
server (as opposed to using the library client via
`--provider=stack:together`) because the Llama Stack python client is
not yet updated to accept this text field.
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
TSIA
Added Files provider to the fireworks template. Might want to add to all
templates as a follow-up.
## Test Plan
llama-stack pytest tests/unit/files/test_files.py
llama-stack llama stack build --template fireworks --image-type conda
--run
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s -v
tests/integration/files/
I think the implementation needs more simplification. Spent way too much
time trying to get the tests pass with models not co-operating :(
Finally had to switch claude-sonnet to get things to pass reliably.
### Test Plan
```
export TAVILY_SEARCH_API_KEY=...
export OPENAI_API_KEY=...
uv run pytest -p no:warnings \
-s -v tests/verifications/openai_api/test_responses.py \
--provider=stack:starter \
--model openai/gpt-4o
```
# What does this PR do?
The remote-vllm `test_chat_completion_doesnt_block_event_loop` unit test
was often failing for me on a Mac with a `httpx.ReadError`. I traced
this back to the swap to the `AsyncOpenAI` client in the remote-vllm
provider as where this started, and it looks like the async client needs
a bit more accurate HTTP request handling from our mock server.
So, this fixes that unit test to send proper Content-Type and
Content-Length headers which makes the `AsyncOpenAI` client happier on
Macs.
## Test Plan
All the test_remote_vllm.py unit tests consistently pass for me on a Mac
now, without any flaking in the event loop one.
`pytest -s -v tests/unit/providers/inference/test_remote_vllm.py`
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Adds a new endpoint that is compatible with OpenAI for embeddings api.
`/openai/v1/embeddings`
Added providers for OpenAI, LiteLLM and SentenceTransformer.
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
# What does this PR do?
* Added support postgresql inference store
* Added 'oracle' template that demos how to config postgresql stores
(except for telemetry, which is not supported currently)
## Test Plan
llama stack build --template oracle --image-type conda --run
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -s -v tests/integration/
--text-model accounts/fireworks/models/llama-v3p3-70b-instruct -k
'inference_store'
We must store the full (re-hydrated) input not just the original input
in the Response object. Of course, this is not very space efficient and
we should likely find a better storage scheme so that we can only store
unique entries in the database and then re-hydrate them efficiently
later. But that can be done safely later.
Closes https://github.com/meta-llama/llama-stack/issues/2299
## Test Plan
Unit test
# What does this PR do?
Followup of https://github.com/meta-llama/llama-stack/pull/2287. We must
use `--group` when running commands with uv.
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Fix a bug in openai_compat where choices are not indexed correctly.
## Test Plan
Added a new test.
Rerun the failed inference_store tests:
llama stack run fireworks --image-type conda
pytest -s -v tests/integration/ --stack-config http://localhost:8321 -k
'test_inference_store' --text-model meta-llama/Llama-3.3-70B-Instruct
--count 10
# What does this PR do?
Changed the test to not require tool_call in output, but still keeping
the tools params there as a smoke test.
## Test Plan
Used llama3.3 from fireworks (same as CI)
<img width="1433" alt="image"
src="https://github.com/user-attachments/assets/1e5fca98-9b4f-402e-a0bc-d9f910f2c207"
/>
Run with ollama distro and 3b model.
This adds initial streaming support to the Responses API.
This PR makes sure that the _first_ inference call made to chat
completions streams out.
There's more to be done:
- tool call output tokens need to stream out when possible
- we need to loop through multiple rounds of inference and they all need
to stream out.
## Test Plan
Added a test. Executed as:
```
FIREWORKS_API_KEY=... \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--provider=stack:fireworks --model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Then, started a llama stack fireworks distro and tested against it like
this:
```
OPENAI_API_KEY=blah \
pytest -s -v 'tests/verifications/openai_api/test_responses.py' \
--base-url http://localhost:8321/v1/openai/v1 \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
# What does this PR do?
This is not a core dependency of the distro server. It's only necessary
when using `inline::rag-runtime` or `inline::meta-reference` providers.
Signed-off-by: Sébastien Han <seb@redhat.com>
When registering a MCP endpoint, we cannot list tools (like we used to)
since the MCP endpoint may be behind an auth wall. Registration can
happen much sooner (via run.yaml).
Instead, we do listing only when the _user_ actually calls listing.
Furthermore, we cache the list in-memory in the server. Currently, the
cache is not invalidated -- we may want to periodically re-list for MCP
servers. Note that they must call `list_tools` before calling
`invoke_tool` -- we use this critically.
This will enable us to list MCP servers in run.yaml
## Test Plan
Existing tests, updated tests accordingly.
The test depends on llama's tool calling ability. In the CI, we run with
a small ollama model.
The fix might be to check for either message or function_call because
the model is flaky and we aren't really testing that behavior?
# What does this PR do?
This is not part of the official OpenAI API, but we'll use this for the
logs UI.
In order to support more filtering options, I'm adopting the newly
introduced sql store in in place of the kv store.
## Test Plan
Added integration/unit tests.
Having to run (and re-run) a server while running verifications can be
annoying while you are iterating on code. This makes it so you can use
the library client -- and because it is OpenAI client compatible, it all
works.
## Test Plan
```
pytest -s -v tests/verifications/openai_api/test_responses.py \
--provider=stack:together \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
The most interesting MCP servers are those with an authorization wall in
front of them. This PR uses the existing `provider_data` mechanism of
passing provider API keys for passing MCP access tokens (in fact,
arbitrary headers in the style of the OpenAI Responses API) from the
client through to the MCP server.
```
class MCPProviderDataValidator(BaseModel):
# mcp_endpoint => list of headers to send
mcp_headers: dict[str, list[str]] | None = None
```
Note how we must stuff the headers for all MCP endpoints into a single
"MCPProviderDataValidator". Unlike existing providers (e.g., Together
and Fireworks for inference) where we could name the provider api keys
clearly (`together_api_key`, `fireworks_api_key`), we cannot name these
keys for MCP. We have a single generic MCP provider which can serve
multiple "toolgroups". So we use a dict to combine all the headers for
all MCP endpoints you may want to use in an agentic call.
## Test Plan
See the added integration test for usage.
# What does this PR do?
* Provide sqlite implementation of the APIs introduced in
https://github.com/meta-llama/llama-stack/pull/2145.
* Introduced a SqlStore API: llama_stack/providers/utils/sqlstore/api.py
and the first Sqlite implementation
* Pagination support will be added in a future PR.
## Test Plan
Unit test on sql store:
<img width="1005" alt="image"
src="https://github.com/user-attachments/assets/9b8b7ec8-632b-4667-8127-5583426b2e29"
/>
Integration test:
```
INFERENCE_MODEL="llama3.2:3b-instruct-fp16" llama stack build --template ollama --image-type conda --run
```
```
LLAMA_STACK_CONFIG=http://localhost:5001 INFERENCE_MODEL="llama3.2:3b-instruct-fp16" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-fp16" -k 'inference_store and openai'
```
# What does this PR do?
This PR introduces support for keyword based FTS5 search with BM25
relevance scoring. It makes changes to the existing EmbeddingIndex base
class in order to support a search_mode and query_str parameter, that
can be used for keyword based search implementations.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
run
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
```
Output:
```
pytest llama_stack/providers/tests/vector_io/test_sqlite_vec.py -v -s --tb=short --disable-warnings --asyncio-mode=auto
/Users/vnarsing/miniconda3/envs/stack-client/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
====================================================== test session starts =======================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/vnarsing/miniconda3/envs/stack-client/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-14.7.4-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0'}}
rootdir: /Users/vnarsing/go/src/github/meta-llama/llama-stack
configfile: pyproject.toml
plugins: html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0
asyncio: mode=auto, asyncio_default_fixture_loop_scope=None
collected 7 items
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_add_chunks PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_vector PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_query_chunks_fts PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_chunk_id_conflict PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_register_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_unregister_vector_db PASSED
llama_stack/providers/tests/vector_io/test_sqlite_vec.py::test_generate_chunk_id PASSED
```
For reference, with the implementation, the fts table looks like below:
```
Chunk ID: 9fbc39ce-c729-64a2-260f-c5ec9bb2a33e, Content: Sentence 0 from document 0
Chunk ID: 94062914-3e23-44cf-1e50-9e25821ba882, Content: Sentence 1 from document 0
Chunk ID: e6cfd559-4641-33ba-6ce1-7038226495eb, Content: Sentence 2 from document 0
Chunk ID: 1383af9b-f1f0-f417-4de5-65fe9456cc20, Content: Sentence 3 from document 0
Chunk ID: 2db19b1a-de14-353b-f4e1-085e8463361c, Content: Sentence 4 from document 0
Chunk ID: 9faf986a-f028-7714-068a-1c795e8f2598, Content: Sentence 5 from document 0
Chunk ID: ef593ead-5a4a-392f-7ad8-471a50f033e8, Content: Sentence 6 from document 0
Chunk ID: e161950f-021f-7300-4d05-3166738b94cf, Content: Sentence 7 from document 0
Chunk ID: 90610fc4-67c1-e740-f043-709c5978867a, Content: Sentence 8 from document 0
Chunk ID: 97712879-6fff-98ad-0558-e9f42e6b81d3, Content: Sentence 9 from document 0
Chunk ID: aea70411-51df-61ba-d2f0-cb2b5972c210, Content: Sentence 0 from document 1
Chunk ID: b678a463-7b84-92b8-abb2-27e9a1977e3c, Content: Sentence 1 from document 1
Chunk ID: 27bd63da-909c-1606-a109-75bdb9479882, Content: Sentence 2 from document 1
Chunk ID: a2ad49ad-f9be-5372-e0c7-7b0221d0b53e, Content: Sentence 3 from document 1
Chunk ID: cac53bcd-1965-082a-c0f4-ceee7323fc70, Content: Sentence 4 from document 1
```
Query results:
Result 1: Sentence 5 from document 0
Result 2: Sentence 5 from document 1
Result 3: Sentence 5 from document 2
[//]: # (## Documentation)
---------
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
The cache_ttl config value is not in fact tied to the lifetime of any of
the keys, it represents the time interval between for our key cache
refresher.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Kubernetes since 1.20 exposes a JWKS endpoint that we can use with our
recent oauth2 recent implementation.
The CI test has been kept intact for validation.
Signed-off-by: Sébastien Han <seb@redhat.com>