- fireworks, together do not support Llama-guard 3 8b model anymore
- Need to default to ollama
- current safety shields logic was not correct since the shield_id was
the provider ( which had duplicates )
- Followed similar logic to models
Note: Seems a bit over-engineered but this can now be extended to other
providers and fits in the overall mechanism of how env_vars are used to
manage starter.
### How to test
```
ENABLE_OLLAMA=ollama ENABLE_FIREWORKS=fireworks SAFETY_MODEL=llama-guard3:1b pytest -s -v tests/integration/ --stack-config starter -k 'not(supervised_fine_tune or builtin_tool_code or safety_with_image or code_interpreter_for or rag_and_code or truncation or register_and_unregister)' --text-model fireworks/meta-llama/Llama-3.3-70B-Instruct --vision-model fireworks/meta-llama/Llama-4-Scout-17B-16E-Instruct --safety-shield llama-guard3:1b --embedding-model all-MiniLM-L6-v2
```
### Related but not obvious in this PR
In the llama-stack-ops repo, we run tests before publishing packages and
docker containers.
The actions in that repo were using the fireworks / together distros (
which are non-existent )
So need to update that to run with `starter` and use `ollama`
specifically for safety.
# What does this PR do?
Adds a template for technical debt. Currently we don't support blank
issues so everything filed has to a bug or a feature.
This would allow maintainers as well as community members to track
things we might want to merge to expose the functionality but should be
addressed later. Such things can also be "good first issues" for new
contributors.
## Example of what we constitute as technical debt
Inelegant code solutions, tests we intend to temporarily disable but
would like to restore, CI hacks around infrastructure or installation,
etc.
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
inference providers each have a static list of supported / known models.
some also have access to a dynamic list of currently available models.
this change gives prodivers using the ModelRegistryHelper the ability to
combine their static and dynamic lists.
for instance, OpenAIInferenceAdapter can implement
```
def query_available_models(self) -> list[str]:
return [entry.model for entry in self.openai_client.models.list()]
```
to augment its static list w/ a current list from openai.
## Test Plan
scripts/unit-test.sh
Remove both the metadata and content from the kvstore when a file is
being removed from the vector store.
Closes: #2685
Also add faiss provider to openai_vector_stores test suite
---------
Signed-off-by: Derek Higgins <derekh@redhat.com>
Co-authored-by: raghotham <rsm@meta.com>
# What does this PR do?
This PR improves documentation clarity around run.yaml file usage. It
adds comprehensive guidance to help users understand that generated
run.yaml files are templates meant to be customized for production use,
not used as-is.
## Changes
- Add new documentation section on customizing run.yaml files
- Clarify that generated run.yaml files are templates, not production
configs
- Add guidance on customization best practices and common scenarios
- Update existing documentation to reference customization guide
- Improve clarity around run.yaml file usage for better user experience
## Test Plan
- Verified new documentation file exists at correct location
- Confirmed documentation is properly integrated into the toctree
structure
- Checked all internal links use correct paths and reference existing
files
- Validated references are added to relevant existing documentation
files
- Documentation build testing will be handled by CI environment
# What does this PR do?
Adds input validation for mode in RagQueryConfig
This will prevent users from inputting search modes other than `vector`
and `keyword` for the time being with `hybrid` to follow when that
functionality is implemented.
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
```
# Check out this PR and enter the LS directory
uv sync --extra dev
```
Run the quickstart
[example](https://llama-stack.readthedocs.io/en/latest/getting_started/#step-3-run-the-demo)
Alter the Agent to include a query_config
```
agent = Agent(
client,
model=model_id,
instructions="You are a helpful assistant",
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {
"vector_db_ids": [vector_db_id],
"query_config": {
"mode": "i-am-not-vector", # Test for non valid search mode
"max_chunks": 6
}
},
}
],
)
```
Ensure you get the following error:
```
400: {'errors': [{'loc': ['mode'], 'msg': "Value error, mode must be either 'vector' or 'keyword' if supported by the vector_io provider", 'type': 'value_error'}]}
```
## Running unit tests
```
uv sync --extra dev
uv run pytest tests/unit/rag/test_rag_query.py -v
```
[//]: # (## Documentation)
# What does this PR do?
- Adds two pages to UI
- Vector stores
- Vector store detail view
- Fixed darkmode navbar highlighting
- Updated darkmode font color
- Updated llama-stack-client package
<img width="1916" height="734" alt="Screenshot 2025-07-12 at 11 34
35 PM"
src="https://github.com/user-attachments/assets/3f9b6727-ee82-4e6b-9555-2e3ef36d24d2"
/>
<img width="1912" height="910" alt="Screenshot 2025-07-12 at 11 57
09 PM"
src="https://github.com/user-attachments/assets/0c9d3b5e-5592-4dfb-8e04-a57edc9fb406"
/>
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
this blocks network access for all `tests/unit/` tests.
`tests/integration/` are untouched.
it also introduces an `allow_network` marker to explicitly allow network
access.
## Test Plan
`./scripts/unit-tests.sh`
# What does this PR do?
Some of our inference providers support passthrough authentication via
`x-llamastack-provider-data` header values. This fixes the providers
that support passthrough auth to not cache their clients to the backend
providers (mostly OpenAI client instances) so that the client connecting
to Llama Stack has to provide those auth values on each and every
request.
## Test Plan
I added some unit tests to ensure we're not caching clients across
requests for all the fixed providers in this PR.
```
uv run pytest -sv tests/unit/providers/inference/test_inference_client_caching.py
```
I also ran some of our OpenAI compatible API integration tests for each
of the changed providers, just to ensure they still work. Note that
these providers don't actually pass all these tests (for unrelated
reasons due to quirks of the Groq and Together SaaS services), but
enough of the tests passed to confirm the clients are still working as
intended.
### Together
```
ENABLE_TOGETHER="together" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "together/meta-llama/Llama-3.1-8B-Instruct"
```
### OpenAI
```
ENABLE_OPENAI="openai" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "openai/gpt-4o-mini"
```
### Groq
```
ENABLE_GROQ="groq" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "groq/meta-llama/Llama-3.1-8B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Update the shield register validation of Sambanova not to raise, but
only warn when a model is not available in the base url endpoint used,
also added warnings when model is not available in the base url endpoint
used
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
run starter distro with Sambanova enabled
# What does this PR do?
previously, developers who ran `./scripts/unit-tests.sh` would get
`asyncio-mode=auto`, which meant `@pytest.mark.asyncio` and
`@pytest_asyncio.fixture` were redundent. developers who ran `pytest`
directly would get pytest's default (strict mode), would run into errors
leading them to add `@pytest.mark.asyncio` / `@pytest_asyncio.fixture`
to their code.
with this change -
- `asyncio_mode=auto` is included in `pyproject.toml` making behavior
consistent for all invocations of pytest
- removes all redundant `@pytest_asyncio.fixture` and
`@pytest.mark.asyncio`
- for good measure, requires `pytest>=8.4` and `pytest-asyncio>=1.0`
## Test Plan
- `./scripts/unit-tests.sh`
- `uv run pytest tests/unit`
# What does this PR do?
Otherwise we can get old versions like 1.11 and experience this error:
```
ModuleNotFoundError: No module named 'opentelemetry.exporter.otlp.proto.http.metric_exporter'
```
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
COPY with chmod does not work, see
https://github.com/containers/buildah/issues/4614. Also Docker arguably
implements it.
Anyway, this command is not even needed since later don't we do:
```
RUN mkdir -p /.llama /.cache && chmod -R g+rw /app /.llama /.cache
```
And providers.d will get the right modes.
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
Build with CONTAINER_BINARY=podman and success
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
The current authorized sql store implementation does not respect
user.principal (only checks attributes). This PR addresses that.
## Test Plan
Added test cases to integration tests.
# What does this PR do?
the "rfc" directory has only a single document in it, and its the
original RFC for creating Llama Stack
simply the project directory structure by moving this into the "docs"
directory and renaming it to "original_rfc" to preserve the context of
the doc
## Why did you do this?
A simplified top-level directory structure helps keep the project
simpler and prevents misleading new contributors into thinking we use it
(we really don't)
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
Co-authored-by: raghotham <raghotham@gmail.com>
# What does this PR do?
"install.sh" is something that a general user might not use e.g. it is
specific to using the "ollama" inference provider
cleanup the top-level structure of the repo by moving it into the
"scripts" dir and updating the relevant references accordingly
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
"dev" dependencies were moved in pyproject.toml
typo with guidance around automatic doc generation
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
This PR refactors and the VectorIO backend logic for `sqlite-vec` and
adds unit tests and fixtures to make it easy to test both `sqlite-vec`
and `milvus`.
Key changes:
- `sqlite-vec` migrated to `kvstore` registry
- added in-memory cache for sqlite-vec to be consistent with `milvus`
- default fixtures moved to `conftest.py`
- removed redundant tests from sqlite`-vec`
- made `test_vector_io_openai_vector_stores.py` more easily extensible
## Test Plan
Unit tests added testing inline providers.
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
the project already had some config setup for https://pre-commit.ci/
this commit adds additional explicit fields
Closes#2711
**IMPORTANT:** A project maintainer must add `pre-commit.ci` to this
repo for this to work - this can be done via https://pre-commit.ci/
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
While investigating the `uv.lock` changes made in
https://github.com/meta-llama/llama-stack/pull/2695 I noticed several of
the pre-commit hook versions were out of date
This PR updates them and fixes some new `ruff` errors
---------
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
currently when logging the run yaml, if there are path objects in the
object they are represented as:
```
external_providers_dir: !!python/object/apply:pathlib.PosixPath
- '~'
- .llama
- providers.d
```
now, with a config.model_dump(mode="json"), it works properly
```
external_providers_dir: ~/.llama/providers.d
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
We are now automatically building the list of integration test to run.
In that process, eval and files and being tested now.
This is pending https://github.com/meta-llama/llama-stack/pull/2628
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Terminate server process for real.
## Test Plan
```ENABLE_OPENAI=openai LLAMA_STACK_CONFIG=server:starter pytest -v tests/integration/agents/test_openai_responses.py --text-model "gpt-4o-mini" -vv -s -k 'test_list_response_input_items[' && lsof -ti:8321```
observe no process printed anymore
# What does this PR do?
`llama stack run starter` in conda environment fails with ' --config is
required for venv and conda environments' because it is passed as
--template and start_stack.sh doesn't process template.
## Test Plan
`llama stack run starter`
# What does this PR do?
`python-jose` recommends using the `cryptography` backend in their
installation docs:
https://github.com/mpdavis/python-jose?tab=readme-ov-file#cryptographic-backends
This PR modifies the LLS dependencies to use this instead of the current
`native-python`
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
We are now testing the safety capability with the starter image. This
includes a few changes:
* Enable the safety integration test
* Relax the shield model requirements from llama-guard to make it work
with llama-guard3:8b coming from Ollama
* Expose a shield for each inference provider in the starter distro. The
shield will only be registered if the provider is enabled.
Closes: https://github.com/meta-llama/llama-stack/issues/2528
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
Updates some broken or outdated links pointing to the Android Demo App
Signed-off-by: Jorge Garcia Oncins <jgarciao@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack/apis`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
- Fix constructor call missing files_api parameter
- Add kvstore field to MilvusVectorIOConfig
- Resolves#2626
# What does this PR do?
[https://github.com/meta-llama/llama-stack/issues/2626]
## Problem
The `MilvusVectorIOAdapter` fails to initialize due to two missing
configuration issues:
1. Missing `files_api` parameter in the constructor call
2. Missing `kvstore` field in the `MilvusVectorIOConfig` class
## Root Cause
1. The adapter constructor expects 3 parameters `(config, inference_api,
files_api)` but the `get_adapter_impl` function only passes 2 parameters
2. The `MilvusVectorIOConfig` class lacks the `kvstore` field that the
adapter's `initialize()` method expects for metadata persistence
## Solution
- Added `files_api = deps.get(Api.files, None)` to safely retrieve files
API from dependencies
- Pass the files_api parameter to MilvusVectorIOAdapter constructor
- Added `kvstore: KVStoreConfig | None = None` field to
MilvusVectorIOConfig
- Maintains backward compatibility since both files_api and kvstore can
be None
Closes#2626
## Test Plan
- [x] Tested with Milvus configuration - server starts successfully
```yaml
vector_io:
- provider_id: milvus
provider_type: remote::milvus
config:
uri: http://localhost:19530
token: root:Milvus
kvstore:
type: sqlite
namespace: null
db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/remote-vllm}/milvus_store.db
```
- [x] Vector operations work as expected
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types.shared_params.document import Document as RAGDocument
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger as AgentEventLogger
import os
endpoint = os.getenv("LLAMA_STACK_ENDPOINT")
model = os.getenv("INFERENCE_MODEL")
# Initialize the client
client = LlamaStackClient(base_url=endpoint)
vector_db_id = "my_documents"
response = client.vector_dbs.register(
vector_db_id=vector_db_id,
embedding_model="all-MiniLM-L6-v2",
embedding_dimension=384,
provider_id="milvus",
)
urls = ["getting_started/Red_Hat_AI_Inference_Server-3.0-Getting_started-en-US.pdf", "vllm_server_arguments/Red_Hat_AI_Inference_Server-3.0-vLLM_server_arguments-en-US.pdf"]
documents = [
RAGDocument(
document_id=f"num-{i}",
content=f"https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/3.0/pdf/{url}",
mime_type="application/pdf",
metadata={},
)
for i, url in enumerate(urls)
]
client.tool_runtime.rag_tool.insert(
documents=documents,
vector_db_id=vector_db_id,
chunk_size_in_tokens=512,
)
rag_agent = Agent(
client,
model=model,
# Define instructions for the agent (system prompt)
instructions="You are a helpful assistant",
enable_session_persistence=False,
# Define tools available to the agent
tools=[
{
"name": "builtin::rag/knowledge_search",
"args": {
"vector_db_ids": [vector_db_id],
},
}
],
)
session_id = rag_agent.create_session("test-session")
user_prompts = [
"How to start the AI Inference Server container image? use the knowledge_search tool to get information.",
]
for prompt in user_prompts:
print(f"User> {prompt}")
response = rag_agent.create_turn(
messages=[{"role": "user", "content": prompt}],
session_id=session_id,
)
for log in AgentEventLogger().log(response):
log.print()
```
server logs:
```
INFO 2025-07-04 22:18:30,385 __main__:577 server: Listening on ['::', '0.0.0.0']:5000
INFO: Started server process [769725]
INFO: Waiting for application startup.
INFO 2025-07-04 22:18:30,390 __main__:158 server: Starting up
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
INFO 2025-07-04 22:18:52,193 llama_stack.distribution.routing_tables.common:200 core: Setting owner for vector_db 'my_documents' to
20:18:52.194 [START] /v1/vector-dbs
INFO: 192.168.1.249:64170 - "POST /v1/vector-dbs HTTP/1.1" 200 OK
20:18:52.216 [END] /v1/vector-dbs [StatusCode.OK] (21.89ms)
20:18:52.222 [START] /v1/tool-runtime/rag-tool/insert
INFO 2025-07-04 22:18:56,265 llama_stack.providers.utils.inference.embedding_mixin:102 uncategorized: Loading sentence transformer for
all-MiniLM-L6-v2...
WARNING 2025-07-04 22:18:59,214 opentelemetry.trace:537 uncategorized: Overriding of current TracerProvider is not allowed
INFO 2025-07-04 22:18:59,339 sentence_transformers.SentenceTransformer:219 uncategorized: Use pytorch device_name: cuda:0
INFO 2025-07-04 22:18:59,340 sentence_transformers.SentenceTransformer:227 uncategorized: Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO: 192.168.1.249:64170 - "POST /v1/tool-runtime/rag-tool/insert HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "POST /v1/agents HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "GET /v1/tools?toolgroup_id=builtin%3A%3Arag%2Fknowledge_search HTTP/1.1" 200 OK
INFO: 192.168.1.249:64170 - "POST /v1/agents/b1f6f063-1691-4780-8d9e-facd81708b91/session HTTP/1.1" 200 OK
20:19:01.834 [END] /v1/tool-runtime/rag-tool/insert [StatusCode.OK] (9612.06ms)
20:19:01.839 [START] /v1/agents
INFO: 192.168.1.249:64170 - "POST /v1/agents/b1f6f063-1691-4780-8d9e-facd81708b91/session/d2706302-bb54-421d-a890-5e25df9cb47f/turn HTTP/1.1" 200 OK
20:19:01.839 [END] /v1/agents [StatusCode.OK] (0.18ms)
20:19:01.844 [START] /v1/tools
INFO 2025-07-04 22:19:01,853 llama_stack.providers.remote.inference.vllm.vllm:330 uncategorized: Initializing vLLM client with
base_url=http://192.168.1.183:8080/v1
20:19:01.858 [END] /v1/tools [StatusCode.OK] (14.92ms)
20:19:01.868 [START] /v1/agents/{agent_id}/session
20:19:01.868 [END] /v1/agents/{agent_id}/session [StatusCode.OK] (0.37ms)
20:19:01.873 [START] /v1/agents/{agent_id}/session/{session_id}/turn
20:19:01.885 [START] inference
20:19:05.506 [END] inference [StatusCode.OK] (3621.19ms)
INFO 2025-07-04 22:19:05,537 llama_stack.providers.inline.agents.meta_reference.agent_instance:890 agents: executing tool call: knowledge_search
with args: {'query': 'How to start the AI Inference Server container image'}
20:19:05.538 [START] tool_execution
20:19:05.928 [END] tool_execution [StatusCode.OK] (390.08ms)
20:19:05.538 [INFO] executing tool call: knowledge_search with args: {'query': 'How to start the AI Inference Server container image'}
20:19:05.935 [START] inference
20:19:17.539 [END] inference [StatusCode.OK] (11603.76ms)
20:19:17.560 [END] /v1/agents/{agent_id}/session/{session_id}/turn [StatusCode.OK] (15686.62ms)
```
- [x] No regressions in functionality
- [x] Configuration properly accepts kvstore settings
---------
Co-authored-by: Peter Gustafsson <peter.gustafsson6@gmail.com>
Co-authored-by: raghotham <rsm@meta.com>
Co-authored-by: Francisco Arceo <farceo@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR adds static type coverage to `llama-stack`
Part of https://github.com/meta-llama/llama-stack/issues/2647
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
- fix env variables
- use gpu for vllm
- add eks/apply.py for aws
- add template to set hf secret
## Test Plan
bash apply.sh
Co-authored-by: Eric Huang <erichuang@fb.com>
# What does this PR do?
- Enabling Unit tests for Milvus to start to test OpenAI compatibility
and fixing a few bugs.
- Also fixed an inconsistency in the Milvus config between remote and
inline.
- Added pymilvus to extras for testing in CI
I'm going to refactor this later to include the other inline providers
so that we can catch issues sooner.
I have another PR where I've been testing to find other bugs in the
implementation (and required changes drafted here:
https://github.com/meta-llama/llama-stack/pull/2617).
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
currently when a template is used, we still use `--config`.
`server.py` has a dedicated `--template` flag and logic, use that
instead
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The `nvidia` distro was previously collapsed into the `starter` distro.
However, the `nvidia` distro was setup specifically to use NVIDIA NeMo
microservices as providers for all APIs and not just inference, which
means it was doing quite a bit more than what the `starter` distro
covers today.
We should work with our friends at NVIDIA to determine the best place to
maintain this distro long-term, but for now this restores the `nvidia`
distro and its docs back to where they were so that things continue to
work for their users.
## Test Plan
I ensure the `nvidia` distro could build, and run at least to the point
of complaining that I didn't provide the necessary API keys.
```
uv run llama stack build --template nvidia --image-type venv
uv run llama stack run llama_stack/templates/nvidia/run.yaml
```
I also made sure the docs website built and looks reasonable, with the
`nvidia` distro docs at the same URL it was previously (because it has
incoming links from official NVIDIA NeMo docs, among other places).
```
uv run --group docs sphinx-autobuild docs/source docs/build/html --write-all
```
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
Rather than pointing to a dir in `llama_stack/templates` (the repo
directory)
we should point to `$BUILD_DIR/IMAGE_NAME-run.yaml`
(`~/.llama/distributions/IMAGE_NAME/IMAGE_NAME-run.yaml`)
currently we are printing:
```
You can find the newly-built template here: /Users/charliedoern/projects/Documents/llama-stack/llama_stack/templates/starter/run.yaml
You can run the new Llama Stack distro via: llama stack run /Users/charliedoern/projects/Documents/llama-stack/llama_stack/templates/starter/run.yaml --image-type venv
```
but should be printing things like:
```
You can find the newly-built template here: /Users/charliedoern/.llama/distributions/starter/starter-run.yaml
You can run the new Llama Stack distro via: llama stack run /Users/charliedoern/.llama/distributions/starter/starter-run.yaml --image-type venv
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The CI was failing but the error was eaten by the pipe. Now we run the
task with pipefail.
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
- we are using `all-minilm:l6-v2` but the model we download from ollama
is `all-minilm:latest`
latest: https://ollama.com/library/all-minilm:latest 1b226e2802db
l6-v2: https://ollama.com/library/all-minilm:l6-v2 pin 1b226e2802db
- even currently they are exactly the same model but if
[all-minilm:l12-v2](https://ollama.com/library/all-minilm:l12-v2) is
updated, "latest" might not be the same for l6-v2.
- the only change in this PR is pin the model id in ollama
- also update detailed_tutorial with "starter" to replace deprecated
"ollama".
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
```
>INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
>llama stack build --run --template ollama --image-type venv
...
Build Successful!
You can find the newly-built template here: /home/wenzhou/zdtsw-forking/lls/llama-stack/llama_stack/templates/ollama/run.yaml
....
- metadata:
embedding_dimension: 384
model_id: all-MiniLM-L6-v2
model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
- embedding
provider_id: ollama
provider_model_id: all-minilm:l6-v2
...
```
test
```
>llama-stack-client inference chat-completion --message "Write me a 2-sentence poem about the moon"
INFO:httpx:HTTP Request: GET http://localhost:8321/v1/models "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://localhost:8321/v1/openai/v1/chat/completions "HTTP/1.1 200 OK"
OpenAIChatCompletion(
id='chatcmpl-04f99071-3da2-44ba-a19f-03b5b7fc70b7',
choices=[
OpenAIChatCompletionChoice(
finish_reason='stop',
index=0,
message=OpenAIChatCompletionChoiceMessageOpenAIAssistantMessageParam(
role='assistant',
content="Here is a 2-sentence poem about the moon:\n\nSilver crescent in the midnight sky,\nLuna's gentle face, a beauty to the eye.",
name=None,
tool_calls=None,
refusal=None,
annotations=None,
audio=None,
function_call=None
),
logprobs=None
)
],
created=1751644429,
model='llama3.2:3b-instruct-fp16',
object='chat.completion',
service_tier=None,
system_fingerprint='fp_ollama',
usage={'completion_tokens': 33, 'prompt_tokens': 36, 'total_tokens': 69, 'completion_tokens_details': None, 'prompt_tokens_details': None}
)
```
---------
Signed-off-by: Wen Zhou <wenzhou@redhat.com>