# What does this PR do?
## Test Plan
---------
Co-authored-by: Francisco Arceo <farceo@redhat.com>
# What does this PR do?
Fixes a couple of errors in the PVC/Secret setup and adds context about the expected Hugging Face token.
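For context, here is a minimal sketch of creating such a token Secret with the kubernetes Python client; the secret and key names are hypothetical placeholders, and the docs themselves use kubectl/YAML.
```python
from kubernetes import client, config

config.load_kube_config()  # uses your current kubeconfig context
core = client.CoreV1Api()

# Hypothetical names: "hf-token-secret" / "token" are illustrative only.
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="hf-token-secret"),
    string_data={"token": "hf_xxx"},  # replace with your Hugging Face token
)
core.create_namespaced_secret(namespace="default", body=secret)
```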
## Test Plan
# What does this PR do?
Another doc enhancement for
https://github.com/meta-llama/llama-stack/issues/1818
Summary of changes:
- `docs/source/distributions/configuration.md`
- Updated dropdown title to include a more user-friendly description.
- `docs/_static/css/my_theme.css`
- Added styling for `<h3>` elements to set a normal font weight.
- `docs/source/distributions/starting_llama_stack_server.md`
- Changed section headers from bold text to proper markdown headers
(e.g., `##`).
- Improved descriptions for starting Llama Stack server using different
methods (library, container, conda, Kubernetes).
- Enhanced clarity and structure by converting instructions into
markdown headers and improved formatting.
- `docs/source/getting_started/index.md`
- Major restructuring of the "Quick Start" guide:
- Added new introductory section for Llama Stack and its capabilities.
- Reorganized steps into clearer subsections with proper markdown
headers.
- Replaced dropdowns with tabbed content for OS-specific instructions.
- Added detailed steps for setting up and running the Llama Stack server
and client.
- Introduced new sections for running basic inference and building
agents.
- Enhanced readability and visual structure with emojis, admonitions,
and examples.
- `docs/source/providers/index.md`
- Updated the list of LLM inference providers to include "Ollama."
- Expanded the list of vector databases to include "SQLite-Vec."
Let me know if you need further details!
## Test Plan
Renders locally, included screenshot.
# Documentation
For https://github.com/meta-llama/llama-stack/issues/1818
<img width="1332" alt="Screenshot 2025-04-09 at 11 07 12 AM"
src="https://github.com/user-attachments/assets/c106efb9-076c-4059-a4e0-a30fa738585b"
/>
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
Add content on using an AMD GPU as the vLLM server. Split the original section into two sub-chapters:
1. AMD vLLM server
2. NVIDIA vLLM server (original)
# What does this PR do?
## Test Plan
---------
Signed-off-by: Alex He <alehe@amd.com>
# What does this PR do?
## Test Plan
export MODEL=accounts/fireworks/models/llama4-scout-instruct-basic;
LLAMA_STACK_CONFIG=verification pytest -s -v tests/integration/inference
--vision-model $MODEL --text-model $MODEL
# What does this PR do?
The goal of this PR is to make the pages easier to navigate by surfacing
the child pages on the navbar, updating some of the copy, moving some of
the files around.
Some changes:
1. Clarifying Titles
2. Restructuring "Distributions" more formally in its own page to be
consistent with Providers and adding some clarity to the child pages to
surface them and make them easier to navigate
3. Updated the Sphinx config to not collapse navigation by default
4. Updated the copyright year to be calculated dynamically (a sketch of both config changes follows this list)
5. Moved `docs/source/distributions/index.md` ->
`docs/source/distributions/starting_llama_stack_server.md`
Another for https://github.com/meta-llama/llama-stack/issues/1815
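For reference, the two Sphinx config changes amount to something like this sketch in `docs/source/conf.py`; exact option names depend on the theme in use, so treat it as illustrative.
```python
# docs/source/conf.py (sketch)
from datetime import datetime

# Compute the copyright year dynamically instead of hard-coding it.
copyright = f"{datetime.now().year}, Meta"

html_theme_options = {
    "collapse_navigation": False,  # keep child pages visible in the navbar
}
```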
## Test Plan
Tested locally and confirmed the pages build (example screenshots below).
## Documentation
### Before:

### After:

Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This is the second attempt to switch to system packages by default, now with a hack to detect a conda environment, in which case the conda image type is used.
Note: Conda will only be used when --image-name is unset *and*
CONDA_DEFAULT_ENV is set. This means that users without conda will
correctly fall back to using system packages when no --image-* arguments
are passed at all.
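Roughly, the selection logic behaves like the sketch below; the function and return values are illustrative, not the actual implementation.
```python
import os


def resolve_image_type(image_type: str | None, image_name: str | None) -> str:
    """Illustrative sketch of the fallback described above."""
    if image_type:  # an explicit --image-type always wins
        return image_type
    if image_name is None and os.environ.get("CONDA_DEFAULT_ENV"):
        return "conda"  # conda hack: an active conda environment was detected
    return "system"  # otherwise fall back to system/environment packages
```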
## Test Plan
Uses virtualenv:
```
$ llama stack build --template ollama --image-type venv
$ llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using virtual environment: /home/ec2-user/src/llama-stack/schedule/.local
[...]
```
Uses system packages (virtualenv already initialized):
```
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
INFO 2025-03-27 20:46:22,882 llama_stack.cli.stack.run:142 server: No image type or image name provided. Assuming environment packages.
[...]
```
Attempt to run from environment packages without necessary packages
installed:
```
$ python -m venv barebones
$ . ./barebones/bin/activate
$ pip install -e . # to install llama command
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
ModuleNotFoundError: No module named 'fastapi'
```
^ failed as expected because the environment doesn't have necessary
packages installed.
Now install some packages in the new environment:
```
$ pip install fastapi opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp aiosqlite ollama openai datasets faiss-cpu mcp autoevals
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
Now see if setting CONDA_DEFAULT_ENV will change what happens by
default:
```
$ export CONDA_DEFAULT_ENV=base
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using conda environment: base
Conda environment base does not exist.
[...]
```
---------
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
- Hide the distro doc (Docker needs to be thoroughly tested).
## Test Plan
- docs
# What does this PR do?
* Fix location of `run.yaml` relative to the cloned llama stack
repository
* Drop `-it` from `docker run` commands as it's not needed when running services
## Test Plan
* Verified running the llama stack following updated instruction
CC: @ashwinb
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
# What does this PR do?
This PR adds support for NVIDIA's NeMo Customizer API to the Llama Stack
post-training module. The integration enables users to fine-tune models
using NVIDIA's cloud-based customization service through a consistent
Llama Stack interface.
## Test Plan
Yet to be done
Things pending under this PR:
- [x] Integration of fine-tuned model(new checkpoint) for inference with
nvidia llm distribution
- [x] distribution integration of API
- [x] Add test cases for customizer (In Progress)
- [x] Documentation
```
LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/post_training/test_supervised_fine_tuning.py
============================================================================================================================================================================ test session starts =============================================================================================================================================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0 -- /home/ubuntu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.0', 'Platform': 'Linux-6.8.0-1021-gcp-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'nbval': '0.11.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3'}}
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: nbval-0.11.0, metadata-3.1.1, anyio-4.8.0, html-4.1.1, asyncio-0.25.3
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_post_training_provider_registration[txt=8B] PASSED [ 50%]
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_list_training_jobs[txt=8B] PASSED [100%]
======================================================================================================================================================================== 2 passed, 1 warning in 0.10s ========================================================================================================================================================================
```
cc: @mattf @dglogo @sumitb
---------
Co-authored-by: Ubuntu <ubuntu@llama-stack-customizer-dev-inst-2tx95fyisatvlic4we8hidx5tfj.us-central1-a.c.brevdevprod.internal>
# What does this PR do?
* Removes the use of `huggingface-cli`
* Simplifies HF cache mount path
* Simplifies vLLM server startup command
* Separates PVC/secret creation from deployment/service
* Fixes a typo: "pod" should be "deployment"
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
Web updates to point to the latest releases for the Mobile SDK:
- Point to the `latest-release` branch for the mobile SDK repos to minimize the number of change points on the site.
- Update some instructions.
# What does this PR do?
- Fix issue w/ passthrough provider
## Test Plan
llama stack run
# What does this PR do?
Remove Llama-3.2-1B-Instruct for Fireworks as it no longer appears to be hosted on the website.
## Test Plan
python distro_codegen.py
# What does this PR do?
Setting `$LLAMA_STACK_LOG_FILE` will pipe the logs to a file as well as stdout. This is done using a logging `FileHandler`.
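For illustration, the general pattern is the following sketch, not the exact code in this change:
```python
import logging
import os

logger = logging.getLogger("llama_stack")
logger.addHandler(logging.StreamHandler())  # logs always go to the console

# Mirror logs to a file as well when the env var is set.
log_file = os.environ.get("LLAMA_STACK_LOG_FILE")
if log_file:
    logger.addHandler(logging.FileHandler(log_file))
```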
Signed-off-by: Charlie Doern <cdoern@redhat.com>
This is unfortunate because `sqlite-vec` seems promising. But its pip package is not quite complete. It does not ship an arm64 binary (or possibly even lacks 64-bit builds), which causes the arm64 container to fail with:
```
File "/usr/local/lib/python3.10/site-packages/sqlite_vec/init.py", line 17, in load
conn.load_extension(loadable_path())
sqlite3.OperationalError: /usr/local/lib/python3.10/site-packages/sqlite_vec/vec0.so: wrong ELF class: ELFCLASS32
```
To get around this, I tried to install from source via `uv pip install sqlite-vec --no-binary=sqlite-vec`; however, it also lacks a source distribution, which makes that impossible.
## Test Plan
Build the container locally using:
```bash
LLAMA_STACK_DIR=. llama stack build --template ollama --image-type container
```
Run the container as:
```
podman run --privileged -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.containers.internal:11434 \
-v ~/local/llama-stack:/app/llama-stack-source \
localhost/distribution-ollama:dev --port $LLAMA_STACK_PORT
```
Verify the container starts up correctly. Without this patch, it would
encounter the ELFCLASS32 error.
# What does this PR do?
It should be changed in this PR:
https://github.com/meta-llama/llama-stack/pull/1190/files#diff-53e3f35ced54ee5e57dc8b0d3b04770ed84f2f6434c6f492f42569b3c2810ecd
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
Using `formatter_class=argparse.ArgumentDefaultsHelpFormatter` displays `(default: DEFAULT_VALUE)` for each flag. Add this formatter class to `build` and `run` to show users default values like `conda`, `8321`, etc.
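A minimal standalone example of the formatter's effect (not the actual CLI code):
```python
import argparse

parser = argparse.ArgumentParser(
    prog="llama stack run",
    formatter_class=argparse.ArgumentDefaultsHelpFormatter,  # appends "(default: ...)"
)
parser.add_argument("--port", type=int, default=8321, help="Port to run the server on")
parser.add_argument(
    "--image-type",
    choices=["conda", "container", "venv"],
    default="conda",
    help="Image Type used during the build",
)
parser.print_help()  # each option's help now ends with e.g. "(default: 8321)"
```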
## Test Plan
Ran locally with the following output:
before:
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
[--image-type {conda,container,venv}]
config
Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
positional arguments:
config Path to config file to use for the run
options:
-h, --help show this help message and exit
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. Defaults to 8321
--image-name IMAGE_NAME
Name of the image to run. Defaults to the current conda environment
--disable-ipv6 Disable IPv6 support
--env KEY=VALUE Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times.
--tls-keyfile TLS_KEYFILE
Path to TLS key file for HTTPS
--tls-certfile TLS_CERTFILE
Path to TLS certificate file for HTTPS
--image-type {conda,container,venv}
Image Type used during the build. This can be either conda or container or venv.
```
after:
```
llama stack run --help
usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--disable-ipv6] [--env KEY=VALUE] [--tls-keyfile TLS_KEYFILE] [--tls-certfile TLS_CERTFILE]
[--image-type {conda,container,venv}]
config
Start the server for a Llama Stack Distribution. You should have already built (or downloaded) and configured the distribution.
positional arguments:
config Path to config file to use for the run
options:
-h, --help show this help message and exit
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. (default: 8321)
--image-name IMAGE_NAME
Name of the image to run. Defaults to the current conda environment (default: None)
--disable-ipv6 Disable IPv6 support (default: False)
--env KEY=VALUE Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: [])
--tls-keyfile TLS_KEYFILE
Path to TLS key file for HTTPS (default: None)
--tls-certfile TLS_CERTFILE
Path to TLS certificate file for HTTPS (default: None)
--image-type {conda,container,venv}
Image Type used during the build. This can be either conda or container or venv. (default: conda)
```
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
The client output has changed, so the documented output is now incorrect:
458e20702b/src/llama_stack_client/lib/cli/models/models.py (L52)
and
https://github.com/meta-llama/llama-stack/pull/1348#pullrequestreview-2654971315
Per the previous discussion, there is no need to maintain this output, so remove it.
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
Currently, logcat is not documented for build && run. Add documentation in `building_distro.md`.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
All of the tests from `llama_stack/providers/tests/` are now moved to
`tests/integration`.
I converted the `tools`, `scoring`, and `datasetio` tests to use the API. However, `eval` and `post_training` proved to be a bit challenging, so I am leaving those. I think `post_training` should be relatively straightforward as well.
As part of this, I noticed that the `wolfram_alpha` tool wasn't added to some of our commonly used distros, so I added it. I am going to remove a lot of code duplication from distros next, so while this looks like a one-off right now, it will go away and be applied uniformly across all distros.
# What does this PR do?
## Test Plan
---------
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
It would be better to mention the env var usage in the help text.
```
before:
$ llama stack run --help
--port PORT Port to run the server on. Defaults to 8321
after
$ llama stack run --help
--port PORT Port to run the server on. It can also be passed via the env var LLAMA_STACK_PORT. Defaults to 8321
```
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
21ec67356c/distributions
It seems the `s` was missed.
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
Since the `--downloaded` option has been released, update the related documents.
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
55eb257459/llama_stack/cli/stack/run.py (L22)
55eb257459/llama_stack/cli/stack/_build.py (L103)
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
# What does this PR do?
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>
Each model known to the system has two identifiers:
- the `provider_resource_id` (what the provider calls it) -- e.g.,
`accounts/fireworks/models/llama-v3p1-8b-instruct`
- the `identifier` (`model_id`) under which it is registered and gets
routed to the appropriate provider.
We have so far used the HuggingFace repo alias as the standardized
identifier you can use to refer to the model. So in the above example,
we'd use `meta-llama/Llama-3.1-8B-Instruct` as the name under which it
gets registered. This makes it convenient for users to refer to these
models across providers.
However, we forgot to register the _actual_ provider model ID also. You
should be able to route via `provider_resource_id` also, of course.
This change fixes this (somewhat grave) omission.
*Note*: this change is additive -- more aliases work now compared to
before.
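For illustration, a quick sketch of what the additive aliasing enables, assuming a Fireworks-backed stack is running locally (exact client call shapes may differ slightly by version):
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Both names should now route to the same underlying Fireworks model:
for model_id in (
    "meta-llama/Llama-3.1-8B-Instruct",                  # HuggingFace repo alias
    "accounts/fireworks/models/llama-v3p1-8b-instruct",  # provider_resource_id
):
    response = client.inference.chat_completion(
        model_id=model_id,
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(model_id, "->", response.completion_message.content)
```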
## Test Plan
Run the following for distro=(ollama fireworks together)
```
LLAMA_STACK_CONFIG=$distro \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=meta-llama/Llama-3.1-8B-Instruct --vision-inference-model=""
```
Groq has never supported raw completions anyhow, so this makes it easier to switch it to LiteLLM. Our entire test suite passes.
I also updated all the openai-compat providers so they work with API keys passed from headers via `provider_data`.
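For illustration, a sketch of how a client might supply the key per request; the header name and the `groq_api_key` field are assumptions based on the provider_data convention, so verify them against your client version:
```python
import json

from llama_stack_client import LlamaStackClient

# Sketch: pass the provider API key via the provider-data header instead of
# configuring it in the server's run.yaml. Header/field names are assumptions.
client = LlamaStackClient(
    base_url="http://localhost:8321",
    default_headers={
        "X-LlamaStack-Provider-Data": json.dumps({"groq_api_key": "gsk_..."}),
    },
)
```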
## Test Plan
```bash
LLAMA_STACK_CONFIG=groq \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=groq/llama-3.3-70b-versatile --vision-inference-model=""
```
Also tested (openai, anthropic, gemini) providers. No regressions.
# What does this PR do?
Model context protocol (MCP) allows for remote tools to be connected
with Agents. The current Ollama provider does not support it. This PR
adds necessary code changes to ensure that the integration between
Ollama backend and MCP works.
This PR is an extension of #816 for Ollama.
## Test Plan
1. Run llama-stack server with the command:
```
llama stack build --template ollama --image-type conda
llama stack run ./templates/ollama/run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
```
2. Run the sample client agent with MCP tool:
```
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig
from llama_stack_client.types.shared_params.url import URL
from llama_stack_client import LlamaStackClient
from termcolor import cprint

## Start the local MCP server
# git clone https://github.com/modelcontextprotocol/python-sdk
# Follow instructions to get the env ready
# cd examples/servers/simple-tool
# uv run mcp-simple-tool --transport sse --port 8000

# Connect to the llama stack server
base_url = "http://localhost:8321"
model_id = "meta-llama/Llama-3.2-3B-Instruct"
client = LlamaStackClient(base_url=base_url)

# Register MCP tools
client.toolgroups.register(
    toolgroup_id="mcp::filesystem",
    provider_id="model-context-protocol",
    mcp_endpoint=URL(uri="http://localhost:8000/sse"),
)

# Define an agent with MCP toolgroup
agent_config = AgentConfig(
    model=model_id,
    instructions="You are a helpful assistant",
    toolgroups=["mcp::filesystem"],
    input_shields=[],
    output_shields=[],
    enable_session_persistence=False,
)
agent = Agent(client, agent_config)
user_prompts = [
    "Fetch content from https://www.google.com and print the response"
]

# Run a session with the agent
session_id = agent.create_session("test-session")
for prompt in user_prompts:
    cprint(f"User> {prompt}", "green")
    response = agent.create_turn(
        messages=[
            {
                "role": "user",
                "content": prompt,
            }
        ],
        session_id=session_id,
    )
    for log in EventLogger().log(response):
        log.print()
```
# Documentation
The file docs/source/distributions/self_hosted_distro/ollama.md is
updated to indicate the MCP tool runtime availability.
Signed-off-by: Shreyanand <shanand@redhat.com>
# What does this PR do?
Create a distribution template using Groq as inference provider.
Link to issue: https://github.com/meta-llama/llama-stack/issues/958
## Test Plan
Run `python llama_stack/scripts/distro_codegen.py` to generate run.yaml
and build.yaml
Test the newly created template by running
`llama stack build --template <template-name>`
`llama stack run <template-name>`
Embedding models are tiny and can be pulled on-demand. Let's do that so
the user doesn't have to do "yet another thing" to get themselves set
up.
Thanks @hardikjshah for the suggestion.
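For illustration, the on-demand pull amounts to something like this sketch using the `ollama` Python package; it is not the actual stack code:
```python
import ollama  # assumes the ollama Python client is available, as in the distro

EMBEDDING_MODEL = "all-minilm:latest"

# Pulling is effectively a no-op when the model is already present, so the
# stack can simply ensure the embedding model exists at startup.
print(f"Pulling embedding model `{EMBEDDING_MODEL}`")
ollama.pull(EMBEDDING_MODEL)
```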
Also fixed a missing build dependency (TODO: distro_codegen needs to actually check that the build template contains all providers mentioned in the run.yaml file).
## Test Plan
First run `ollama rm all-minilm:latest`.
Run `llama stack build --template ollama && llama stack run ollama --env INFERENCE_MODEL=llama3.2:3b-instruct-fp16`. See that it prints "Pulling embedding model `all-minilm:latest`" and that the stack starts up correctly. Verify that `ollama list` shows the model is correctly downloaded.
# What does this PR do?
add /v1/inference/embeddings implementation to NVIDIA provider
**Open topics** (a short sketch illustrating these gaps follows the list):
- *Asymmetric models*. NeMo Retriever includes asymmetric models, which embed differently depending on whether the input is destined for storage or for lookup against storage. The /v1/inference/embeddings API does not allow the user to indicate the type of embedding to perform. See https://github.com/meta-llama/llama-stack/issues/934
- *Truncation*. Embedding models typically have a limited context window; e.g., 1024 tokens is common, though newer models have 8k windows. When the input is larger than this window, the endpoint cannot perform its designed function. There are two options: (0) return an error so the user can reduce the input size and retry; (1) perform truncation for the user and proceed (common strategies are left or right truncation). Many users encounter context window size limits and will struggle to write reliable programs. This struggle is especially acute without access to the model's tokenizer. The /v1/inference/embeddings API does not allow the user to delegate truncation policy. See https://github.com/meta-llama/llama-stack/issues/933
- *Dimensions*. "Matryoshka" embedding models are available. They allow users to control the number of embedding dimensions the model produces. This is a critical feature for managing storage constraints. Embeddings of 1024 dimensions that achieve 95% recall for an application may not be worth the storage cost if 512 dimensions can achieve 93% recall. Controlling embedding dimensions allows applications to determine their recall and storage tradeoffs. The /v1/inference/embeddings API does not allow the user to control the output dimensions. See https://github.com/meta-llama/llama-stack/issues/932
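As a sketch of the gap, today's call surface only takes the model and the contents; the commented parameters below are hypothetical, not part of the current API:
```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.embeddings(
    model_id="baai/bge-m3",
    contents=["Hello, world!"],
)
print(len(response.embeddings[0]))

# Hypothetical knobs the issues above ask for (NOT part of the current API):
#   input_type="query" | "passage"   # asymmetric models (#934)
#   truncation="none" | "end"        # truncation policy (#933)
#   dimensions=512                   # Matryoshka output size (#932)
```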
## Test Plan
- `llama stack run llama_stack/templates/nvidia/run.yaml`
- `LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v
tests/client-sdk/inference/test_embedding.py --embedding-model
baai/bge-m3`
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
We have support for embeddings in our Inference providers, but so far we
haven't done the final step of actually registering the known embedding
models and making sure they are extremely easy to use. This is one step
towards that.
## Test Plan
Run existing inference tests.
```bash
$ cd llama_stack/providers/tests/inference
$ pytest -s -v -k fireworks test_embeddings.py \
--inference-model nomic-ai/nomic-embed-text-v1.5 --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k together test_embeddings.py \
--inference-model togethercomputer/m2-bert-80M-8k-retrieval --env EMBEDDING_DIMENSION=784
$ pytest -s -v -k ollama test_embeddings.py \
--inference-model all-minilm:latest --env EMBEDDING_DIMENSION=784
```
The value of EMBEDDING_DIMENSION isn't actually used in these tests; it is merely used by the test fixtures to check whether the model is an LLM or an embedding model.
# What does this PR do?
Create a script for running all client-sdk tests on the Async Library client, with the option to generate a report.
## Test Plan
```
python llama_stack/scripts/run_client_sdk_tests.py --templates together fireworks --report
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
Before this change, `distro_codegen.py` would only work if the user
manually installed multiple provider-specific dependencies (see #1122).
Now, users can run `distro_codegen.py` without any provider-specific
dependencies because we avoid importing the entire provider
implementations just to get the config needed to build the provider
template.
Concretely, this mostly means moving the
MODEL_ALIASES (and related variants) definitions to a new models.py
class within the provider implementation for those providers that
require additional dependencies. It also meant moving a couple of
imports from top-level imports to inside `get_adapter_impl` for some
providers, which follows the pattern used by multiple existing
providers.
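For reference, the lazy-import pattern looks roughly like this sketch; the module and class names are illustrative, not the actual provider code:
```python
async def get_adapter_impl(config, _deps):
    # Import the heavy provider SDK only when the adapter is instantiated, so
    # distro_codegen.py can import the config module without extra dependencies.
    from .some_provider import SomeProviderInferenceAdapter  # hypothetical module

    impl = SomeProviderInferenceAdapter(config)
    await impl.initialize()
    return impl
```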
To ensure we don't regress and accidentally add new imports that cause
distro_codegen.py to fail, the stubbed-in pre-commit hook for
distro_codegen.py was uncommented and slightly tweaked to run via `uv
run python ...` to ensure it runs with only the project's default
dependencies and to run automatically instead of manually.
Lastly, this updates distro_codegen.py itself to keep track of paths it
might have changed and to only `git diff` those specific paths when
checking for changed files instead of doing a diff on the entire working
tree. The latter was overly broad and would require a user have no other
unstaged changes in their working tree, even if those unstaged changes
were unrelated to generated code. Now it only flags uncommitted changes
for paths distro_codegen.py actually writes to.
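Roughly, the narrower check works like this sketch (the paths are illustrative):
```python
import subprocess

# Only diff the files the generator actually wrote, instead of the whole tree.
changed_paths = [
    "llama_stack/templates/ollama/run.yaml",   # illustrative path
    "distributions/ollama/build.yaml",         # illustrative path
]
result = subprocess.run(["git", "diff", "--exit-code", "--", *changed_paths])
if result.returncode != 0:
    print("Generated distribution files are out of date; please commit the changes.")
```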
Our generated code was also out-of-date, presumably because of these
issues, so this commit also has some updates to the generated code
purely because it was out of sync, and the pre-commit hook now enforces
things to be updated.
(Closes #1122)
## Test Plan
I manually tested distro_codegen.py and the pre-commit hook to verify those work as expected, flagging any uncommitted changes and catching any imports that attempt to pull in provider-specific dependencies.
However, I do not have valid api keys to the impacted provider
implementations, and am unable to easily run the inference tests against
each changed provider. There are no functional changes to the provider
implementations here, but I'd appreciate a second set of eyes on the
changed import statements and moving of MODEL_ALIASES type code to a
separate models.py to ensure I didn't make any obvious errors.
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
From the code and its usage, there doesn't appear to be a need for the `--no-list-templates` option, and it also confuses users in the help text, so remove it.
```
$ llama stack build --no-list-templates
> Enter a name for your Llama Stack (e.g. my-local-stack):
$ llama stack build
> Enter a name for your Llama Stack (e.g. my-local-stack):
before:
$ llama stack build --help
--list-templates, --no-list-templates
Show the available templates for building a Llama Stack distribution (default: False)
after:
--list-templates Show the available templates for building a Llama Stack distribution
```
## Test Plan
Signed-off-by: reidliu <reid201711@gmail.com>
Co-authored-by: reidliu <reid201711@gmail.com>