# What does this PR do?
This PR adds the "TAVILY_SEARCH_API_KEY" option to the playground to
enable the use of the websearch tool.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
```
export TAVILY_SEARCH_API_KEY=***
streamlit run llama_stack/distribution/ui/app.py
```
Without this change the builtin websearch tool will fail due to missing
API key.
[//]: # (## Documentation)
Related to #1902
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
This PR adds an additional page to the playground called "Tools". This
page connects to a llama-stack server and lists all the available LLM
models, builtin tools and MCP tools in the sidebar. Users can select
whatever combination of model and tools they want from the sidebar for
their agent. Once the selections are made, users can chat with their
agent similarly to the RAG page and test out agent tool use.
closes#1902
## Test Plan
Ran the following commands with a llama-stack server and the updated
playground worked as expected.
```
export LLAMA_STACK_ENDPOINT="http://localhost:8321"
streamlit run llama_stack/distribution/ui/app.py
```
[//]: # (## Documentation)
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
**What does this PR do?**
This PR fixes a build issue with the Containerfile caused by missing
requirement `llama-stack`. It updates the Containerfile to include the
necessary requirements and upgrades the Python version to ensure
successful builds.
**Test Plan**
The updated Containerfile has been tested, and the build now completes
successfully with the required dependencies included.
# What does this PR do?
Fixes issue #1537 that causes "500 Internal Server Error" when
unregistering a toolgroup
# (Closes#1537 )
## Test Plan
```console
$ pytest -s -v tests/integration/tool_runtime/test_registration.py --stack-config=ollama --env INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
INFO 2025-03-14 21:15:03,999 tests.integration.conftest:41 tests: Setting DISABLE_CODE_SANDBOX=1 for macOS
/opt/homebrew/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
===================================================== test session starts =====================================================
platform darwin -- Python 3.10.16, pytest-8.3.5, pluggy-1.5.0 -- /opt/homebrew/opt/python@3.10/bin/python3.10
cachedir: .pytest_cache
rootdir: /Users/paolo/Projects/aiplatform/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.25.3, anyio-4.8.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 1 item
tests/integration/tool_runtime/test_registration.py::test_register_and_unregister_toolgroup[None-None-None-None-None] INFO 2025-03-14 21:15:04,478 llama_stack.providers.remote.inference.ollama.ollama:75 inference: checking
connectivity to Ollama at `http://localhost:11434`...
INFO 2025-03-14 21:15:05,350 llama_stack.providers.remote.inference.ollama.ollama:294 inference: Pulling embedding
model `all-minilm:latest` if necessary...
INFO: Started server process [78391]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:57424 - "GET /sse HTTP/1.1" 200 OK
INFO: 127.0.0.1:57434 - "GET /sse HTTP/1.1" 200 OK
INFO 2025-03-14 21:15:16,129 mcp.client.sse:51 uncategorized: Connecting to SSE endpoint: http://localhost:8000/sse
INFO: 127.0.0.1:57445 - "GET /sse HTTP/1.1" 200 OK
INFO 2025-03-14 21:15:16,146 mcp.client.sse:71 uncategorized: Received endpoint URL:
http://localhost:8000/messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b
INFO 2025-03-14 21:15:16,147 mcp.client.sse:140 uncategorized: Starting post writer with endpoint URL:
http://localhost:8000/messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO: 127.0.0.1:57447 - "POST /messages/?session_id=c5b6fc01f8dc4b5e80e38eb1c1b22a9b HTTP/1.1" 202 Accepted
INFO 2025-03-14 21:15:16,155 mcp.server.lowlevel.server:535 uncategorized: Processing request of type
ListToolsRequest
PASSED
=============================================== 1 passed, 4 warnings in 12.17s ================================================
```
---------
Signed-off-by: Paolo Dettori <dettori@us.ibm.com>
# What does this PR do?
closes#1853
## Test Plan
```
uv run llama stack build --image-type conda --image-name ollama --config llama_stack/templates/ollama/build.yaml
ollama pull llama3.2:3b
LLAMA_STACK_CONFIG=http://localhost:8321 uv run pytest tests/integration/inference/test_text_inference.py -v --text-model=llama3.2:3b
```
# What does this PR do?
Providers that live outside of the llama-stack codebase are now
supported.
A new property `external_providers_dir` has been added to the main
config and can be configured as follow:
```
external_providers_dir: /etc/llama-stack/providers.d/
```
Where the expected structure is:
```
providers.d/
inference/
custom_ollama.yaml
vllm.yaml
vector_io/
qdrant.yaml
```
Where `custom_ollama.yaml` is:
```
adapter:
adapter_type: custom_ollama
pip_packages: ["ollama", "aiohttp"]
config_class: llama_stack_ollama_provider.config.OllamaImplConfig
module: llama_stack_ollama_provider
api_dependencies: []
optional_api_dependencies: []
```
Obviously the package must be installed on the system, here is the
`llama_stack_ollama_provider` example:
```
$ uv pip show llama-stack-ollama-provider
Using Python 3.10.16 environment at: /Users/leseb/Documents/AI/llama-stack/.venv
Name: llama-stack-ollama-provider
Version: 0.1.0
Location: /Users/leseb/Documents/AI/llama-stack/.venv/lib/python3.10/site-packages
Editable project location: /private/var/folders/mq/rnm5w_7s2d3fxmtkx02knvhm0000gn/T/tmp.ZBHU5Ezxg4/ollama/llama-stack-ollama-provider
Requires:
Required-by:
```
Closes: https://github.com/meta-llama/llama-stack/issues/658
Signed-off-by: Sébastien Han <seb@redhat.com>
Add the content to use AMD GPU as the vLLM server. Split the original
part to two sub chapters,
1. AMD vLLM server
2. NVIDIA vLLM server (orignal)
# What does this PR do?
[Provide a short summary of what this PR does and why. Link to relevant
issues if applicable.]
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
---------
Signed-off-by: Alex He <alehe@amd.com>
Move the test_context.py under the main tests directory, and fix the
code.
The problem was that the function captures the initial values of the
context variables and then restores those same initial values before
each iteration. This means that any modifications made to the context
variables during iteration are lost when the next iteration starts.
Error was:
```
====================================================== FAILURES =======================================================
______________________________________ test_preserve_contexts_across_event_loops ______________________________________
@pytest.mark.asyncio
async def test_preserve_contexts_across_event_loops():
"""
Test that context variables are preserved across event loop boundaries with nested generators.
This simulates the real-world scenario where:
1. A new event loop is created for each streaming request
2. The async generator runs inside that loop
3. There are multiple levels of nested generators
4. Context needs to be preserved across these boundaries
"""
# Create context variables
request_id = ContextVar("request_id", default=None)
user_id = ContextVar("user_id", default=None)
# Set initial values
# Results container to verify values across thread boundaries
results = []
# Inner-most generator (level 2)
async def inner_generator():
# Should have the context from the outer scope
yield (1, request_id.get(), user_id.get())
# Modify one context variable
user_id.set("user-modified")
# Should reflect the modification
yield (2, request_id.get(), user_id.get())
# Middle generator (level 1)
async def middle_generator():
inner_gen = inner_generator()
# Forward the first yield from inner
item = await inner_gen.__anext__()
yield item
# Forward the second yield from inner
item = await inner_gen.__anext__()
yield item
request_id.set("req-modified")
# Add our own yield with both modified variables
yield (3, request_id.get(), user_id.get())
# Function to run in a separate thread with a new event loop
def run_in_new_loop():
# Create a new event loop for this thread
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
# Outer generator (runs in the new loop)
async def outer_generator():
request_id.set("req-12345")
user_id.set("user-6789")
# Wrap the middle generator
wrapped_gen = preserve_contexts_async_generator(middle_generator(), [request_id, user_id])
# Process all items from the middle generator
async for item in wrapped_gen:
# Store results for verification
results.append(item)
# Run the outer generator in the new loop
loop.run_until_complete(outer_generator())
finally:
loop.close()
# Run the generator chain in a separate thread with a new event loop
with ThreadPoolExecutor(max_workers=1) as executor:
future = executor.submit(run_in_new_loop)
future.result() # Wait for completion
# Verify the results
assert len(results) == 3
# First yield should have original values
assert results[0] == (1, "req-12345", "user-6789")
# Second yield should have modified user_id
assert results[1] == (2, "req-12345", "user-modified")
# Third yield should have both modified values
> assert results[2] == (3, "req-modified", "user-modified")
E AssertionError: assert (3, 'req-modified', 'user-6789') == (3, 'req-modified', 'user-modified')
E
E At index 2 diff: 'user-6789' != 'user-modified'
E
E Full diff:
E (
E 3,
E 'req-modified',
E - 'user-modified',
E + 'user-6789',
E )
tests/unit/distribution/test_context.py:155: AssertionError
-------------------------------------------------- Captured log call --------------------------------------------------
ERROR asyncio:base_events.py:1758 Task was destroyed but it is pending!
task: <Task pending name='Task-7' coro=<<async_generator_athrow without __name__>()>>
================================================== warnings summary ===================================================
.venv/lib/python3.10/site-packages/pydantic/fields.py:1042
/Users/leseb/Documents/AI/llama-stack/.venv/lib/python3.10/site-packages/pydantic/fields.py:1042: PydanticDeprecatedSince20: Using extra keyword arguments on `Field` is deprecated and will be removed. Use `json_schema_extra` instead. (Extra keys: 'contentEncoding'). Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.10/migration/
warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=============================================== short test summary info ===============================================
FAILED tests/unit/distribution/test_context.py::test_preserve_contexts_across_event_loops - AssertionError: assert (3, 'req-modified', 'user-6789') == (3, 'req-modified', 'user-modified')
At index 2 diff: 'user-6789' != 'user-modified'
Full diff:
(
3,
'req-modified',
- 'user-modified',
+ 'user-6789',
)
```
[//]: # (## Documentation)
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
This PR updates the [playground RAG
example](llama_stack/distribution/ui/page/playground/rag.py) so that the
agent is able to use its builtin conversation history. Here we are using
streamlit's `cache_resource` functionality to prevent the agent from
re-initializing after every interaction as well as storing its
session_id in the `session_state`. This allows the agent in the RAG
example to behave more closely to how it works using the python-client
directly.
[//]: # (If resolving an issue, uncomment and update the line below)
Closes#1869
## Test Plan
Without these changes, if you ask it "What is 2 + 2"? followed by the
question "What did I just ask?" It will provide an obviously incorrect
answer.
With these changes, you can ask the same series of questions and it will
provide the correct answer.
[//]: # (## Documentation)
Signed-off-by: Michael Clifford <mcliffor@redhat.com>
# What does this PR do?
## Test Plan
export MODEL=accounts/fireworks/models/llama4-scout-instruct-basic;
LLAMA_STACK_CONFIG=verification pytest -s -v tests/integration/inference
--vision-model $MODEL --text-model $MODEL
# What does this PR do?
Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.
Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.
## Test Plan
```
LLAMA_MODELS_DEBUG=1 \
with-proxy llama stack run meta-reference-gpu \
--env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
--env INFERENCE_CHECKPOINT_DIR=<DIR> \
--env MODEL_PARALLEL_SIZE=4 \
--env QUANTIZATION_TYPE=fp8_mixed
```
Start a server with and without quantization. Point integration tests to
it using:
```
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
Running full Tool Calling required some updates to work e2e.
- Remove `python_start` and `python_end` tags
- Tool Call messages and Tool Resposne messages should end with
`<|eom|>`
- System prompt needed updates
```
You are a helpful assisant who can can answer general questions or invoke tools when necessary.
In addition to tool calls, you should also augment your responses by using the tool outputs.
```
### Test Plan
- Start server with meta-reference
```
LLAMA_STACK_DISABLE_VERSION_CHECK=1 LLAMA_MODELS_DEBUG=1 INFERENCE_MODEL=meta-llama/$MODEL llama stack run meta-reference-gpu
```
- Added **NEW** tests with 5 test cases for multi-turn tool calls
```
pytest -s -v --stack-config http://localhost:8321 tests/integration/inference/test_text_inference.py --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
- Also verified all vision and agent tests pass
# What does this PR do?
- **chore: mypy for strong_typing**
- **chore: mypy for remote::vllm**
- **chore: mypy for remote::ollama**
- **chore: mypy for providers.datatype**
---------
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Don't return list for runtime tools. Instead return Response object for
pagination and consistency with other APIs.
---------
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
Move pagination logic from LocalFS and HuggingFace implementations into
a common helper function to ensure consistent pagination behavior across
providers. This reduces code duplication and centralizes pagination
logic in one place.
## Test Plan
Run this script:
```
from llama_stack_client import LlamaStackClient
# Initialize the client
client = LlamaStackClient(base_url="http://localhost:8321")
# Register a dataset
response = client.datasets.register(
purpose="eval/messages-answer", # or "eval/question-answer" or "post-training/messages"
source={"type": "uri", "uri": "huggingface://datasets/llamastack/simpleqa?split=train"},
dataset_id="my_dataset", # optional, will be auto-generated if not provided
metadata={"description": "My evaluation dataset"}, # optional
)
# Verify the dataset was registered by listing all datasets
datasets = client.datasets.list()
print(f"Registered datasets: {[d.identifier for d in datasets]}")
# You can then access the data using the datasetio API
# rows = client.datasets.iterrows(dataset_id="my_dataset", start_index=1, limit=2)
rows = client.datasets.iterrows(dataset_id="my_dataset")
print(f"Data: {rows.data}")
```
And play with `start_index` and `limit`.
[//]: # (## Documentation)
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
https://github.com/meta-llama/llama-stack/pull/1828 removed
__root_span__ attribute which is still needed
## Test Plan
added telemetry integration test
LLAMA_STACK_CONFIG=http://localhost:5001 pytest -s -v
tests/integration/telemetry --safety-shield meta-llama/Llama-Guard-3-8B
--text-model accounts/fireworks/models/llama-v3p3-70b-instruct
# What does this PR do?
This PR converts blocking Milvus Client calls to non-blocking.
Another one for https://github.com/meta-llama/llama-stack/issues/1489
## Test Plan
I ran the integration tests from
https://github.com/meta-llama/llama-stack/pull/1467 with:
```python
pytest -s -v tests/integration/vector_io/test_vector_io.py \
--stack-config inference=sentence-transformers,vector_io=inline::milvus \
--embedding-model all-miniLM-L6-V2 --env MILVUS_DB_PATH=/tmp/moo.db
INFO 2025-03-28 21:35:22,726 tests.integration.conftest:41 tests: Setting DISABLE_CODE_SANDBOX=1 for macOS
/Users/farceo/dev/llama-stack/.venv/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=============================================================================================================================================================================================================================================================== test session starts ===============================================================================================================================================================================================================================================================
platform darwin -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /Users/farceo/dev/llama-stack/.venv/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.10.16', 'Platform': 'macOS-15.3.1-arm64-arm-64bit', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'cov': '6.0.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.3', 'anyio': '4.8.0', 'nbval': '0.11.0'}}
rootdir: /Users/farceo/dev/llama-stack
configfile: pyproject.toml
plugins: cov-6.0.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.3, anyio-4.8.0, nbval-0.11.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 7 items
tests/integration/vector_io/test_vector_io.py::test_vector_db_retrieve[emb=all-miniLM-L6-V2] PASSED
tests/integration/vector_io/test_vector_io.py::test_vector_db_register[emb=all-miniLM-L6-V2] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=all-miniLM-L6-V2-test_case0] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=all-miniLM-L6-V2-test_case1] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=all-miniLM-L6-V2-test_case2] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=all-miniLM-L6-V2-test_case3] PASSED
tests/integration/vector_io/test_vector_io.py::test_insert_chunks[emb=all-miniLM-L6-V2-test_case4] PASSED
========================================================================================================================================================================================================================================================= 7 passed, 2 warnings in 40.33s ==========================================================================================================================================================================================================================================================
```
[//]: # (## Documentation)
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Adding chunk_size_in_tokens to playground rag_tool insert.
# Closes#1825
## Test Plan
Tested locally.
[//]: # (## Documentation)
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
This is the second attempt to switch to system packages by default. Now
with a hack to detect conda environment - in which case conda image-type
is used.
Note: Conda will only be used when --image-name is unset *and*
CONDA_DEFAULT_ENV is set. This means that users without conda will
correctly fall back to using system packages when no --image-* arguments
are passed at all.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
Uses virtualenv:
```
$ llama stack build --template ollama --image-type venv
$ llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using virtual environment: /home/ec2-user/src/llama-stack/schedule/.local
[...]
```
Uses system packages (virtualenv already initialized):
```
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
INFO 2025-03-27 20:46:22,882 llama_stack.cli.stack.run:142 server: No image type or image name provided. Assuming environment packages.
[...]
```
Attempt to run from environment packages without necessary packages
installed:
```
$ python -m venv barebones
$ . ./barebones/bin/activate
$ pip install -e . # to install llama command
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
ModuleNotFoundError: No module named 'fastapi'
```
^ failed as expected because the environment doesn't have necessary
packages installed.
Now install some packages in the new environment:
```
$ pip install fastapi opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp aiosqlite ollama openai datasets faiss-cpu mcp autoevals
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
Now see if setting CONDA_DEFAULT_ENV will change what happens by
default:
```
$ export CONDA_DEFAULT_ENV=base
$ llama stack run ~/.llama/distributions/ollama/ollama-run.yaml
[...]
Using conda environment: base
Conda environment base does not exist.
[...]
```
---------
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
* Fix location of `run.yaml` relative to the cloned llama stack
repository
* Drop `-it` from `docker run` commands as its not needed running
services
## Test Plan
* Verified running the llama stack following updated instruction
CC: @ashwinb
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
# What does this PR do?
Set a formatter for log file handler that does not pollute log messages
with color tags.
## Test Plan
Successfully tested with `LLAMA_STACK_LOG_FILE=server.log llama stack
run ...`
Fixes multiple issues
1. llama stack build of dependencies was breaking with incompatible
numpy / pandas when importing datasets
Moved the notebook to start a local server instead of using library as a
client. This way the setup is cleaner since its all contained and by
using `uv run --with` we can test both the server setup process too in
CI and release time.
2. The change to [1] surfaced some other issues
- running `llama stack run` was defaulting to conda env name
- provider data was not being managed properly
- Some notebook cells (telemetry for evals) were not updated with latest
changes
Fixed all the issues and update the notebook.
### Test
1. Manually run it all in local env
2. `pytest -v -s --nbval-lax docs/getting_started.ipynb`
# What does this PR do?
This is to stay consistent with other APIs.
This change registers files in API, even though there are still no
providers. Removing tests that require a provider existing for a merged
API to enable it in API layer.
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
[//]: # (## Documentation)
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
# What does this PR do?
This PR adds support for NVIDIA's NeMo Customizer API to the Llama Stack
post-training module. The integration enables users to fine-tune models
using NVIDIA's cloud-based customization service through a consistent
Llama Stack interface.
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
[Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.*]
Yet to be done
Things pending under this PR:
- [x] Integration of fine-tuned model(new checkpoint) for inference with
nvidia llm distribution
- [x] distribution integration of API
- [x] Add test cases for customizer(In Progress)
- [x] Documentation
```
LLAMA_STACK_BASE_URL=http://localhost:5002 pytest -v tests/client-sdk/post_training/test_supervised_fine_tuning.py
============================================================================================================================================================================ test session starts =============================================================================================================================================================================
platform linux -- Python 3.10.0, pytest-8.3.4, pluggy-1.5.0 -- /home/ubuntu/llama-stack/.venv/bin/python
cachedir: .pytest_cache
metadata: {'Python': '3.10.0', 'Platform': 'Linux-6.8.0-1021-gcp-x86_64-with-glibc2.35', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'nbval': '0.11.0', 'metadata': '3.1.1', 'anyio': '4.8.0', 'html': '4.1.1', 'asyncio': '0.25.3'}}
rootdir: /home/ubuntu/llama-stack
configfile: pyproject.toml
plugins: nbval-0.11.0, metadata-3.1.1, anyio-4.8.0, html-4.1.1, asyncio-0.25.3
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 2 items
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_post_training_provider_registration[txt=8B] PASSED [ 50%]
tests/client-sdk/post_training/test_supervised_fine_tuning.py::test_list_training_jobs[txt=8B] PASSED [100%]
======================================================================================================================================================================== 2 passed, 1 warning in 0.10s ========================================================================================================================================================================
```
cc: @mattf @dglogo @sumitb
---------
Co-authored-by: Ubuntu <ubuntu@llama-stack-customizer-dev-inst-2tx95fyisatvlic4we8hidx5tfj.us-central1-a.c.brevdevprod.internal>
# What does this PR do?
This PR updates the sqlite-vec database calls to be non-blocking. Note
that each operation creates a new connection, which incurs some
performance overhead but is reasonable given [SQLite's threading and
connections constraints](https://www.sqlite.org/threadsafe.html).
Summary of changes:
- Refactored `SQLiteVecIndex` class to store database path instead of
connection object
- Added `_create_sqlite_connection()` helper function to create
connections on demand
- Ensured proper connection closure in all database operations
- Fixed test fixtures to use a file-based SQLite database for
thread-safety
- Updated the `SQLiteVecVectorIOAdapter` class to handle per-operation
connections
This PR helps chip away at
https://github.com/meta-llama/llama-stack/issues/1489
## Test Plan
sqlite-vec unit tests passed locally as well as a test script using the
client as a library.
## Misc
FYI @varshaprasad96 @kevincogan
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
**What**
Instead of adhoc creating a vectordb and chunking when documents ae sent
as an attachment to agent turn, we directly pass raw text from document
into messages to model for user context, and let model perform
summarization directly.
This removes the magic behaviour, and yields better performance than
existing approach.
**Improved Performance**
- RAG lifecycle notebook
- Model: 0.3 factuality score
- (+ websearch) Agent: 0.44 factuality score
- (+ vector db) Agent: 0.3 factuality score
- (+ raw context) Agent: 0.6 factuality score
Closes https://github.com/meta-llama/llama-stack/issues/1478
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
- [NEW] added section in RAG lifecycle notebook shows better performance
<img width="840" alt="image"
src="https://github.com/user-attachments/assets/a0c4e816-809a-41c0-9124-89825983e3f5"
/>
[//]: # (## Documentation)
# What does this PR do?
Gets rid of errors like the below, which is on all webmethod decorated
functions
llama_stack/apis/agents/agents.py:398: error: Value of type variable "T"
of function cannot be "Callable[[Agents, AgentConfig], Coroutine[Any,
Any, AgentCreateResponse]]" [type-var]
## Test Plan
Run mypy and observes mypy errors gone
# What does this PR do?
1) Uses otel compatible id generation for stack
2) Stack starts returning trace id info in the header of response
3) We inject the same trace id that we have into otel in order to force
it to use our trace ids.
## Test Plan
```
curl -i --request POST \
--url http://localhost:8321/v1/inference/chat-completion \
--header 'content-type: application/json' \
--data '{
"model_id": "meta-llama/Llama-3.1-70B-Instruct",
"messages": [
{
"role": "user",
"content": {
"type": "text",
"text": "where do humans live"
}
}
],
"stream": false
}'
HTTP/1.1 200 OK
date: Fri, 21 Mar 2025 21:51:19 GMT
server: uvicorn
content-length: 1712
content-type: application/json
x-trace-id: 595101ede31ece116ebe35b26d67e8cf
{"metrics":[{"metric":"prompt_tokens","value":10,"unit":null},{"metric":"completion_tokens","value":320,"unit":null},{"metric":"total_tokens","value":330,"unit":null}],"completion_message":{"role":"assistant","content":"Humans live on the planet Earth, specifically on its landmasses and in its oceans. Here's a breakdown of where humans live:\n\n1. **Continents:** Humans inhabit all seven continents:\n\t* Africa\n\t* Antarctica ( temporary residents, mostly scientists and researchers)\n\t* Asia\n\t* Australia\n\t* Europe\n\t* North America\n\t* South America\n2. **Countries:** There are 196 countries recognized by the United Nations, and humans live in almost all of them.\n3. **Cities and towns:** Many humans live in urban areas, such as cities and towns, which are often located near coastlines, rivers, or other bodies of water.\n4. **Rural areas:** Some humans live in rural areas, such as villages, farms, and countryside.\n5. **Islands:** Humans inhabit many islands around the world, including tropical islands, island nations, and islands in the Arctic and Antarctic regions.\n6. **Underwater habitats:** A few humans live in underwater habitats, such as research stations and submarines.\n7. **Space:** A small number of humans have lived in space, including astronauts on the International Space Station and those who have visited the Moon.\n\nIn terms of specific environments, humans live in a wide range of ecosystems, including:\n\n* Deserts\n* Forests\n* Grasslands\n* Mountains\n* Oceans\n* Rivers\n* Tundras\n* Wetlands\n\nOverall, humans are incredibly adaptable and can be found living in almost every corner of the globe.","stop_reason":"end_of_turn","tool_calls":[]},"logprobs":null}
```
Same trace id in Jaeger and sqlite:


# What does this PR do?
## Test Plan
LLAMA_STACK_CONFIG=dev pytest -s -v
tests/integration/agents/test_agents.py::test_custom_tool
--safety-shield meta-llama/Llama-Guard-3-8B --text-model
accounts/fireworks/models/llama-v3p1-8b-instruct
and verify trace in jaeger UI
https://llama-stack.readthedocs.io/en/latest/building_applications/telemetry.html#
# What does this PR do?
- We cannot directly return a literal type
> Note: this is not final jobs API change
[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])
## Test Plan
<img width="837" alt="image"
src="https://github.com/user-attachments/assets/18a17561-35f9-443d-987d-54afdd6ff40c"
/>
[//]: # (## Documentation)
Required to startup a distribution with prompt guard
Closes: #1723
## Test Plan
distribution starts with patch applied
Signed-off-by: Derek Higgins <derekh@redhat.com>