# What does this PR do?
- Fix typo
- Support Llama 3.3 70B
## Test Plan
Run the following scripts and obtain the test results
Script
```
pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming --env SAMBANOVA_API_KEY={API_KEY}
```
Result
```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-sambanova] PASSED
=========================================== 1 passed, 1 warning in 1.26s ============================================
```
Script
```
pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming --env SAMBANOVA_API_KEY={API_KEY}
```
Result
```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-sambanova] PASSED
=========================================== 1 passed, 1 warning in 0.52s ============================================
```
## Sources
Please link relevant resources if necessary.
## Before submitting
- [N] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [Y] Ran pre-commit to handle lint / formatting issues.
- [Y] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [Y] Updated relevant documentation.
- [N] Wrote necessary unit or integration tests.
# What does this PR do?
1) As per @mattf's suggestion, we want to mark the pytest as xfail for
providers that do not support the functionality. In this diff, we xfail
the logProbs inference tests for providers who does not support log
probs.
( log probs is only supported by together, fireworks and vllm)
2) Added logProbs support for together according to their developer
[doc](https://docs.together.ai/docs/logprobs).
## Test Plan
1) Together & Fireworks
```
export LLAMA_STACK_CONFIG=/Users/sxyi/llama-stack/llama_stack/templates/together/run.yaml
/opt/miniconda3/envs/stack/bin/pytest -s -v /Users/sxyi/llama-stack/tests/client-sdk/inference/test_inference.py
```
```
tests/client-sdk/inference/test_inference.py::test_text_completion_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_completion_log_probs_non_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_completion_log_probs_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_text_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-What are the names of planets in our solar system?-Earth] PASSED
tests/client-sdk/inference/test_inference.py::test_text_chat_completion_non_streaming[meta-llama/Llama-3.1-8B-Instruct-What are the names of the planets that have rings around them?-Saturn] PASSED
tests/client-sdk/inference/test_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What's the name of the Sun in latin?-Sol] PASSED
tests/client-sdk/inference/test_inference.py::test_text_chat_completion_streaming[meta-llama/Llama-3.1-8B-Instruct-What is the name of the US captial?-Washington] PASSED
tests/client-sdk/inference/test_inference.py::test_text_chat_completion_with_tool_calling_and_non_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_text_chat_completion_with_tool_calling_and_streaming[meta-llama/Llama-3.1-8B-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_text_chat_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_image_chat_completion_non_streaming[meta-llama/Llama-3.2-11B-Vision-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_image_chat_completion_streaming[meta-llama/Llama-3.2-11B-Vision-Instruct] PASSED
tests/client-sdk/inference/test_inference.py::test_image_chat_completion_base64_url[meta-llama/Llama-3.2-11B-Vision-Instruct] PASSED
========================================================================================== 15 passed, 2 warnings in 19.46s ===========================================================================================
```
```
export LLAMA_STACK_CONFIG=/Users/sxyi/llama-stack/llama_stack/templates/fireworks/run.yaml
/opt/miniconda3/envs/stack/bin/pytest -s -v /Users/sxyi/llama-stack/tests/client-sdk/inference/test_inference.py
```
All tests passed
2) Ollama - LogProbs tests are marked as xfailed.
```
tests/client-sdk/inference/test_inference.py::test_completion_log_probs_non_streaming[meta-llama/Llama-3.1-8B-Instruct] XFAIL (remote::ollama doesn't support log probs yet)
tests/client-sdk/inference/test_inference.py::test_completion_log_probs_streaming[meta-llama/Llama-3.1-8B-Instruct] XFAIL (remote::ollama doesn't support log probs yet)
```
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
fix type mismatch in /v1/inference/completion
## Test Plan
`llama stack run ./llama_stack/templates/nvidia/run.yaml`
`LLAMA_STACK_BASE_URL="http://localhost:8321" pytest -v
tests/client-sdk/inference/test_inference.py`
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- Fix loading SambaNovaImpl issue
- Add LlamaGuard model support for inference
## Test Plan
Run the following unit test scripts and results
### Embedding
```
pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_embeddings.py --inference-model meta-llama/Llama-3.2-11B-Vision-Instruct --env SAMBANOVA_API_KEY={SAMBANOVA_API_KEY}
```
```
llama_stack/providers/tests/inference/test_embeddings.py::TestEmbeddings::test_embeddings[-sambanova] SKIPPED (This test is only applicable for embedding models)
llama_stack/providers/tests/inference/test_embeddings.py::TestEmbeddings::test_batch_embeddings[-sambanova] SKIPPED (This test is only applicable for embedding models)
=================================================================================================================== 2 skipped, 1 warning in 0.32s ===================================================================================================================
```
### Vision
```
pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_vision_inference.py --inference-model meta-llama/Llama-3.2-11B-Vision-Instruct --env SAMBANOVA_API_KEY={SAMBANOVA_API_KEY}
```
```
llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_non_streaming[-sambanova-image0-expected_strings0] PASSED
llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_non_streaming[-sambanova-image1-expected_strings1] PASSED
llama_stack/providers/tests/inference/test_vision_inference.py::TestVisionModelInference::test_vision_chat_completion_streaming[-sambanova] PASSED
=================================================================================================================== 3 passed, 1 warning in 2.68s ====================================================================================================================
```
### Text
```
pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming --env SAMBANOVA_API_KEY={SAMBANOVA_API_KEY}
```
```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-sambanova] PASSED
=================================================================================================================== 1 passed, 1 warning in 0.46s ====================================================================================================================
```
```
pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming --env SAMBANOVA_API_KEY={SAMBANOVA_API_KEY}
```
```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-sambanova] PASSED
=================================================================================================================== 1 passed, 1 warning in 0.48s ====================================================================================================================
```
## Before submitting
- [] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [Y] Ran pre-commit to handle lint / formatting issues.
- [Y] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [Y] Updated relevant documentation.
- [Y] Wrote necessary unit or integration tests.
# What does this PR do?
This PR adds SambaNova as one of the Provider
- Add SambaNova as a provider
## Test Plan
Test the functional command
```
pytest -s -v --providers inference=sambanova llama_stack/providers/tests/inference/test_embeddings.py llama_stack/providers/tests/inference/test_prompt_adapter.py llama_stack/providers/tests/inference/test_text_inference.py llama_stack/providers/tests/inference/test_vision_inference.py --env SAMBANOVA_API_KEY=<sambanova-api-key>
```
Test the distribution template:
```
# Docker
LLAMA_STACK_PORT=5001
docker run -it -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
llamastack/distribution-sambanova \
--port $LLAMA_STACK_PORT \
--env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
# Conda
llama stack build --template sambanova --image-type conda
llama stack run ./run.yaml \
--port $LLAMA_STACK_PORT \
--env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
```
## Source
[SambaNova API Documentation](https://cloud.sambanova.ai/apis)
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [Y] Ran pre-commit to handle lint / formatting issues.
- [Y] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [Y] Updated relevant documentation.
- [Y ] Wrote necessary unit or integration tests.
---------
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
Adds raw completions API to vLLM
## Test Plan
<details>
<summary>Setup</summary>
```bash
# Run vllm server
conda create -n vllm python=3.12 -y
conda activate vllm
pip install vllm
# Run llamastack
conda create --name llamastack-vllm python=3.10
conda activate llamastack-vllm
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct && \
pip install -e . && \
pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7 && \
llama stack build --template remote-vllm --image-type conda && \
llama stack run ./distributions/remote-vllm/run.yaml \
--port 5000 \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env VLLM_URL=http://localhost:8000/v1 | tee -a llama-stack.log
```
</details>
<details>
<summary>Integration</summary>
```bash
# Run
conda activate llamastack-vllm
export VLLM_URL=http://localhost:8000/v1
pip install pytest pytest_html pytest_asyncio aiosqlite
pytest llama_stack/providers/tests/inference/test_text_inference.py -v -k vllm
# Results
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm_remote] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm_remote] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm_remote] SKIPPED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm_remote] SKIPPED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm_remote] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm_remote] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm_remote] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm_remote] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm_remote] PASSED [100%]
====================================== 7 passed, 2 skipped, 99 deselected, 1 warning in 9.80s ======================================
```
</details>
<details>
<summary>Manual</summary>
```bash
# Install
pip install --no-cache --index-url https://pypi.org/simple/ --extra-index-url https://test.pypi.org/simple/ llama-stack==0.1.0rc7
```
Apply this diff
```diff
diff --git a/llama_stack/distribution/server/server.py b/llama_stack/distribution/server/server.py
index 8dbb193..95173e2 100644
--- a/llama_stack/distribution/server/server.py
+++ b/llama_stack/distribution/server/server.py
@@ -250,7 +250,7 @@ class ClientVersionMiddleware:
server_version_parts = tuple(
map(int, self.server_version.split(".")[:2])
)
- if client_version_parts != server_version_parts:
+ if False and client_version_parts != server_version_parts:
async def send_version_error(send):
await send(
diff --git a/llama_stack/templates/remote-vllm/run.yaml b/llama_stack/templates/remote-vllm/run.yaml
index 4eac4da..32eb50e 100644
--- a/llama_stack/templates/remote-vllm/run.yaml
+++ b/llama_stack/templates/remote-vllm/run.yaml
@@ -94,7 +94,8 @@ metadata_store:
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/remote-vllm}/registry.db
models:
-- metadata: {}
+- metadata:
+ llama_model: meta-llama/Llama-3.2-3B-Instruct
model_id: ${env.INFERENCE_MODEL}
provider_id: vllm-inference
model_type: llm
```
Test 1:
```python
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(
base_url="http://localhost:5000",
)
response = client.inference.completion(
model_id="meta-llama/Llama-3.2-3B-Instruct",
content="Hello, world client!",
)
print(response)
```
Test 2
```
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(
base_url="http://localhost:5000",
)
response = client.inference.completion(
model_id="meta-llama/Llama-3.2-3B-Instruct",
content="Hello, world client!",
stream=True,
)
for chunk in response:
print(chunk.delta, end="", flush=True)
```
```
I'm excited to introduce you to our latest project, a comprehensive guide to the best coffee shops in [City]. As a coffee connoisseur, you're in luck because we've scoured the city to bring you the top picks for the perfect cup of joe.
In this guide, we'll take you on a journey through the city's most iconic coffee shops, highlighting their unique features, must-try drinks, and insider tips from the baristas themselves. From cozy cafes to trendy cafes, we've got you covered.
**Top 5 Coffee Shops in [City]**
1. **The Daily Grind**: This beloved institution has been serving up expertly crafted pour-overs and lattes for over 10 years. Their expert baristas are always happy to guide you through their menu, which features a rotating selection of single-origin beans from around the world...
```
</details>
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
Some small updates to the inference types to make them more standard
Specifically:
- image data is now located in a "image" subkey
- similarly tool call data is located in a "tool_call" subkey
The pattern followed is `dict(type="foo", foo=<...>)`
Enable downloads before sending request to fireworks.
Test using --
`LLAMA_STACK_CONFIG=./llama_stack/templates/fireworks/run.yaml pytest -s
-v -k 'test_image_chat_completion_streaming' tests/client-sdk`
# What does this PR do?
1) enabled structured output for ollama /completion API. It seems we
missed this one.
2) fixed ollama structured output test in client sdk - ollama does not
support list format for structured output
3) enable structured output unit test as the result was stable on
Llama-3.1-8B-Instruct and ollama, fireworks, together.
## Test Plan
1) Run `test_completion_structured_output` on /completion API with 3
providers: ollama, fireworks, together.
pytest -v -s -k "together"
--inference-model="meta-llama/Llama-3.1-8B-Instruct"
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output
```
(base) sxyi@sxyi-mbp llama-stack % pytest -s -v llama_stack/providers/tests/inference --config=ci_test_config.yaml
/Library/Frameworks/Python.framework/Versions/3.13/lib/python3.13/site-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
================================================================================================ test session starts =================================================================================================
platform darwin -- Python 3.13.0, pytest-8.3.4, pluggy-1.5.0 -- /Library/Frameworks/Python.framework/Versions/3.13/bin/python3.13
cachedir: .pytest_cache
metadata: {'Python': '3.13.0', 'Platform': 'macOS-15.1.1-arm64-arm-64bit-Mach-O', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'asyncio': '0.24.0', 'html': '4.1.1', 'metadata': '3.1.1', 'md': '0.2.0', 'dependency': '0.6.0', 'md-report': '0.6.3', 'anyio': '4.6.2.post1'}}
rootdir: /Users/sxyi/llama-stack
configfile: pyproject.toml
plugins: asyncio-0.24.0, html-4.1.1, metadata-3.1.1, md-0.2.0, dependency-0.6.0, md-report-0.6.3, anyio-4.6.2.post1
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 85 items / 82 deselected / 3 selected
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-ollama] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-fireworks]
PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[meta-llama/Llama-3.1-8B-Instruct-together] PASSED
==================================================================================== 3 passed, 82 deselected, 8 warnings in 5.67s ====================================================================================
```
2)
` LLAMA_STACK_CONFIG="./llama_stack/templates/ollama/run.yaml"
/opt/miniconda3/envs/stack/bin/pytest -s -v tests/client-sdk/inference`
Before:
```
________________________________________________________________________________________ test_completion_structured_output __________________________________________________________________________________________
tests/client-sdk/inference/test_inference.py:174: in test_completion_structured_output
answer = AnswerFormat.model_validate_json(response.content)
E pydantic_core._pydantic_core.ValidationError: 1 validation error for AnswerFormat
E Invalid JSON: expected value at line 1 column 2 [type=json_invalid, input_value=' The year he retired, he...5\n\nThe best answer is', input_type=str]
E For further information visit https://errors.pydantic.dev/2.10/v/json_invalid
```
After:
test consistently passes
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- previous fix introduced regression for non base64 image
- add back download, and base64 check
## Test Plan
<img width="835" alt="image"
src="https://github.com/user-attachments/assets/b70bf725-035a-4b42-b492-53daaf71458a"
/>
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- fix base64 based image url for vllm
- add a test case for base64 based image_url
- fixes issue: https://github.com/meta-llama/llama-stack/issues/571
## Test Plan
```
LLAMA_STACK_BASE_URL=http://localhost:8321 pytest -v ./tests/client-sdk/inference/test_inference.py::test_image_chat_completion_base64_url
```
<img width="991" alt="image"
src="https://github.com/user-attachments/assets/d56381ba-6777-4d23-9da9-81f73ce93566"
/>
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- Fix TGI adapter
## Test Plan
<img width="851" alt="image"
src="https://github.com/user-attachments/assets/0084cbc6-6713-4079-b87b-0befd9aca0b0"
/>
- most inference working
- agent test failure due to model outputs
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- add completion log probs for fireworks
## Test Plan
<img width="849" alt="image"
src="https://github.com/user-attachments/assets/5aa1f27f-02a6-422c-8478-94dd1e345342"
/>
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- fixes to nvidia inference provider to account for strategy update
- update nvidia templates
## Test Plan
```
llama stack run ./llama_stack/templates/nvidia/run.yaml --port 5000
LLAMA_STACK_BASE_URL="http://localhost:5000" pytest -v tests/client-sdk/inference/test_inference.py --html=report.html --self-contained-html
```
<img width="1288" alt="image"
src="https://github.com/user-attachments/assets/d20f9aea-525e-47de-a5be-586e022e0d55"
/>
**NOTE**
- vision inference broken
- tool calling broken
- /completion broken
cc @mattf @cdgamarose-nv for improving NVIDIA inference adapter
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
adds nvidia template for creating a distribution using inference adapter
for NVIDIA NIMs.
## Test Plan
Please describe:
Build llama stack distribution for nvidia using the template, docker and
conda.
```bash
(.venv) local-cdgamarose@a4u8g-0006:~/llama-stack$ llama-stack-client configure --endpoint http://localhost:5000
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5000
(.venv) local-cdgamarose@a4u8g-0006:~/llama-stack$ llama-stack-client models list
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ provider_resource_id ┃ metadata ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ Llama3.1-8B-Instruct │ nvidia │ meta/llama-3.1-8b-instruct │ {} │
│ meta-llama/Llama-3.2-3B-Instruct │ nvidia │ meta/llama-3.2-3b-instruct │ {} │
└──────────────────────────────────┴─────────────┴────────────────────────────┴──────────┘
(.venv) local-cdgamarose@a4u8g-0006:~/llama-stack$ llama-stack-client inference chat-completion --message "hello, write me a 2 sentence poem"
ChatCompletionResponse(
completion_message=CompletionMessage(
content='Here is a 2 sentence poem:\n\nThe sun sets slow and paints the sky, \nA gentle hue of pink that makes me sigh.',
role='assistant',
stop_reason='end_of_turn',
tool_calls=[]
),
logprobs=None
)
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---------
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
Cleans up how we provide sampling params. Earlier, strategy was an enum
and all params (top_p, temperature, top_k) across all strategies were
grouped. We now have a strategy union object with each strategy (greedy,
top_p, top_k) having its corresponding params.
Earlier,
```
class SamplingParams:
strategy: enum ()
top_p, temperature, top_k and other params
```
However, the `strategy` field was not being used in any providers making
it confusing to know the exact sampling behavior purely based on the
params since you could pass temperature, top_p, top_k and how the
provider would interpret those would not be clear.
Hence we introduced -- a union where the strategy and relevant params
are all clubbed together to avoid this confusion.
Have updated all providers, tests, notebooks, readme and otehr places
where sampling params was being used to use the new format.
## Test Plan
`pytest llama_stack/providers/tests/inference/groq/test_groq_utils.py`
// inference on ollama, fireworks and together
`with-proxy pytest -v -s -k "ollama"
--inference-model="meta-llama/Llama-3.1-8B-Instruct"
llama_stack/providers/tests/inference/test_text_inference.py `
// agents on fireworks
`pytest -v -s -k 'fireworks and create_agent'
--inference-model="meta-llama/Llama-3.1-8B-Instruct"
llama_stack/providers/tests/agents/test_agents.py
--safety-shield="meta-llama/Llama-Guard-3-8B"`
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [X] Updated relevant documentation.
- [X] Wrote necessary unit or integration tests.
---------
Co-authored-by: Hardik Shah <hjshah@fb.com>
# What does this PR do?
Fix https://github.com/meta-llama/llama-stack/issues/697
## Test Plan
Run the 405b model. the full `accounts/fireworks/models/<model_id>` is
the full model name for Fireworks, the 'fireworks/<model_id>' is just a
short hand and sometimes have routing issues
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
We are setting a default value of json for tool prompt format, which
conflicts with llama 3.2/3.3 models since they use python list. This PR
changes the defaults to None and in the code, we infer default based on
the model.
Addresses: #695
Tests:
❯ LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v
tests/client-sdk/inference/test_inference.py -k
"test_text_chat_completion"
pytest llama_stack/providers/tests/inference/test_prompt_adapter.py
Add another header so client SDKs can identify their versions which can
be used for immediate detection of possible compatibility issues. A
semver mismatch against the wrong server should be immediately flagged
and requests should be denied.
Also change `X-LlamaStack-ProviderData` to `X-LlamaStack-Provider-Data`
since that hyphenation is better.
# What does this PR do?
PR #639 introduced the notion of Tools API and ability to invoke tools
through API just as any resource. This PR changes the Agents to start
using the Tools API to invoke tools. Major changes include:
1) Ability to specify tool groups with AgentConfig
2) Agent gets the corresponding tool definitions for the specified tools
and pass along to the model
3) Attachements are now named as Documents and their behavior is mostly
unchanged from user perspective
4) You can specify args that can be injected to a tool call through
Agent config. This is especially useful in case of memory tool, where
you want the tool to operate on a specific memory bank.
5) You can also register tool groups with args, which lets the agent
inject these as well into the tool call.
6) All tests have been migrated to use new tools API and fixtures
including client SDK tests
7) Telemetry just works with tools API because of our trace protocol
decorator
## Test Plan
```
pytest -s -v -k fireworks llama_stack/providers/tests/agents/test_agents.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
pytest -s -v -k together llama_stack/providers/tests/tools/test_tools.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
LLAMA_STACK_CONFIG="/Users/dineshyv/.llama/distributions/llamastack-together/together-run.yaml" pytest -v tests/client-sdk/agents/test_agents.py
```
run.yaml:
https://gist.github.com/dineshyv/0365845ad325e1c2cab755788ccc5994
Notebook:
https://colab.research.google.com/drive/1ck7hXQxRl6UvT-ijNRZ-gMZxH1G3cN2d?usp=sharing
# What does this PR do?
- add llama3.3 model for together
- fix fireworks distro_codegen
```
python llama_stack/scripts/distro_codegen.py
```
## Test Plan
<img width="1132" alt="image"
src="https://github.com/user-attachments/assets/bf94b933-9200-4e73-878e-d1a95d450a88"
/>
**Tests**
```
pytest -v -s -k "together" --inference-model="meta-llama/Llama-3.3-70B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py
```
<img width="1139" alt="image"
src="https://github.com/user-attachments/assets/407dc98b-8de3-4841-8cb1-75e4b5128544"
/>
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
Contributes towards: #432
RE: https://github.com/meta-llama/llama-stack/pull/609
I missed this one while refactoring. Fixes:
```python
Traceback (most recent call last):
File "/Users/aidand/dev/llama-stack/llama_stack/distribution/server/server.py", line 191, in endpoint
return await maybe_await(value)
File "/Users/aidand/dev/llama-stack/llama_stack/distribution/server/server.py", line 155, in maybe_await
return await value
File "/Users/aidand/dev/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
result = await method(self, *args, **kwargs)
File "/Users/aidand/dev/llama-stack/llama_stack/distribution/routers/routers.py", line 156, in chat_completion
return await provider.chat_completion(**params)
File "/Users/aidand/dev/llama-stack/llama_stack/providers/utils/telemetry/trace_protocol.py", line 101, in async_wrapper
result = await method(self, *args, **kwargs)
File "/Users/aidand/dev/llama-stack/llama_stack/providers/remote/inference/groq/groq.py", line 127, in chat_completion
response = self._get_client().chat.completions.create(**request)
File "/Users/aidand/dev/llama-stack/llama_stack/providers/remote/inference/groq/groq.py", line 143, in _get_client
return Groq(api_key=self.config.api_key)
AttributeError: 'GroqInferenceAdapter' object has no attribute 'config'. Did you mean: '_config'?
```
## Test Plan
Environment:
```shell
export GROQ_API_KEY=<api-key>
# build.yaml and run.yaml files
wget https://raw.githubusercontent.com/aidando73/llama-stack/9165502582cd7cb178bc1dcf89955b45768ab6c1/build.yaml
wget https://raw.githubusercontent.com/aidando73/llama-stack/9165502582cd7cb178bc1dcf89955b45768ab6c1/run.yaml
# Create environment if not already
conda create --prefix ./envs python=3.10
conda activate ./envs
# Build
pip install -e . && llama stack build --config ./build.yaml --image-type conda
# Activate built environment
conda activate llamastack-groq
```
<details>
<summary>Manual</summary>
```bash
llama stack run ./run.yaml --port 5001
```
Via this Jupyter notebook:
9165502582/hello.ipynb
</details>
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
Addresses issue #679
- Adds support for the response_format field for chat completions and
completions so users can get their outputs in JSON
## Test Plan
<details>
<summary>Integration tests</summary>
`pytest
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output
-k ollama -s -v`
```python
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-ollama] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-ollama] PASSED
================================== 2 passed, 18 deselected, 3 warnings in 41.41s ==================================
```
</details>
<details>
<summary>Manual Tests</summary>
```
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export OLLAMA_INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=5000
ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
llama stack build --template ollama --image-type conda
llama stack run ./run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://localhost:11434
```
```python
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
MODEL_ID=meta-llama/Llama-3.2-3B-Instruct
prompt =f"""
Create a step by step plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.
You have 3 different operations you can perform. You can create a file, update a file, or delete a file.
Limit your step by step plan to only these operations per step.
Don't create more than 10 steps.
Please ensure there's a README.md file in the root of the codebase that describes the codebase and how to run it.
Please ensure there's a requirements.txt file in the root of the codebase that describes the dependencies of the codebase.
"""
response = client.inference.chat_completion(
model_id=MODEL_ID,
messages=[
{"role": "user", "content": prompt},
],
sampling_params={
"max_tokens": 200000,
},
response_format={
"type": "json_schema",
"json_schema": {
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Plan",
"description": f"A plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.",
"type": "object",
"properties": {
"steps": {
"type": "array",
"items": {
"type": "string"
}
}
},
"required": ["steps"],
"additionalProperties": False,
}
},
stream=True,
)
content = ""
for chunk in response:
if chunk.event.delta:
print(chunk.event.delta, end="", flush=True)
content += chunk.event.delta
try:
plan = json.loads(content)
print(plan)
except Exception as e:
print(f"Error parsing plan into JSON: {e}")
plan = {"steps": []}
```
Outputs:
```json
{
"steps": [
"Update the requirements.txt file to include the updated dependencies specified in the peer's feedback, including the Google Cloud Translation API key.",
"Update the app.py file to address the code smells and incorporate the suggested improvements, such as handling errors and exceptions, initializing the Translator object correctly, adding input validation, using type hints and docstrings, and removing unnecessary logging statements.",
"Create a README.md file that describes the codebase and how to run it.",
"Ensure the README.md file is up-to-date and accurate.",
"Update the requirements.txt file to reflect any additional dependencies specified by the peer's feedback.",
"Add documentation for each function in the app.py file using docstrings.",
"Implement logging statements throughout the app.py file to monitor application execution.",
"Test the API endpoint to ensure it correctly translates text from English to French and handles errors properly.",
"Refactor the code to follow PEP 8 style guidelines and ensure consistency in naming conventions, indentation, and spacing.",
"Create a new folder for logs and add a logging configuration file (e.g., logconfig.json) that specifies the logging level and output destination.",
"Deploy the web server on a production environment (e.g., AWS Elastic Beanstalk or Google Cloud Platform) to make it accessible to external users."
]
}
```
</details>
## Sources
- Ollama api docs:
https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
- Ollama structured output docs:
https://github.com/ollama/ollama/blob/main/docs/api.md#request-structured-outputs
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# What does this PR do?
- Makes Llama 70B 3.3 available for fireworks
## Test Plan
```shell
pip install -e . \
&& llama stack build --config distributions/fireworks/build.yaml --image-type conda \
&& llama stack run distributions/fireworks/run.yaml \
--port 5000
```
```python
response = client.inference.chat_completion(
model_id="Llama3.3-70B-Instruct",
messages=[
{"role": "user", "content": "hello world"},
],
)
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
In short, provide a summary of what this PR does and why. Usually, the
relevant context should be present in a linked issue.
- [x] Addresses issue (#issue)
```
from .nvidia import NVIDIAInferenceAdapter
File "/localhome/local-cdgamarose/llama-stack/llama_stack/providers/remote/inference/nvidia/nvidia.py", line 37, in <module>
from .openai_utils import (
File "/localhome/local-cdgamarose/llama-stack/llama_stack/providers/remote/inference/nvidia/openai_utils.py", line 11, in <module>
from llama_models.llama3.api.datatypes import (
ImportError: cannot import name 'CompletionMessage' from 'llama_models.llama3.api.datatypes' (/localhome/local-cdgamarose/.local/lib/python3.10/site-packages/llama_models/llama3/api/datatypes.py)
++ error_handler 62
```
## Test Plan
Deploy NIM using docker from
https://build.nvidia.com/meta/llama-3_1-8b-instruct?snippet_tab=Docker
```
(lsmyenv) local-cdgamarose@a4u8g-0006:~/llama-stack$ python3 -m pytest -s -v --providers inference=nvidia llama_stack/providers/tests/inference/ --env NVIDIA_BASE_URL=http://localhost:8000 -k test_completion --inference-model Llama3.1-8B-Instruct
======================================================================================== test session starts =========================================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /localhome/local-cdgamarose/anaconda3/envs/lsmyenv/bin/python3
cachedir: .pytest_cache
rootdir: /localhome/local-cdgamarose/llama-stack
configfile: pyproject.toml
plugins: anyio-4.7.0, asyncio-0.25.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 24 items / 21 deselected / 3 selected
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-nvidia] Initializing NVIDIAInferenceAdapter(http://localhost:8000)...
Checking NVIDIA NIM health...
Checking NVIDIA NIM health...
PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-nvidia] SKIPPED (Other inference providers don't support completion() yet)
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-nvidia] SKIPPED (This test is not quite robust)
====================================================================== 1 passed, 2 skipped, 21 deselected, 2 warnings in 1.57s =======================================================================
```
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# What does this PR do?
Cerebras is rolling out support for llama 3.3 70b and deprecating llama
3.1 70b. This PR updates the documentation, config, and internal mapping
to reflect this change.
cc: @ashwinb @raghotham
## What does this PR do?
This is a long-pending change and particularly important to get done
now.
Specifically:
- we cannot "localize" (aka download) any URLs from media attachments
anywhere near our modeling code. it must be done within llama-stack.
- `PIL.Image` is infesting all our APIs via `ImageMedia ->
InterleavedTextMedia` and that cannot be right at all. Anything in the
API surface must be "naturally serializable". We need a standard `{
type: "image", image_url: "<...>" }` which is more extensible
- `UserMessage`, `SystemMessage`, etc. are moved completely to
llama-stack from the llama-models repository.
See https://github.com/meta-llama/llama-models/pull/244 for the
corresponding PR in llama-models.
## Test Plan
```bash
cd llama_stack/providers/tests
pytest -s -v -k "fireworks or ollama or together" inference/test_vision_inference.py
pytest -s -v -k "(fireworks or ollama or together) and llama_3b" inference/test_text_inference.py
pytest -s -v -k chroma memory/test_memory.py \
--env EMBEDDING_DIMENSION=384 --env CHROMA_DB_PATH=/tmp/foobar
pytest -s -v -k fireworks agents/test_agents.py \
--safety-shield=meta-llama/Llama-Guard-3-8B \
--inference-model=meta-llama/Llama-3.1-8B-Instruct
```
Updated the client sdk (see PR ...), installed the SDK in the same
environment and then ran the SDK tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=together pytest -s -v agents/test_agents.py
LLAMA_STACK_CONFIG=ollama pytest -s -v memory/test_memory.py
# this one needed a bit of hacking in the run.yaml to ensure I could register the vision model correctly
INFERENCE_MODEL=llama3.2-vision:latest LLAMA_STACK_CONFIG=ollama pytest -s -v inference/test_inference.py
```
# What does this PR do?
**Why**
- When AgentConfig has no `input_shields` / `output_shields` defined, we
still outputs a shield_call step with violation=None. This is impossible
to distinguish the case b/w (1) no violation from running shields v.s.
(2) no shields call
**What**
- We should not have a shield_call step when no `input_shields` /
`output_shields` are defined.
- Also removes a never reached try/catch code block in agent loop.
`run_multiple_shields` is never called in the try block (verified by
stacktrace print)
**Side Note**
- pre-commit fix
## Test Plan
Tested w/ DirectClient via:
https://gist.github.com/yanxi0830/b48f2a53b6f5391b9ff1e39992bc05b3
**No Shields**
<img width="858" alt="image"
src="https://github.com/user-attachments/assets/67319370-329f-4954-bd16-d21ce54c6ebf"
/>
**With Input + Output Shields**
<img width="854" alt="image"
src="https://github.com/user-attachments/assets/75ab1bee-3ba9-4549-ab51-23210be83da7"
/>
**Input Shields Only**
<img width="858" alt="image"
src="https://github.com/user-attachments/assets/1897206b-13dd-4ea5-92c2-b39bf68e9286"
/>
E2E pytest
```
LLAMA_STACK_BASE_URL=http://localhost:5000 pytest -v ./tests/client-sdk/agents/test_agents.py
```
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
Adds the sentence transformer provider and the `all-MiniLM-L6-v2`
embedding model to the default models to register in the run.yaml for
all providers.
## Test Plan
llama stack build --template together --image-type conda
llama stack run
~/.llama/distributions/llamastack-together/together-run.yaml
This PR does the following:
1) adds the ability to generate embeddings in all supported inference
providers.
2) Moves all the memory providers to use the inference API and improved
the memory tests to setup the inference stack correctly and use the
embedding models
This is a merge from #589 and #598
# What does this PR do?
add the completion api to the nvidia inference provider
## Test Plan
while running the meta/llama-3.1-8b-instruct NIM from
https://build.nvidia.com/meta/llama-3_1-8b-instruct?snippet_tab=Docker
```
➜ pytest -s -v --providers inference=nvidia llama_stack/providers/tests/inference/ --env NVIDIA_BASE_URL=http://localhost:8000 -k test_completion --inference-model Llama3.1-8B-Instruct
=============================================== test session starts ===============================================
platform linux -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0 -- /home/matt/.conda/envs/stack/bin/python
cachedir: .pytest_cache
rootdir: /home/matt/Documents/Repositories/meta-llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0, httpx-0.34.0
asyncio: mode=strict, default_loop_scope=None
collected 20 items / 18 deselected / 2 selected
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-nvidia] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-nvidia] SKIPPED
============================= 1 passed, 1 skipped, 18 deselected, 6 warnings in 5.40s =============================
```
the structured output functionality works but the accuracy fails
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# What does this PR do?
Addresses issue (#391)
- Adds json structured output for vLLM
- Enables structured output tests for vLLM
> Give me a recipe for Spaghetti Bolognaise:
```json
{
"recipe_name": "Spaghetti Bolognaise",
"preamble": "Ah, spaghetti bolognaise - the quintessential Italian dish that fills my kitchen with the aromas of childhood nostalgia. As a child, I would watch my nonna cook up a big pot of spaghetti bolognaise every Sunday, filling our small Italian household with the savory scent of simmering meat and tomatoes. The way the sauce would thicken and the spaghetti would al dente - it was love at first bite. And now, as a chef, I want to share that same love with you, so you can recreate these warm, comforting memories at home.",
"ingredients": [
"500g minced beef",
"1 medium onion, finely chopped",
"2 cloves garlic, minced",
"1 carrot, finely chopped",
" celery, finely chopped",
"1 (28 oz) can whole peeled tomatoes",
"1 tbsp tomato paste",
"1 tsp dried basil",
"1 tsp dried oregano",
"1 tsp salt",
"1/2 tsp black pepper",
"1/2 tsp sugar",
"1 lb spaghetti",
"Grated Parmesan cheese, for serving",
"Extra virgin olive oil, for serving"
],
"steps": [
"Heat a large pot over medium heat and add a generous drizzle of extra virgin olive oil.",
"Add the chopped onion, garlic, carrot, and celery and cook until the vegetables are soft and translucent, about 5-7 minutes.",
"Add the minced beef and cook until browned, breaking it up with a spoon as it cooks.",
"Add the tomato paste and cook for 1-2 minutes, stirring constantly.",
"Add the canned tomatoes, dried basil, dried oregano, salt, black pepper, and sugar. Stir well to combine.",
"Bring the sauce to a simmer and let it cook for 20-30 minutes, stirring occasionally, until the sauce has thickened and the flavors have melded together.",
"While the sauce cooks, bring a large pot of salted water to a boil and cook the spaghetti according to the package instructions until al dente. Reserve 1 cup of pasta water before draining the spaghetti.",
"Add the reserved pasta water to the sauce and stir to combine.",
"Combine the cooked spaghetti and sauce, tossing to coat the pasta evenly.",
"Serve hot, topped with grated Parmesan cheese and a drizzle of extra virgin olive oil.",
"Enjoy!"
]
}
```
Generated with Llama-3.2-3B-Instruct model - pretty good for a 3B
parameter model 👍
## Test Plan
`pytest -v -s
llama_stack/providers/tests/inference/test_text_inference.py -k
llama_3b-vllm_remote`
With the following setup:
```bash
# Environment
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export INFERENCE_PORT=8000
export VLLM_URL=http://localhost:8000/v1
# vLLM server
sudo docker run --gpus all \
-v $STORAGE_DIR/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$(cat ~/.cache/huggingface/token)" \
-p 8000:$INFERENCE_PORT \
--ipc=host \
--net=host \
vllm/vllm-openai:v0.6.3.post1 \
--model $INFERENCE_MODEL
# llama-stack server
llama stack build --template remote-vllm --image-type conda && llama stack run distributions/remote-vllm/run.yaml \
--port 5001 \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
Results:
```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_3b-vllm_remote] PASSED
================================ 6 passed, 2 skipped, 120 deselected, 2 warnings in 13.26s ================================
```
## Sources
- https://github.com/vllm-project/vllm/discussions/8300
- By default, vLLM uses https://github.com/dottxt-ai/outlines for
structured outputs
[[1](32e7db2536/vllm/engine/arg_utils.py (L279-L280))]
## Before submitting
[N/A] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case)
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
[N/A?] Updated relevant documentation. Couldn't find any relevant
documentation. Lmk if I've missed anything.
- [x] Wrote necessary unit or integration tests.
This PR does a few things:
- it moves "direct client" to llama-stack repo instead of being in the
llama-stack-client-python repo
- renames it to `LlamaStackLibraryClient`
- actually makes synchronous generators work
- makes streaming and non-streaming work properly
In many ways, this PR makes things finally "work"
## Test Plan
See a `library_client_test.py` I added. This isn't really quite a test
yet but it demonstrates that this mode now works. Here's the invocation
and the response:
```
INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct python llama_stack/distribution/tests/library_client_test.py ollama
```
