# What does this PR do?
Adds a test for the completion API's logprobs parameter.
TBD which providers pass this test.
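For reference, a rough sketch of the shape such a test takes (fixture, parameter, and field names here are approximations, not the exact test code):
```python
# Sketch of a logprobs check against the completion API. Fixture names and the
# exact logprobs config/response fields may differ from the real test.
import pytest


@pytest.mark.asyncio
async def test_completion_logprobs(inference_impl, inference_model):
    response = await inference_impl.completion(
        model_id=inference_model,
        content="Complete the sentence: the color of the sky is",
        stream=False,
        logprobs={"top_k": 1},  # LogProbConfig-shaped; top-1 logprob per token
    )
    # The provider should return a per-token mapping of token -> log probability.
    assert response.logprobs, "provider returned no logprobs"
    for token_logprobs in response.logprobs:
        assert len(token_logprobs.logprobs_by_token) >= 1
```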
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# What does this PR do?
Add the completion API to the NVIDIA inference provider.
## Test Plan
While running the meta/llama-3.1-8b-instruct NIM from
https://build.nvidia.com/meta/llama-3_1-8b-instruct?snippet_tab=Docker
```
➜ pytest -s -v --providers inference=nvidia llama_stack/providers/tests/inference/ --env NVIDIA_BASE_URL=http://localhost:8000 -k test_completion --inference-model Llama3.1-8B-Instruct
=============================================== test session starts ===============================================
platform linux -- Python 3.10.15, pytest-8.3.3, pluggy-1.5.0 -- /home/matt/.conda/envs/stack/bin/python
cachedir: .pytest_cache
rootdir: /home/matt/Documents/Repositories/meta-llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.6.2.post1, asyncio-0.24.0, httpx-0.34.0
asyncio: mode=strict, default_loop_scope=None
collected 20 items / 18 deselected / 2 selected
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-nvidia] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-nvidia] SKIPPED
============================= 1 passed, 1 skipped, 18 deselected, 6 warnings in 5.40s =============================
```
The structured output functionality works, but the accuracy check fails.
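As a quick manual check outside the test harness, the NIM's OpenAI-compatible completions endpoint can be exercised directly (a sketch, assuming the locally running container from the Test Plan above):
```python
# Manual completion call against the local NIM's OpenAI-compatible endpoint.
# A real API key is typically not required for a locally hosted container.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.completions.create(
    model="meta/llama-3.1-8b-instruct",
    prompt="The capital of France is",
    max_tokens=8,
)
print(resp.choices[0].text)
```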
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# What does this PR do?
Addresses issue (#391)
- Adds json structured output for vLLM
- Enables structured output tests for vLLM
> Give me a recipe for Spaghetti Bolognaise:
```json
{
"recipe_name": "Spaghetti Bolognaise",
"preamble": "Ah, spaghetti bolognaise - the quintessential Italian dish that fills my kitchen with the aromas of childhood nostalgia. As a child, I would watch my nonna cook up a big pot of spaghetti bolognaise every Sunday, filling our small Italian household with the savory scent of simmering meat and tomatoes. The way the sauce would thicken and the spaghetti would al dente - it was love at first bite. And now, as a chef, I want to share that same love with you, so you can recreate these warm, comforting memories at home.",
"ingredients": [
"500g minced beef",
"1 medium onion, finely chopped",
"2 cloves garlic, minced",
"1 carrot, finely chopped",
" celery, finely chopped",
"1 (28 oz) can whole peeled tomatoes",
"1 tbsp tomato paste",
"1 tsp dried basil",
"1 tsp dried oregano",
"1 tsp salt",
"1/2 tsp black pepper",
"1/2 tsp sugar",
"1 lb spaghetti",
"Grated Parmesan cheese, for serving",
"Extra virgin olive oil, for serving"
],
"steps": [
"Heat a large pot over medium heat and add a generous drizzle of extra virgin olive oil.",
"Add the chopped onion, garlic, carrot, and celery and cook until the vegetables are soft and translucent, about 5-7 minutes.",
"Add the minced beef and cook until browned, breaking it up with a spoon as it cooks.",
"Add the tomato paste and cook for 1-2 minutes, stirring constantly.",
"Add the canned tomatoes, dried basil, dried oregano, salt, black pepper, and sugar. Stir well to combine.",
"Bring the sauce to a simmer and let it cook for 20-30 minutes, stirring occasionally, until the sauce has thickened and the flavors have melded together.",
"While the sauce cooks, bring a large pot of salted water to a boil and cook the spaghetti according to the package instructions until al dente. Reserve 1 cup of pasta water before draining the spaghetti.",
"Add the reserved pasta water to the sauce and stir to combine.",
"Combine the cooked spaghetti and sauce, tossing to coat the pasta evenly.",
"Serve hot, topped with grated Parmesan cheese and a drizzle of extra virgin olive oil.",
"Enjoy!"
]
}
```
Generated with the Llama-3.2-3B-Instruct model - pretty good for a 3B
parameter model 👍
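For reference, the same kind of output can be requested straight from the vLLM OpenAI-compatible server (started in the Test Plan below) via vLLM's `guided_json` extra-body parameter; a sketch with a trimmed schema, not the schema used by the tests:
```python
# Ask vLLM for JSON constrained to a schema using its guided_json extra body.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

recipe_schema = {
    "type": "object",
    "properties": {
        "recipe_name": {"type": "string"},
        "preamble": {"type": "string"},
        "ingredients": {"type": "array", "items": {"type": "string"}},
        "steps": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["recipe_name", "ingredients", "steps"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Give me a recipe for Spaghetti Bolognaise."}],
    extra_body={"guided_json": recipe_schema},
)
print(resp.choices[0].message.content)  # valid JSON conforming to the schema
```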
## Test Plan
`pytest -v -s
llama_stack/providers/tests/inference/test_text_inference.py -k
llama_3b-vllm_remote`
With the following setup:
```bash
# Environment
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export INFERENCE_PORT=8000
export VLLM_URL=http://localhost:8000/v1
# vLLM server
sudo docker run --gpus all \
-v $STORAGE_DIR/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=$(cat ~/.cache/huggingface/token)" \
-p 8000:$INFERENCE_PORT \
--ipc=host \
--net=host \
vllm/vllm-openai:v0.6.3.post1 \
--model $INFERENCE_MODEL
# llama-stack server
llama stack build --template remote-vllm --image-type conda && llama stack run distributions/remote-vllm/run.yaml \
--port 5001 \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
Results:
```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completions_structured_output[llama_3b-vllm_remote] SKIPPED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[llama_3b-vllm_remote] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[llama_3b-vllm_remote] PASSED
================================ 6 passed, 2 skipped, 120 deselected, 2 warnings in 13.26s ================================
```
## Sources
- https://github.com/vllm-project/vllm/discussions/8300
- By default, vLLM uses https://github.com/dottxt-ai/outlines for
structured outputs
[[1](32e7db2536/vllm/engine/arg_utils.py (L279-L280))]
## Before submitting
- [N/A] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [N/A?] Updated relevant documentation. Couldn't find any relevant
documentation; let me know if I've missed anything.
- [x] Wrote necessary unit or integration tests.
I find `test_structured_output` to be flaky. It's both a functionality
test and an accuracy test:
```
answer = AnswerFormat.model_validate_json(response.completion_message.content)
assert answer.first_name == "Michael"
assert answer.last_name == "Jordan"
assert answer.year_of_birth == 1963
assert answer.num_seasons_in_nba == 15
```
It's an accuracy test because it checks the values of first/last name,
birth year, and number of seasons.
I find that:
- llama-3.1-8b-instruct and llama-3.2-3b-instruct pass the functionality
portion
- llama-3.2-3b-instruct consistently fails the accuracy portion
(thinking MJ was in the NBA for 14 seasons)
- llama-3.1-8b-instruct occasionally fails the accuracy portion
Suggestions (not mutually exclusive):
1. Turn the test into a functionality-only test, skipping the value checks.
2. Split the test into a functionality version and an xfail accuracy
version (see the sketch after this list).
3. Add context to the prompt so the LLM can answer without relying on
facts memorized in its weights.
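A rough sketch of suggestion (2), for illustration only (fixture names and the request are approximations of the existing test, not code from this discussion):
```python
# Illustrative split: a functionality test that only requires valid,
# schema-conforming JSON, plus an accuracy test marked xfail because it
# depends on the model recalling facts correctly.
import pytest
from pydantic import BaseModel


class AnswerFormat(BaseModel):
    first_name: str
    last_name: str
    year_of_birth: int
    num_seasons_in_nba: int


async def ask_about_mj(inference_impl, inference_model):
    # Same structured-output request the current test makes (elided here).
    response = await inference_impl.chat_completion(...)
    return AnswerFormat.model_validate_json(response.completion_message.content)


@pytest.mark.asyncio
async def test_structured_output_functionality(inference_impl, inference_model):
    # Parsing succeeds -> the model produced valid JSON matching the schema.
    await ask_about_mj(inference_impl, inference_model)


@pytest.mark.asyncio
@pytest.mark.xfail(reason="depends on the model recalling MJ facts accurately")
async def test_structured_output_accuracy(inference_impl, inference_model):
    answer = await ask_about_mj(inference_impl, inference_model)
    assert (answer.first_name, answer.last_name) == ("Michael", "Jordan")
    assert answer.year_of_birth == 1963
    assert answer.num_seasons_in_nba == 15
```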
# What does this PR do?
Implements option (3) by adding context to the system prompt.
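Roughly, the kind of context this adds (illustrative wording only, not the exact prompt in the diff):
```python
# The facts the assertions check, stated up front so the model does not need
# to recall them from its weights (illustrative wording).
system_prompt = (
    "You are a helpful assistant. Michael Jordan was born in 1963 and played "
    "15 seasons in the NBA. Answer in the requested JSON format."
)
```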
## Test Plan
`pytest -s -v ... llama_stack/providers/tests/inference/ ... -k
structured_output`
## Before submitting
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# What does this PR do?
This PR adds a basic inference adapter for NVIDIA NIMs.

What it does:
- chat completion API
- tool calls
- streaming
- structured output
- logprobs
- support for hosted NIMs on integrate.api.nvidia.com
- support for downloaded NIM containers

What it does not do:
- completion API
- embedding API
- vision models
- builtin tools
- guarantee that the sampling strategies are correct
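For a quick smoke test outside pytest, something along these lines works against a stack wired to this adapter (a sketch; assumes a running llama-stack server on port 5001 with an NVIDIA-served model registered under the id shown):
```python
# Minimal chat completion through a llama-stack server that routes inference
# to the NVIDIA adapter. Model id is whatever was registered with the stack.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")

response = client.inference.chat_completion(
    model_id="Llama3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=False,
)
print(response.completion_message.content)
```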
## Feature/Issue validation/testing/test plan
`pytest -s -v --providers inference=nvidia
llama_stack/providers/tests/inference/ --env NVIDIA_API_KEY=...`
All tests should pass. There are Pydantic v1 warnings.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
- [x] Did you write any new necessary tests?
Thanks for contributing 🎉!
# What does this PR do?
This PR kills the notion of "pure passthrough" remote providers. You
can no longer specify a single provider as remote; you must specify a
whole distribution (stack) as remote.
This PR also significantly fixes / upgrades testing infrastructure so
you can now test against a remotely hosted stack server by just doing
```bash
pytest -s -v -m remote test_agents.py \
--inference-model=Llama3.1-8B-Instruct --safety-shield=Llama-Guard-3-1B \
--env REMOTE_STACK_URL=http://localhost:5001
```
Also fixed `test_agents_persistence.py` (which was broken) and killed
some deprecated testing functions.
## Test Plan
All the tests.
# What does this PR do?
This PR changes the way a model id gets translated to the final model name
that is passed through to the provider.
Major changes include:
1) Providers are responsible for registering an object and, as part of
that registration, returning the object with the correct provider-specific
model name in `provider_resource_id`.
2) To help with looking objects up by their different names, a new ModelLookup
class is created.

Tested all inference providers, including together, fireworks, vllm,
ollama, meta reference, and bedrock.
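For illustration, the registration contract in (1) looks roughly like this on the provider side (a sketch with approximate names, not the code in this PR):
```python
# Sketch: during registration the provider maps the user-facing model id to
# its own naming scheme and returns the object with provider_resource_id set,
# so later lookups can resolve either name.
_MODEL_ALIASES = {
    # user-facing identifier -> provider-specific name (example mapping)
    "Llama3.1-8B-Instruct": "meta/llama-3.1-8b-instruct",
}


class ExampleInferenceAdapter:
    async def register_model(self, model):
        provider_name = _MODEL_ALIASES.get(model.identifier)
        if provider_name is None:
            raise ValueError(f"Model {model.identifier} is not supported")
        model.provider_resource_id = provider_name
        return model
```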