i find `test_structured_output` to be flakey. it's both a functionality
and accuracy test -
```
answer = AnswerFormat.model_validate_json(response.completion_message.content)
assert answer.first_name == "Michael"
assert answer.last_name == "Jordan"
assert answer.year_of_birth == 1963
assert answer.num_seasons_in_nba == 15
```
it's an accuracy test because it checks the value of first/last name,
birth year, and num seasons.
i find that -
- llama-3.1-8b-instruct and llama-3.2-3b-instruct pass the functionality
portion
- llama-3.2-3b-instruct consistently fails the accuracy portion
(thinking MJ was in the NBA for 14 seasons)
- llama-3.1-8b-instruct occasionally fails the accuracy portion
suggestions (not mutually exclusive) -
1. turn the test into functionality only, skip the value checks
2. split the test into a functionality version and an xfail accuracy
version
3. add context to the prompt so the llm can answer without accessing
embedded memory
# What does this PR do?
implements option (3) by adding context to the system prompt.
## Test Plan
`pytest -s -v ... llama_stack/providers/tests/inference/ ... -k
structured_output`
## Before submitting
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# What does this PR do?
this PR adds a basic inference adapter to NVIDIA NIMs
what it does -
- chat completion api
- tool calls
- streaming
- structured output
- logprobs
- support hosted NIM on integrate.api.nvidia.com
- support downloaded NIM containers
what it does not do -
- completion api
- embedding api
- vision models
- builtin tools
- have certainty that sampling strategies are correct
## Feature/Issue validation/testing/test plan
`pytest -s -v --providers inference=nvidia
llama_stack/providers/tests/inference/ --env NVIDIA_API_KEY=...`
all tests should pass. there are pydantic v1 warnings.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
- [x] Did you write any new necessary tests?
Thanks for contributing 🎉!
# What does this PR do?
This PR kills the notion of "pure passthrough" remote providers. You
cannot specify a single provider you must specify a whole distribution
(stack) as remote.
This PR also significantly fixes / upgrades testing infrastructure so
you can now test against a remotely hosted stack server by just doing
```bash
pytest -s -v -m remote test_agents.py \
--inference-model=Llama3.1-8B-Instruct --safety-shield=Llama-Guard-3-1B \
--env REMOTE_STACK_URL=http://localhost:5001
```
Also fixed `test_agents_persistence.py` (which was broken) and killed
some deprecated testing functions.
## Test Plan
All the tests.
This PR changes the way model id gets translated to the final model name
that gets passed through the provider.
Major changes include:
1) Providers are responsible for registering an object and as part of
the registration returning the object with the correct provider specific
name of the model provider_resource_id
2) To help with the common look ups different names a new ModelLookup
class is created.
Tested all inference providers including together, fireworks, vllm,
ollama, meta reference and bedrock