# What does this PR do?
this PR adds a basic inference adapter to NVIDIA NIMs
what it does -
- chat completion api
- tool calls
- streaming
- structured output
- logprobs
- support hosted NIM on integrate.api.nvidia.com
- support downloaded NIM containers
what it does not do -
- completion api
- embedding api
- vision models
- builtin tools
- have certainty that sampling strategies are correct
## Feature/Issue validation/testing/test plan
`pytest -s -v --providers inference=nvidia
llama_stack/providers/tests/inference/ --env NVIDIA_API_KEY=...`
all tests should pass. there are pydantic v1 warnings.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Did you read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Was this discussed/approved via a Github issue? Please add a link
to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes?
- [x] Did you write any new necessary tests?
Thanks for contributing 🎉!
# What does this PR do?
Update the llama model supported list for Ollama.
- [x] Addresses issue (#462)
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
# What does this PR do?
This PR fixes some of the issues with our telemetry setup to enable logs
to be delivered to opentelemetry and jaeger. Main fixes
1) Updates the open telemetry provider to use the latest oltp exports
instead of deprected ones.
2) Adds a tracing middleware, which injects traces into each HTTP
request that the server recieves and this is going to be the root trace.
Previously, we did this in the create_dynamic_route method, which is
actually not the actual exectuion flow, but more of a config and this
causes the traces to end prematurely. Through middleware, we plugin the
trace start and end at the right location.
3) We manage our own methods to create traces and spans and this does
not fit well with Opentelemetry SDK since it does not support provide a
way to take in traces and spans that are already created. it expects us
to use the SDK to create them. For now, I have a hacky approach of just
maintaining a map from our internal telemetry objects to the open
telemetry specfic ones. This is not the ideal solution. I will explore
other ways to get around this issue. for now, to have something that
works, i am going to keep this as is.
Addresses: #509
# What does this PR do?
- modify openapi generator to add coming soon tag for unimplemented api
- sphinx-redocs extension for openapi spec to readthedocs page
## Test Plan
https://github.com/user-attachments/assets/b4c7eebc-2361-4198-a987-dbfbcff914cf
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [ ] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
- updated the notebooks to reflect past changes up to llama-stack 0.0.53
- updated readme to provide accurate and up-to-date info
- improve the current zero to hero by integrating an example using
together api
## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---------
Co-authored-by: Sanyam Bhutani <sanyambhutani@meta.com>
# What does this PR do?
Safety provider `inline::meta-reference` is now deprecated. However, we
* aren't checking / printing the deprecation message in `llama stack
build`
* make the deprecated (unusable) provider
So I (1) added checking and (2) made `inline::llama-guard` the default
## Test Plan
Before
```
Traceback (most recent call last):
File "/home/dalton/.conda/envs/nov22/bin/llama", line 8, in <module>
sys.exit(main())
File "/home/dalton/all/llama-stack/llama_stack/cli/llama.py", line 46, in main
parser.run(args)
File "/home/dalton/all/llama-stack/llama_stack/cli/llama.py", line 40, in run
args.func(args)
File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 177, in _run_stack_build_command
self._run_stack_build_command_from_build_config(build_config)
File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 305, in _run_stack_build_command_from_build_config
self._generate_run_config(build_config, build_dir)
File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 226, in _generate_run_config
config_type = instantiate_class_type(
File "/home/dalton/all/llama-stack/llama_stack/distribution/utils/dynamic.py", line 12, in instantiate_class_type
module = importlib.import_module(module_name)
File "/home/dalton/.conda/envs/nov22/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1004, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'llama_stack.providers.inline.safety.meta_reference'
```
After
```
Traceback (most recent call last):
File "/home/dalton/.conda/envs/nov22/bin/llama", line 8, in <module>
sys.exit(main())
File "/home/dalton/all/llama-stack/llama_stack/cli/llama.py", line 46, in main
parser.run(args)
File "/home/dalton/all/llama-stack/llama_stack/cli/llama.py", line 40, in run
args.func(args)
File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 177, in _run_stack_build_command
self._run_stack_build_command_from_build_config(build_config)
File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 309, in _run_stack_build_command_from_build_config
self._generate_run_config(build_config, build_dir)
File "/home/dalton/all/llama-stack/llama_stack/cli/stack/build.py", line 228, in _generate_run_config
raise InvalidProviderError(p.deprecation_error)
llama_stack.distribution.resolver.InvalidProviderError:
Provider `inline::meta-reference` for API `safety` does not work with the latest Llama Stack.
- if you are using Llama Guard v3, please use the `inline::llama-guard` provider instead.
- if you are using Prompt Guard, please use the `inline::prompt-guard` provider instead.
- if you are using Code Scanner, please use the `inline::code-scanner` provider instead.
```
<img width="469" alt="Screenshot 2024-11-22 at 4 10 24 PM"
src="https://github.com/user-attachments/assets/8c2e09fe-379a-4504-b246-7925f80a6ed6">
## Sources
Please link relevant resources if necessary.
## Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [ ] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
# What does this PR do?
This PR moves all print statements to use logging. Things changed:
- Had to add `await start_trace("sse_generator")` to server.py to
actually get tracing working. else was not seeing any logs
- If no telemetry provider is provided in the run.yaml, we will write to
stdout
- by default, the logs are going to be in JSON, but we expose an option
to configure to output in a human readable way.
# What does this PR do?
Fix fp8 quantization script.
## Test Plan
```
sh run_quantize_checkpoint.sh localhost fp8 /home/yll/fp8_test/ /home/yll/fp8_test/quantized_2 /home/yll/fp8_test/tokenizer.model 1 1
```
## Sources
Please link relevant resources if necessary.
## Before submitting
- [x] This PR fixes a typo or improves the docs (you can dismiss the
other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [x] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
Co-authored-by: Yunlu Li <yll@meta.com>