# What does this PR do?
This PR updates the inline vLLM inference provider in several
significant ways:
* Models are now attached at run time to instances of the provider via
the `.../models` API instead of hard-coding the model's full name into
the provider's YAML configuration.
* The provider now supports models beyond the Meta Llama family. Any model
that vLLM supports can be loaded by passing Hugging Face coordinates in the
`provider_model_id` field, and custom fine-tuned versions of Meta Llama
models can be loaded by specifying a path on local disk in
`provider_model_id` (see the sketch after this list).
* To implement full chat completions support, including tool calling and
constrained decoding, the provider now routes the `chat_completions` API
to a captive (i.e., called directly in-process, not via HTTPS) instance
of vLLM's OpenAI-compatible server.
* The `logprobs` parameter and the completions API also work.
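
For illustration, here is a rough sketch of how run-time model registration and chat completion might look from the Python client with these changes. The base URL, provider ID (`vllm`), model names, and local path below are placeholder assumptions rather than values from this PR, and the method names are those of the `llama_stack_client` package; adjust them to your setup.

```python
# Rough sketch, not part of this PR: register models at run time and call chat
# completion against the inline vLLM provider. Assumes a running llama-stack
# server and that the inline provider is configured with provider ID "vllm";
# the base URL, model names, and local path are placeholders.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Any model vLLM supports: pass its Hugging Face coordinates as provider_model_id.
client.models.register(
    model_id="Qwen/Qwen2.5-1.5B-Instruct",          # placeholder model
    provider_id="vllm",
    provider_model_id="Qwen/Qwen2.5-1.5B-Instruct",
)

# A fine-tuned Meta Llama model: pass a local disk path as provider_model_id.
client.models.register(
    model_id="my-finetuned-llama",                   # placeholder alias
    provider_id="vllm",
    provider_model_id="/models/llama-3.2-3b-finetune",  # placeholder path
)

# Chat completion is routed in-process to vLLM's OpenAI-compatible server.
response = client.inference.chat_completion(
    model_id="my-finetuned-llama",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.completion_message.content)
```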
## Test Plan
Existing tests in
`llama_stack/providers/tests/inference/test_text_inference.py` have good
coverage of the new functionality. These tests can be invoked as
follows:
```
cd llama-stack && pytest \
-vvv \
llama_stack/providers/tests/inference/test_text_inference.py \
--providers inference=vllm \
--inference-model meta-llama/Llama-3.2-3B-Instruct
====================================== test session starts ======================================
platform linux -- Python 3.12.8, pytest-8.3.4, pluggy-1.5.0 -- /mnt/datadisk1/freiss/llama/env/bin/python3.12
cachedir: .pytest_cache
metadata: {'Python': '3.12.8', 'Platform': 'Linux-6.8.0-1016-ibm-x86_64-with-glibc2.39', 'Packages': {'pytest': '8.3.4', 'pluggy': '1.5.0'}, 'Plugins': {'anyio': '4.8.0', 'html': '4.1.1', 'metadata': '3.1.1', 'asyncio': '0.25.2'}, 'JAVA_HOME': '/usr/lib/jvm/java-8-openjdk-amd64'}
rootdir: /mnt/datadisk1/freiss/llama/llama-stack
configfile: pyproject.toml
plugins: anyio-4.8.0, html-4.1.1, metadata-3.1.1, asyncio-0.25.2
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 9 items
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_model_list[-vllm] PASSED [ 11%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion[-vllm] PASSED [ 22%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_logprobs[-vllm] PASSED [ 33%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_completion_structured_output[-vllm] PASSED [ 44%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_non_streaming[-vllm] PASSED [ 55%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[-vllm] PASSED [ 66%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_streaming[-vllm] PASSED [ 77%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling[-vllm] PASSED [ 88%]
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_chat_completion_with_tool_calling_streaming[-vllm] PASSED [100%]
=========================== 9 passed, 13 warnings in 97.18s (0:01:37) ===========================
```
## Sources
## Before submitting
- [X] Ran pre-commit to handle lint / formatting issues.
- [X] Read the [contributor
guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md),
Pull Request section?
- [ ] Updated relevant documentation.
- [ ] Wrote necessary unit or integration tests.
---------
Co-authored-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
llama-models should have extremely minimal cruft. Its sole purpose
should be didactic -- show the simplest implementation of the llama
models and document the prompt formats, etc.
This PR is the complement to
https://github.com/meta-llama/llama-models/pull/279
## Test Plan
Ensure all `llama` CLI `model` sub-commands work:
```bash
llama model list
llama model download --model-id ...
llama model prompt-format -m ...
```
Ran tests:
```bash
cd tests/client-sdk
LLAMA_STACK_CONFIG=fireworks pytest -s -v inference/
LLAMA_STACK_CONFIG=fireworks pytest -s -v vector_io/
LLAMA_STACK_CONFIG=fireworks pytest -s -v agents/
```
Created a fresh venv with `uv venv && source .venv/bin/activate`, ran
`llama stack build --template fireworks --image-type venv` followed by
`llama stack run together --image-type venv`, and confirmed the server starts and runs.
Also verified that the OpenAPI generator runs and that there is no change
in the generated files as a result:
```bash
cd docs/openapi_generator
sh run_openapi_generator.sh
```
# What does this PR do?
The lint check on the main branch is failing. This fixes the lint check after we
moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We
need to move to a `ruff.toml` file as well as fix and ignore some
additional checks.
Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
# What does this PR do?
As the title says.
## Test Plan
This needs
8752149f58
to land as well, so the next package (0.0.54) will make this work properly.
The test is:
```bash
pytest -v -s -m "llama_3b and meta_reference" test_model_registration.py
```
# What does this PR do?
Automatically generates the following for the distributions:
- `build.yaml`
- `run.yaml`
- `run-with-safety.yaml`
- parts of the markdown docs
## Test Plan
At this point, this only updates the YAMLs and the docs. Some testing
(especially with ollama and vllm) has been performed, but much more
thorough testing is needed.