Commit graph

15 commits

Author SHA1 Message Date
Ashwin Bharambe
a5d6ab16b2 fix: meta-reference parallel utils bug, use isinstance not equality 2025-04-24 11:27:49 -07:00
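A minimal sketch of the pattern this commit message names, using hypothetical message types (the real classes live in parallel_utils.py and may differ):

```python
# Hedged illustration only: TaskRequest/CancelSentinel are hypothetical
# stand-ins, not the actual types touched by this commit.
from dataclasses import dataclass


@dataclass
class TaskRequest:
    prompt: str


class CancelSentinel:
    pass


def classify(message: object) -> str:
    # Decide what kind of message was received with isinstance() rather than
    # comparing it for equality against a class or a sentinel instance.
    if isinstance(message, CancelSentinel):
        return "cancel"
    if isinstance(message, TaskRequest):
        return "run"
    raise ValueError(f"unexpected message: {message!r}")
```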
Ashwin Bharambe
f34f22f8c7
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is evaluations targeting a local inference engine
(like meta-reference or vllm), where batch APIs provide a substantial
speedup.

Why did I not add this to `Api.batch_inference` though? That just
resulted in a _lot_ more book-keeping given the structure of Llama
Stack. Had I done that, I would have needed to create a notion of a
"batch model" resource, set up routing based on that, etc. This does not
sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we
can keep it for true _asynchronous_ execution. So you can submit
requests, and it can return a Job instance, etc.
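For illustration, a minimal sketch of what the two methods might look like on the Inference protocol; the parameter names and response shapes here are assumptions, not the exact API added by this PR:

```python
# Hedged sketch: names like content_batch/messages_batch and the dict response
# shapes are assumptions for illustration, not the confirmed API surface.
from typing import Protocol


class BatchInferenceMethods(Protocol):
    async def batch_completion(
        self,
        model_id: str,
        content_batch: list[str],  # one prompt per batch element (assumed shape)
        sampling_params: dict | None = None,
    ) -> list[dict]:  # one completion per prompt
        ...

    async def batch_chat_completion(
        self,
        model_id: str,
        messages_batch: list[list[dict]],  # one message list per conversation (assumed shape)
        sampling_params: dict | None = None,
    ) -> list[dict]:  # one chat completion per conversation
        ...
```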

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
2025-04-12 11:41:12 -07:00
Ashwin Bharambe
530d4bdfe1
refactor: move all llama code to models/llama out of meta reference (#1887)
# What does this PR do?

Move around bits. This makes the copies from llama-models _much_ easier
to maintain and ensures we don't entangle meta-reference specific
tidbits into llama-models code even by accident.

Also, kills the meta-reference-quantized-gpu distro and rolls
quantization deps into meta-reference-gpu.

## Test Plan

```
LLAMA_MODELS_DEBUG=1 \
  with-proxy llama stack run meta-reference-gpu \
  --env INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct \
   --env INFERENCE_CHECKPOINT_DIR=<DIR> \
   --env MODEL_PARALLEL_SIZE=4 \
   --env QUANTIZATION_TYPE=fp8_mixed
```

Start a server with and without quantization. Point integration tests to
it using:

```
pytest -s -v  tests/integration/inference/test_text_inference.py \
   --stack-config http://localhost:8321 --text-model meta-llama/Llama-4-Scout-17B-16E-Instruct
```
2025-04-07 15:03:58 -07:00
yyymeta
b79e0435de
fix: avoid tensor memory error (#1688)
# What does this PR do?

We randomly get errors like the following; it's most likely due to
accessing an object that has already been deallocated:

```

E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] Traceback (most recent call last):
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     fn(i, *args)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 611, in _wrap
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     ret = record(fn)(*args_)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return f(*args, **kwargs)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/internal-llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 249, in worker_process_entrypoint
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     task = req_gen.send(result)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/internal-llama-stack/llama_stack/providers/inline/inference/meta_reference/parallel_utils.py", line 156, in retrieve_requests
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     torch.distributed.broadcast_object_list(
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return func(*args, **kwargs)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3504, in broadcast_object_list
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     object_list[i] = _tensor_to_object(obj_view, obj_size, group)
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]   File "/home/yyy/.conda/envs/myenv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2961, in _tensor_to_object
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]     return _unpickler(io.BytesIO(buf)).load()
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732] EOFError: Ran out of input
E0318 12:55:24.472000 1562188 site-packages/torch/distributed/elastic/multiprocessing/api.py:732]
Process SpawnProcess-1:
Traceback (most recent call last):
```
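For context, a minimal sketch of the worker-side receive pattern the traceback points at (`retrieve_requests` calling `torch.distributed.broadcast_object_list`); this only illustrates that call site, not the fix applied in this PR:

```python
# Illustration of the call site only; assumes torch.distributed has already
# been initialized (e.g. dist.init_process_group(...)) and that rank 0
# broadcasts exactly one pickled request object per call.
import torch.distributed as dist


def receive_request(group=None):
    buf = [None]  # pre-sized list; broadcast fills it in place on non-src ranks
    dist.broadcast_object_list(buf, src=0, group=group)
    request = buf[0]  # the unpickled object; the EOFError above occurred during unpickling
    return request
```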

## Test Plan
Start the server, then run:

```
llama-stack-client eval run-benchmark mmmu_v1  --model-id meta-llama/Llama-4-17B-Omni-Instruct  --output-dir /tmp/mmmu_standard --num-examples 30
```

2025-03-18 16:17:29 -07:00
Ashwin Bharambe
02066591b8 refactor: move generation.py to llama3 2025-03-03 13:46:50 -08:00
Sébastien Han
6fa257b475
chore(lint): update Ruff ignores for project conventions and maintainability (#1184)
- Added new ignores from flake8-bugbear (`B007`, `B008`)
- Ignored `C901` (high function complexity) for now, pending review
- Maintained PyTorch conventions (`N812`, `N817`)
- Allowed `E731` (lambda assignments) for flexibility
- Consolidated existing ignores (`E402`, `E501`, `F405`, `C408`, `N812`)
- Documented rationale for each ignored rule

This keeps our linting aligned with project needs while tracking
potential fixes.
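For reference, hedged examples of two of the conventions these ignores keep allowed; these are generic illustrations of the rule codes, not code from this repository:

```python
# N812 ("lowercase imported as non-lowercase"): the standard PyTorch alias.
import torch.nn.functional as F

# E731 ("do not assign a lambda expression"): allowed here for short inline helpers.
normalize = lambda x: F.normalize(x, dim=-1)
```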

Signed-off-by: Sébastien Han <seb@redhat.com>

2025-02-28 09:36:49 -08:00
Sébastien Han
c223b1862b
fix: resolve type hint issues and import dependencies (#1176)
# What does this PR do?

- Fixed type hinting and missing imports across multiple modules.
- Improved compatibility by using `TYPE_CHECKING` for conditional
imports.
- Updated `pyproject.toml` to enforce stricter linting.
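A minimal sketch of the `TYPE_CHECKING` pattern mentioned above, with an illustrative module name (not one from this PR):

```python
# The guarded import is only evaluated by type checkers, so annotations can
# reference HeavyClient without importing some_heavy_module at runtime.
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from some_heavy_module import HeavyClient  # illustrative name


def make_request(client: HeavyClient) -> None:
    # With `from __future__ import annotations`, the annotation stays a string
    # at runtime, so no import cycle or heavy dependency is introduced.
    ...
```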

Signed-off-by: Sébastien Han <seb@redhat.com>

2025-02-25 11:06:47 -08:00
Sébastien Han
e4a1579e63
build: format codebase imports using ruff linter (#1028)
# What does this PR do?

- Configured ruff linter to automatically fix import sorting issues.
- Set --exit-non-zero-on-fix to ensure non-zero exit code when fixes are
applied.
- Enabled the 'I' selection to focus on import-related linting rules.
- Ran the linter, and formatted all codebase imports accordingly.
- Removed the black dep from the "dev" group since we use ruff

Signed-off-by: Sébastien Han <seb@redhat.com>

2025-02-13 10:06:21 -08:00
Yuan Tang
34ab7a3b6c
Fix precommit check after moving to ruff (#927)
The lint check on the main branch is failing. This fixes the lint check after we
moved to ruff in https://github.com/meta-llama/llama-stack/pull/921. We also
needed to move to a `ruff.toml` file and to fix or ignore some additional checks.

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
2025-02-02 06:46:45 -08:00
Ashwin Bharambe
23f1980f9c Fix meta-reference GPU implementation for inference 2025-01-22 18:31:59 -08:00
Ashwin Bharambe
f19eb8eee3 Update types in parallel_utils for meta-reference-gpu impl 2024-12-19 13:58:41 -08:00
Botao Chen
36b4fe02cc
[4/n][torchtune integration] support lazy load model during inference (#620)
## What does this PR do?
In this PR, we refactor the meta-reference inference logic to support:
- loading the model when the model is registered instead of when the server spins up
- running inference on a fine-tuned model checkpoint on top of a native llama model

## Why these changes are needed
To solve the existing pain points:
- users cannot lazy-load the model and hot-swap the inference checkpoint after spinning up the server
- this blocks doing inference and eval on the same server for a fine-tuned checkpoint after post-training
- users cannot run inference on a fine-tuned checkpoint on top of native llama models

## Expected user experience changes
- The inference model won't be loaded when the server spins up. Instead, it is loaded when the model is registered. If the user adds the model as a models resource in run.yaml, it will be registered and loaded automatically when the server starts. There is an optional flag 'skip_initialize' in the model metadata to skip loading the model during registration.
- There is an optional flag 'llama_model' in the model metadata to identify the base model, used for validation and to initialize the model architecture; the model identifier no longer needs to be a native llama model.
- The default inference model name changes from 'meta-llama/Llama-3.2-3B-Instruct' to 'Llama3.2-3B-Instruct':
  - It aligns with the checkpoint folder name after running 'llama model download'.
  - It aligns with the descriptor name defined in the llama-models SKU list (bf5b0c4fe7/models/datatypes.py, L95).
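A hedged sketch of the registration flow described above using the Python client instead of the CLI from the test plan; the exact `register()` keyword names and metadata handling are assumptions for illustration:

```python
# Assumes a reachable server; keyword names below are best-effort assumptions,
# not a verified llama-stack-client signature.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5000")

client.models.register(
    model_id="Llama3.2-3B-Instruct",  # descriptor name, aligned with `llama model download`
    metadata={
        "llama_model": "Llama3.2-3B-Instruct",  # base model used for validation / arch init
        "skip_initialize": False,  # set True to register without loading the weights
    },
)
```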


## Test
Run `python llama_stack/scripts/distro_codegen.py`.


**Run the unit tests**
- `torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_text_inference.py`
- `torchrun $CONDA_PREFIX/bin/pytest -v -s -k "meta_reference" --inference-model="Llama3.1-8B-Instruct" ./llama_stack/providers/tests/inference/test_model_registration.py`


**test post-training experience**
On the server side, run `llama stack run llama_stack/templates/experimental-post-training/run.yaml`.
The server spins up without the model loaded.

<img width="812" alt="Screenshot 2024-12-17 at 1 24 50 PM"
src="https://github.com/user-attachments/assets/ce1f606b-3b6f-452f-b48e-b3761ffd90f3"
/>

On the client side, run `llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 models register Llama3.2-3B-Instruct`.
The model is registered successfully and loaded.
<img width="1111" alt="Screenshot 2024-12-17 at 1 26 30 PM"
src="https://github.com/user-attachments/assets/56e02131-cf7d-4de5-8f63-fbdcb8c55c26"
/>


<img width="1541" alt="Screenshot 2024-12-17 at 1 26 09 PM"
src="https://github.com/user-attachments/assets/a83255a1-20f5-40a2-af51-55641410a115"
/>

If "skip_initialize" is set in the metadata, the model is registered but isn't loaded.

On the client side, run `llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?"`.

Inference on the model succeeds.
<img width="1121" alt="Screenshot 2024-12-17 at 1 27 33 PM"
src="https://github.com/user-attachments/assets/8e708545-3fe7-4a73-8754-1470fa5f1e75"
/>

**test inference experience**
Run `llama stack run llama_stack/templates/meta-reference-gpu/run.yaml`.
The model is loaded since it is in the resources list in run.yaml.
<img width="1537" alt="Screenshot 2024-12-17 at 1 30 19 PM"
src="https://github.com/user-attachments/assets/5c8af817-66eb-43f8-bf4c-f5e24b0a12c6"
/>

On the client side, run `llama-stack-client --endpoint http://devgpu018.nha2.facebook.com:5000 inference chat-completion --message "hello, what model are you?"`.
Inference succeeds.
<img width="1123" alt="Screenshot 2024-12-17 at 1 31 08 PM"
src="https://github.com/user-attachments/assets/471809aa-c65e-46dc-a37e-7094fb857f97"
/>



## Inference on a fine-tuned model
**Register a fine-tuned model produced by the post-training API (torchtune)**
- the model is registered and loaded successfully
- the model shows up in the model list
<img width="974" alt="Screenshot 2024-12-18 at 3 56 33 PM"
src="https://github.com/user-attachments/assets/2994b4f5-4fa9-40c6-acc6-4b971479f3e2"
/>

**run inference**

<img width="977" alt="Screenshot 2024-12-18 at 3 57 59 PM"
src="https://github.com/user-attachments/assets/d117abbc-b2a0-41d8-a028-1a13128787b2"
/>
2024-12-18 16:30:53 -08:00
Dinesh Yeduguru
6395dadc2b
use logging instead of prints (#499)
# What does this PR do?

This PR moves all print statements to use logging. Things changed:
- Had to add `await start_trace("sse_generator")` to server.py to actually get tracing working; otherwise no logs were showing up.
- If no telemetry provider is configured in run.yaml, we write to stdout.
- By default, logs are emitted as JSON, but we expose an option to output them in a human-readable format.
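A minimal sketch of the print-to-logging move described above, using only the standard library; the logger name and the JSON/human toggle are illustrative, not the actual telemetry wiring in this PR:

```python
# Illustrative only: a print() replaced by a logger that emits JSON by default
# and plain text when a human-readable option is enabled.
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llama_stack.server")


def log_event(message: str, human_readable: bool = False) -> None:
    if human_readable:
        logger.info(message)
    else:
        logger.info(json.dumps({"message": message}))
```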
2024-11-21 11:32:53 -08:00
Xi Yan
ba82021d4b precommit 2024-11-08 17:58:58 -08:00
Ashwin Bharambe
694c142b89
Add provider deprecation support; change directory structure (#397)
* Add provider deprecation support; change directory structure

* fix a couple dangling imports

* move the meta_reference safety dir also
2024-11-07 13:04:53 -08:00
Renamed from llama_stack/providers/inline/meta_reference/inference/parallel_utils.py