llama-stack/llama_stack/distribution
Ashwin Bharambe · f34f22f8c7 · 2025-04-12 11:41:12 -07:00
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is evaluations that target a local inference engine
(like meta-reference or vllm), where batch APIs provide a substantial
speedup.
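
As a rough sketch, calling the new methods might look like the following; the `content_batch`/`messages_batch` parameter names and the client-side surface are assumptions modeled on the non-batch API, not confirmed signatures.

```python
# Hypothetical usage sketch; parameter names (content_batch,
# messages_batch) are assumed from the non-batch API and may differ
# from the signatures actually added in this PR.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# One forward pass over a whole batch of plain completions
batch = client.inference.batch_completion(
    model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    content_batch=["Say hello.", "Name a prime number."],
)

# Same idea for chat: one list of messages per batch element
chat_batch = client.inference.batch_chat_completion(
    model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages_batch=[
        [{"role": "user", "content": "Say hello."}],
        [{"role": "user", "content": "Name a prime number."}],
    ],
)
```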

Why did I not add this to `Api.batch_inference`, though? That would
have meant a _lot_ more book-keeping given the structure of Llama
Stack: I would have needed to create a notion of a "batch model"
resource, set up routing based on it, and so on. That did not sound
ideal.

So what's the future of the batch inference API? I am not sure. Maybe
we can keep it for true _asynchronous_ execution, where you submit
requests and get back a Job instance you can poll for results.
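
For concreteness, a Job-style interface could look roughly like the sketch below; none of these names exist in Llama Stack today, this is purely an illustration of the idea.

```python
# Purely illustrative sketch of an asynchronous, Job-based batch API;
# none of these names are part of Llama Stack today.
from typing import List, Protocol


class Job(Protocol):
    job_id: str

    async def status(self) -> str:
        """Return e.g. 'scheduled', 'running', or 'completed'."""
        ...

    async def results(self) -> List[str]:
        """Return the generations once the job has completed."""
        ...


class AsyncBatchInference(Protocol):
    async def submit_batch_completion(
        self, model_id: str, content_batch: List[str]
    ) -> Job:
        """Enqueue the batch and return immediately with a Job handle."""
        ...
```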

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
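
For example (the test path and pytest flags here are assumptions; adjust to wherever the batch inference tests actually live):

```bash
# Illustrative invocation; the path and flags are assumptions
pytest -sv tests/integration/inference/test_batch_inference.py \
  --stack-config=http://localhost:8321 \
  --text-model=$INFERENCE_MODEL
```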
| Name | Last commit | Date |
|------|-------------|------|
| `routers` | feat: add batch inference API to llama stack inference (#1945) | 2025-04-12 11:41:12 -07:00 |
| `server` | fix: Use CONDA_DEFAULT_ENV presence as a flag to use conda mode (#1555) | 2025-03-27 17:13:22 -04:00 |
| `store` | fix: handle registry errors gracefully (#1732) | 2025-03-20 15:24:07 -07:00 |
| `ui` | feat: Add a direct (non-agentic) RAG option to the Playground RAG page (#1940) | 2025-04-11 10:16:10 -07:00 |
| `utils` | refactor: move missing tests to test directory (#1892) | 2025-04-08 18:54:00 -07:00 |
| `__init__.py` | API Updates (#73) | 2024-09-17 19:51:35 -07:00 |
| `access_control.py` | feat: make sure agent sessions are under access control (#1737) | 2025-03-21 07:31:16 -07:00 |
| `build.py` | refactor: simplify command execution and remove PTY handling (#1641) | 2025-03-17 15:03:14 -07:00 |
| `build_conda_env.sh` | chore: remove straggler references to llama-models (#1345) | 2025-03-01 14:26:03 -08:00 |
| `build_container.sh` | fix: Add missing gcc in container build. Fixes #1716 (#1727) | 2025-03-20 15:50:56 -04:00 |
| `build_venv.sh` | chore: remove straggler references to llama-models (#1345) | 2025-03-01 14:26:03 -08:00 |
| `client.py` | chore: move all Llama Stack types from llama-models to llama-stack (#1098) | 2025-02-14 09:10:59 -08:00 |
| `common.sh` | fix: Fixing some small issues with the build scripts (#1132) | 2025-02-19 22:20:49 -08:00 |
| `configure.py` | feat: add provider API for listing and inspecting provider info (#1429) | 2025-03-13 15:07:21 -07:00 |
| `datatypes.py` | feat: ability to execute external providers (#1672) | 2025-04-09 10:30:41 +02:00 |
| `distribution.py` | feat: ability to execute external providers (#1672) | 2025-04-09 10:30:41 +02:00 |
| `inspect.py` | chore: deprecate /v1/inspect/providers (#1678) | 2025-03-19 20:27:06 -07:00 |
| `library_client.py` | fix(telemetry): library client does not log span (#1833) | 2025-03-29 14:55:31 -07:00 |
| `providers.py` | fix: add shutdown method for ProviderImpl (#1670) | 2025-03-17 14:55:40 -07:00 |
| `request_headers.py` | feat(server): add attribute based access control for resources (#1703) | 2025-03-19 21:28:52 -07:00 |
| `resolver.py` | feat: ability to execute external providers (#1672) | 2025-04-09 10:30:41 +02:00 |
| `stack.py` | fix: ensure resource registration arguments are typed (#1941) | 2025-04-11 09:25:57 -07:00 |
| `start_stack.sh` | docs: Update docs and fix warning in start-stack.sh (#1937) | 2025-04-11 16:26:17 -07:00 |