llama-stack/llama_stack/distribution
Ashwin Bharambe · f34f22f8c7 · 2025-04-12 11:41:12 -07:00
feat: add batch inference API to llama stack inference (#1945)
# What does this PR do?

This PR adds two methods to the Inference API:
- `batch_completion`
- `batch_chat_completion`

The motivation is evaluations that target a local inference engine
(like meta-reference or vllm), where batch APIs provide a substantial
speedup.
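
As a rough sketch, calling the new methods might look like the following; the `content_batch`/`messages_batch` parameter names and the client-side surface are assumptions modeled on the non-batch API, not confirmed signatures.

```python
# Hypothetical usage sketch; parameter names (content_batch,
# messages_batch) are assumed from the non-batch API and may differ
# from the signatures actually added in this PR.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# One forward pass over a whole batch of plain completions
batch = client.inference.batch_completion(
    model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    content_batch=["Say hello.", "Name a prime number."],
)

# Same idea for chat: one list of messages per batch element
chat_batch = client.inference.batch_chat_completion(
    model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages_batch=[
        [{"role": "user", "content": "Say hello."}],
        [{"role": "user", "content": "Name a prime number."}],
    ],
)
```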

Why did I not add this to `Api.batch_inference`, though? That would
have meant a _lot_ more book-keeping given the structure of Llama
Stack: I would have needed to create a notion of a "batch model"
resource, set up routing based on it, and so on. That did not sound
ideal.

So what's the future of the batch inference API? I am not sure. Maybe
we can keep it for true _asynchronous_ execution, where you submit
requests and get back a Job instance you can poll for results.
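
For concreteness, a Job-style interface could look roughly like the sketch below; none of these names exist in Llama Stack today, this is purely an illustration of the idea.

```python
# Purely illustrative sketch of an asynchronous, Job-based batch API;
# none of these names are part of Llama Stack today.
from typing import List, Protocol


class Job(Protocol):
    job_id: str

    async def status(self) -> str:
        """Return e.g. 'scheduled', 'running', or 'completed'."""
        ...

    async def results(self) -> List[str]:
        """Return the generations once the job has completed."""
        ...


class AsyncBatchInference(Protocol):
    async def submit_batch_completion(
        self, model_id: str, content_batch: List[str]
    ) -> Job:
        """Enqueue the batch and return immediately with a Job handle."""
        ...
```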

## Test Plan

Run meta-reference-gpu using:
```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
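
For example (the test path and pytest flags here are assumptions; adjust to wherever the batch inference tests actually live):

```bash
# Illustrative invocation; the path and flags are assumptions
pytest -sv tests/integration/inference/test_batch_inference.py \
  --stack-config=http://localhost:8321 \
  --text-model=$INFERENCE_MODEL
```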
| Name | Last commit | Date |
|------|-------------|------|
| `routers` | feat: add batch inference API to llama stack inference (#1945) | 2025-04-12 11:41:12 -07:00 |
| `server` | fix: Use CONDA_DEFAULT_ENV presence as a flag to use conda mode (#1555) | 2025-03-27 17:13:22 -04:00 |
| `store` | fix: handle registry errors gracefully (#1732) | 2025-03-20 15:24:07 -07:00 |
| `ui` | feat: Add a direct (non-agentic) RAG option to the Playground RAG page (#1940) | 2025-04-11 10:16:10 -07:00 |
| `utils` | refactor: move missing tests to test directory (#1892) | 2025-04-08 18:54:00 -07:00 |
| `__init__.py` | API Updates (#73) | 2024-09-17 19:51:35 -07:00 |
| `access_control.py` | feat: make sure agent sessions are under access control (#1737) | 2025-03-21 07:31:16 -07:00 |
| `build.py` | refactor: simplify command execution and remove PTY handling (#1641) | 2025-03-17 15:03:14 -07:00 |
| `build_conda_env.sh` | chore: remove straggler references to llama-models (#1345) | 2025-03-01 14:26:03 -08:00 |
| `build_container.sh` | fix: Add missing gcc in container build. Fixes #1716 (#1727) | 2025-03-20 15:50:56 -04:00 |
| `build_venv.sh` | chore: remove straggler references to llama-models (#1345) | 2025-03-01 14:26:03 -08:00 |
| `client.py` | chore: move all Llama Stack types from llama-models to llama-stack (#1098) | 2025-02-14 09:10:59 -08:00 |
| `common.sh` | fix: Fixing some small issues with the build scripts (#1132) | 2025-02-19 22:20:49 -08:00 |
| `configure.py` | feat: add provider API for listing and inspecting provider info (#1429) | 2025-03-13 15:07:21 -07:00 |
| `datatypes.py` | feat: ability to execute external providers (#1672) | 2025-04-09 10:30:41 +02:00 |
| `distribution.py` | feat: ability to execute external providers (#1672) | 2025-04-09 10:30:41 +02:00 |
| `inspect.py` | chore: deprecate /v1/inspect/providers (#1678) | 2025-03-19 20:27:06 -07:00 |
| `library_client.py` | fix(telemetry): library client does not log span (#1833) | 2025-03-29 14:55:31 -07:00 |
| `providers.py` | fix: add shutdown method for ProviderImpl (#1670) | 2025-03-17 14:55:40 -07:00 |
| `request_headers.py` | feat(server): add attribute based access control for resources (#1703) | 2025-03-19 21:28:52 -07:00 |
| `resolver.py` | feat: ability to execute external providers (#1672) | 2025-04-09 10:30:41 +02:00 |
| `stack.py` | fix: ensure resource registration arguments are typed (#1941) | 2025-04-11 09:25:57 -07:00 |
| `start_stack.sh` | docs: Update docs and fix warning in start-stack.sh (#1937) | 2025-04-11 16:26:17 -07:00 |