# What does this PR do?
Adds a write worker queue for writes to inference store. This avoids
overwhelming request processing with slow inference writes.
## Test Plan
Benchmark:
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
## RPS from 21 -> 57
# What does this PR do?
This PR adds support for OpenAI Prompts API.
Note, OpenAI does not explicitly expose the Prompts API but instead
makes it available in the Responses API and in the [Prompts
Dashboard](https://platform.openai.com/docs/guides/prompting#create-a-prompt).
I have added the following APIs:
- CREATE
- GET
- LIST
- UPDATE
- Set Default Version
The Set Default Version API is made available only in the Prompts
Dashboard and configures which prompt version is returned in the GET
(the latest version is the default).
Overall, the expected functionality in Responses will look like this:
```python
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
prompt={
"id": "pmpt_68b0c29740048196bd3a6e6ac3c4d0e20ed9a13f0d15bf5e",
"version": "2",
"variables": {
"city": "San Francisco",
"age": 30,
}
}
)
```
### Resolves https://github.com/llamastack/llama-stack/issues/3276
## Test Plan
Unit tests added. Integration tests can be added after client
generation.
## Next Steps
1. Update Responses API to support Prompt API
2. I'll enhance the UI to implement the Prompt Dashboard.
3. Add cache for lower latency
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Add Kubernetes authentication provider support
- Add KubernetesAuthProvider class for token validation using Kubernetes
SelfSubjectReview API
- Add KubernetesAuthProviderConfig with configurable API server URL, TLS
settings, and claims mapping
- Implement authentication via POST requests to
/apis/authentication.k8s.io/v1/selfsubjectreviews endpoint
- Add support for parsing Kubernetes SelfSubjectReview response format
to extract user information
- Add KUBERNETES provider type to AuthProviderType enum
- Update create_auth_provider factory function to handle 'kubernetes'
provider type
- Add comprehensive unit tests for KubernetesAuthProvider functionality
- Add documentation with configuration examples and usage instructions
The provider validates tokens by sending SelfSubjectReview requests to
the Kubernetes API server and extracts user information from the
userInfo structure in the response.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
What This Verifies:
Authentication header validation
Token validation with Kubernetes SelfSubjectReview and kubernetes server
API endpoint
Error handling for invalid tokens and HTTP errors
Request payload structure and headers
```
python -m pytest tests/unit/server/test_auth.py -k "kubernetes" -v
```
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
# What does this PR do?
Improved bedrock provider config to read from environment variables like
AWS_ACCESS_KEY_ID. Updated all
fields to use default_factory with lambda patterns like the nvidia
provider does.
Now the environment variables work as documented.
Closes#3305
## Test Plan
Ran the new bedrock config tests:
```bash
python -m pytest tests/unit/providers/inference/bedrock/test_config.py
-v
Verified existing provider tests still work:
python -m pytest tests/unit/providers/test_configs.py -v
One needed to specify record-replay related environment variables for
running integration tests. We could not use defaults because integration
tests could be run against Ollama instances which could be running
different models. For example, text vs vision tests needed separate
instances of Ollama because a single instance typically cannot serve
both of these models if you assume the standard CI worker configuration
on Github. As a result, `client.list()` as returned by the Ollama client
would be different between these runs and we'd end up overwriting
responses.
This PR "solves" it by adding a small amount of complexity -- we store
model list responses specially, keyed by the hashes of the models they
return. At replay time, we merge all of them and pretend that we have
the union of all models available.
## Test Plan
Re-recorded all the tests using `scripts/integration-tests.sh
--inference-mode record`, including the vision tests.
# What does this PR do?
BFCL scoring function is not supported, removing it.
Also minor fixes as the llama stack run is broken for open-benchmark for
test plan verification
1. Correct the model paths for supported models
2. Fix another issue as there is no `provider_id` for DatasetInput but
logger assumes it exists.
```
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 332, in construct_stack
await register_resources(run_config, impls)
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 108, in register_resources
logger.debug(f"registering {rsrc.capitalize()} {obj} for provider {obj.provider_id}")
^^^^^^^^^^^^^^^
File "/Users/swapna942/llama-stack/.venv/lib/python3.13/site-packages/pydantic/main.py", line 991, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'DatasetInput' object has no attribute 'provider_id'
```
## Test Plan
```llama stack build --distro open-benchmark --image-type venv``` and run the server succeeds
Issue Link: https://github.com/llamastack/llama-stack/issues/3282
**Description:**
Adding information and guidelines on when contributors should create an
in-tree vs out-of-tree provider.
Im still learning a bit about this subject so Im very open to feedback
on this PR
Will also add this section to the API Providers section of the docs
# What does this PR do?
Finding these issues while moving to github pages.
## Test Plan
uv run --group docs sphinx-autobuild docs/source docs/build/html
--write-all
# What does this PR do?
the post training docs are missing references to the more indepth
`huggingface.md` and `torchtune.md` which explain how to actually use
the providers.
These files show up in search though.
Add references to these files into the `inline_..md` files currently
pointed to by `index.md`
Signed-off-by: Charlie Doern <cdoern@redhat.com>
The `trl` dependency brings in `accelerate` which brings in nvidia
dependencies for torch. We cannot have that in the starter distro. As
such, no CPU-only post-training for the huggingface provider.
# What does this PR do?
Add LLAMAStack + Langchain integration example notebook
## Test Plan
Ran in Jupyter notebook, works end to end.
(Used Claude mainly for documentation and coding/debugging help)
The starter distribution added post-training which added torch
dependencies which pulls in all the nvidia CUDA libraries. This made our
starter container very big. We have worked hard to keep the starter
container small so it serves its purpose as a starter. This PR tries to
get it back to its size by forking off duplicate "-gpu" providers for
post-training. These forked providers are then used for a new
`starter-gpu` distribution which can pull in all dependencies.
# What does this PR do?
Context: https://github.com/meta-llama/llama-stack/issues/2937
The API design is inspired by existing offerings, but not exactly the
same:
* `top_n` as the parameter to control number of results, instead of
`top_k`, since `n` is conventional to control number
* `truncation` bool instead of `max_token_per_doc`, since we should just
handle the truncation automatically depending on model capability,
instead of user setting the context length manually.
* `data` field in the response, to be consistent with other OpenAI APIs
(though they don't have a rerank API). Also, it is one less name to
learn in the API.
## Test Plan
Co-authored-by: Ashwin Bharambe <ashwin.bharambe@gmail.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR removes `init()` from `LlamaStackAsLibrary`
Currently client.initialize() had to be invoked by user.
To improve dev experience and to avoid runtime errors, this PR init
LlamaStackAsLibrary implicitly upon using the client.
It prevents also multiple init of the same client, while maintaining
backward ccompatibility.
This PR does the following
- Automatic Initialization: Constructor calls initialize_impl()
automatically.
- Client is fully initialized after __init__ completes.
- Prevents consecutive initialization after the client has been
successfully initialized.
- initialize() method still exists but is now a no-op.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
fixes https://github.com/meta-llama/llama-stack/issues/2946
---------
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Adds flexible CORS (Cross-Origin Resource Sharing) configuration support
to the FastAPI
server with both local development and explicit configuration modes:
- **Local development mode**: `cors: true` enables localhost-only access
with regex
pattern `https?://localhost:\d+`
- **Explicit configuration mode**: Specific origins configuration with
credential support
and validation
- Prevents insecure combinations (wildcards with credentials)
- FastAPI CORSMiddleware integration via `model_dump()`
Addresses the need for configurable CORS policies to support web
frontends and
cross-origin API access while maintaining security.
Closes#2119
## Test Plan
1. Ran Unit Tests.
2. Manual tests: FastAPI middleware integration with actual HTTP
requests
- Local development mode localhost access validation
- Explicit configuration mode origins validation
- Preflight OPTIONS request handling
Some screenshots of manual tests.
<img width="1920" height="927" alt="image"
src="https://github.com/user-attachments/assets/79322338-40c7-45c9-a9ea-e3e8d8e2f849"
/>
<img width="1911" height="1037" alt="image"
src="https://github.com/user-attachments/assets/1683524e-b0c9-48c9-a0a5-782e949cde01"
/>
cc: @leseb @rhuss @franciscojavierarceo
# What does this PR do?
Small docs change as requested in
https://github.com/llamastack/llama-stack/pull/3160#pullrequestreview-3125038932
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
Creates a structured testing documentation section with multiple detailed pages:
- Testing overview explaining the record-replay architecture
- Integration testing guide with practical usage examples
- Record-replay system technical documentation
- Guide for writing effective tests
- Troubleshooting guide for common testing issues
Hopefully this makes things a bit easier.
# What does this PR do?
Refactors the OpenAI responses implementation by extracting streaming and tool execution logic into separate modules. This improves code organization by:
1. Creating a new `StreamingResponseOrchestrator` class in `streaming.py` to handle the streaming response generation logic
2. Moving tool execution functionality to a dedicated `ToolExecutor` class in `tool_executor.py`
## Test Plan
Existing tests
# What does this PR do?
Adds content part streaming events to the OpenAI-compatible Responses API to support more granular streaming of response content. This introduces:
1. New schema types for content parts: `OpenAIResponseContentPart` with variants for text output and refusals
2. New streaming event types:
- `OpenAIResponseObjectStreamResponseContentPartAdded` for when content parts begin
- `OpenAIResponseObjectStreamResponseContentPartDone` for when content parts complete
3. Implementation in the reference provider to emit these events during streaming responses. Also emits MCP arguments just like function call ones.
## Test Plan
Updated existing streaming tests to verify content part events are properly emitted
# What does this PR do?
To be compliant with model policies for LLAMA, just return the
categories as is from provider, we will lose the OAI compat in
moderations api response.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
**Description:**
The standard markdown [!NOTE] format is not supported on Sphinx
generated documentation, replacing those instances. Also updating other
Notes, Tips and Warning blocks throughout the source docs
WIP: Working to update the provider code gen
Well our Responses tests use it so we better include it in the API, no?
I discovered it because I want to make sure `llama-stack-client` can be
used always instead of `openai-python` as the client (we do want to be
_truly_ compatible.)
# What does this PR do?
the minimum python version for the project was bumped to 3.12 a couple
months ago, but there remains some artifacts in the repo suggesting we
support >=3.10
Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
# What does this PR do?
- Adds documentation on how to contribute a Vector DB provider.
- Updates the testing section to be a little friendlier to navigate.
- Also added new shortcut for search so that `/` and `⌘ K` or `ctrl+K`
trigger search
<img width="1903" height="1346" alt="Screenshot 2025-08-11 at 10 10
12 AM"
src="https://github.com/user-attachments/assets/6995b3b8-a2ab-4200-be72-c5b03a784a29"
/>
<img width="1915" height="1438" alt="Screenshot 2025-08-11 at 10 10
25 AM"
src="https://github.com/user-attachments/assets/1f54d30e-5be1-4f27-b1e9-3c3537dcb8e9"
/>
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
- Add new Vertex AI remote inference provider with litellm integration
- Support for Gemini models through Google Cloud Vertex AI platform
- Uses Google Cloud Application Default Credentials (ADC) for
authentication
- Added VertexAI models: gemini-2.5-flash, gemini-2.5-pro,
gemini-2.0-flash.
- Updated provider registry to include vertexai provider
- Updated starter template to support Vertex AI configuration
- Added comprehensive documentation and sample configuration
<!-- If resolving an issue, uncomment and update the line below -->
relates to https://github.com/meta-llama/llama-stack/issues/2747
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Eran Cohen <eranco@redhat.com>
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Update Milvus doc on using search modes.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Signed-off-by: Varsha Prasad Narsing <varshaprasad96@gmail.com>
# What does this PR do?
`AgentEventLogger` only supports streaming responses, so I suggest
adding a comment near the bottom of `demo_script.py` letting the user
know this, e.g., if they change the `stream` value to `False` in the
call to `create_turn`, they need to comment out the logging lines.
See https://github.com/llamastack/llama-stack-client-python/issues/15
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
---------
Signed-off-by: Dean Wampler <dean.wampler@ibm.com>
# What does this PR do?
This PR adds Open AI Compatible moderations api. Currently only
implementing for llama guard safety provider
Image support, expand to other safety providers and Deprecation of
run_shield will be next steps.
## Test Plan
Added 2 new tests for safe/ unsafe text prompt examples for the new open
ai compatible moderations api usage
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
(Had some issue with previous PR
https://github.com/meta-llama/llama-stack/pull/2994 while updating and
accidentally close it , reopened new one )
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
Remove pure venv (without uv) references in docs
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->