# What does this PR do?
remove unused chat_completion implementations
vllm features ported -
- requires max_tokens be set, use config value
- set tool_choice to none if no tools provided
## Test Plan
ci
This is a sweeping change to clean up some gunk around our "Tool"
definitions.
First, we had two types `Tool` and `ToolDef`. The first of these was a
"Resource" type for the registry but we had stopped registering tools
inside the Registry long back (and only registered ToolGroups.) The
latter was for specifying tools for the Agents API. This PR removes the
former and adds an optional `toolgroup_id` field to the latter.
Secondly, as pointed out by @bbrowning in
https://github.com/llamastack/llama-stack/pull/3003#issuecomment-3245270132,
we were doing a lossy conversion from a full JSON schema from the MCP
tool specification into our ToolDefinition to send it to the model.
There is no necessity to do this -- we ourselves aren't doing any
execution at all but merely passing it to the chat completions API which
supports this. By doing this (and by doing it poorly), we encountered
limitations like not supporting array items, or not resolving $refs,
etc.
To fix this, we replaced the `parameters` field by `{ input_schema,
output_schema }` which can be full blown JSON schemas.
Finally, there were some types in our llama-related chat format
conversion which needed some cleanup. We are taking this opportunity to
clean those up.
This PR is a substantial breaking change to the API. However, given our
window for introducing breaking changes, this suits us just fine. I will
be landing a concurrent `llama-stack-client` change as well since API
shapes are changing.
# What does this PR do?
inference/rerank is the one route in the API intended to not be
deprecated. Level it as v1alpha.
Additionally, remove `experimental` and opt to instead use `v1alpha`
which itself implies an experimental state based on the original
proposal
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Mirroring the same changes that was used for inference_store:
https://github.com/llamastack/llama-stack/pull/3383
Will follow up with a shared internal API for managing these write
queues.
## Test Plan
existing tests
# What does this PR do?
APIs removed:
- POST /v1/batch-inference/completion
- POST /v1/batch-inference/chat-completion
- POST /v1/inference/batch-completion
- POST /v1/inference/batch-chat-completion
note -
- batch-completion & batch-chat-completion were only implemented for
inference=inline::meta-reference
- batch-inference were not implemented
the commit to be reverted is an public api behavior change to something
we should not support.
instead of allowing silent updates (the caller cannot see the log
messages), we should be sending an error to the caller that they must
first unregister the model before reusing the same name w/ a different
backend.
# What does this PR do?
When listing (and lazily indexing) tools, it's possible for an error to
get thrown by individual toolgroups if for example an MCP toolgroup is
unable to connect to its `mcp_endpoint`.
This logs a warning in the server when that happens, logs a full stack
trace of the error if debug logging is enabled, and just returns the
list of tools from all working toolgroups instead of throwing an error
to the client when a single toolgroup is temporarily or permanently
misbehaving.
The exception to the above is authentication errors, which we
specifically send all the way back to the client as that's how we
indicate to the client that it needs to provide authentication data for
the remote MCP servers.
Closes#2540
## Test Plan
A new unit test was added to test this exception handling, which is run
as part of our regular test suite but also manually run to specifically
verify this fix via:
```
uv run pytest -sv --asyncio-mode=auto \
tests/unit/distribution/routers/test_routing_tables.py
```
To verify the additional debug logging is printing properly:
```
LLAMA_STACK_LOGGING=core=debug \
uv run pytest -sv --asyncio-mode=auto \
tests/unit/distribution/routers/test_routing_tables.py
```
The mcp integration tests were run as below (and by CI):
```
ollama run llama3.2:3b
ENABLE_OLLAMA="ollama" \
OLLAMA_INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
LLAMA_STACK_CONFIG=starter \
uv run pytest -sv tests/integration/tool_runtime/test_mcp.py \
--text-model meta-llama/Llama-3.2-3B-Instruct
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Sébastien Han <seb@redhat.com>
Co-authored-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
Rather than have a single `LLAMA_STACK_VERSION`, we need to have a
`_V1`, `_V1ALPHA`, and `_V1BETA` constant.
This also necessitated addition of `level` to the `WebMethod` so that
routing can be handeled properly.
For backwards compat, the `v1` routes are being kept around and marked
as `deprecated`. When used, the server will log a deprecation warning.
Deprecation log:
<img width="1224" height="134" alt="Screenshot 2025-09-25 at 2 43 36 PM"
src="https://github.com/user-attachments/assets/0cc7c245-dafc-48f0-be99-269fb9a686f9"
/>
move:
1. post_training to `v1alpha` as it is under heavy development and not
near its final state
2. eval: job scheduling is not implemented. Relies heavily on the
datasetio API which is under development missing implementations of
specific routes indicating the structure of those routes might change.
Additionally eval depends on the `inference` API which is going to be
deprecated, eval will likely need a major API surface change to conform
to using completions properly
implements leveling in #3317
note: integration tests will fail until the SDK is regenerated with
v1alpha/inference as opposed to v1/inference
## Test Plan
existing tests should pass with newly generated schema. Conformance will
also pass as these routes are not the ones we currently test for
stability
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
error:
5099094847
## Test Plan
GITHUB_ACTIONS=true BUILD_PLATFORM=linux/amd64 USE_COPY_NOT_MOUNT=true
LLAMA_STACK_DIR=. uv run --with llama-stack llama stack build --distro
starter --image-type container --image-name ehhuang/distribution-starter
succeeds
- Catch Errors from providers without API keys during model refresh
- Log as warning instead of exception to avoid a scary startup
Closes: #3492
Error message are now warnings instead of several tracebacks
```
INFO 2025-09-19 16:06:55,228 llama_stack.providers.utils.inference.inference_store:74 inference_store: Write queue disabled for SQLite to avoid
concurrency issues
WARNING 2025-09-19 16:06:59,362 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for anthropic: API key
is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"anthropic_api_key": "<API_KEY>"},
or in the provider config.
WARNING 2025-09-19 16:06:59,364 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for gemini: API key is
not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"gemini_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-09-19 16:06:59,367 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for groq: API key is
not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"groq_api_key": "<API_KEY>"}, or in
the provider config.
WARNING 2025-09-19 16:06:59,372 llama_stack.providers.utils.inference.openai_mixin:327 providers::utils: Failed to list models for sambanova: API key
is not set. Please provide a valid API key in the provider data header, e.g. x-llamastack-provider-data: {"sambanova_api_key": "<API_KEY>"},
or in the provider config.
INFO 2025-09-19 16:06:59,533 llama_stack.core.utils.config_resolution:45 core: Using file path:
```
Signed-off-by: Derek Higgins <derekh@redhat.com>
# What does this PR do?
As shown in #3421, we can scale stack to handle more RPS with k8s
replicas. This PR enables multi process stack with uvicorn --workers so
that we can achieve the same scaling without being in k8s.
To achieve that we refactor main to split out the app construction
logic. This method needs to be non-async. We created a new `Stack` class
to house impls and have a `start()` method to be called in lifespan to
start background tasks instead of starting them in the old
`construct_stack`. This way we avoid having to manage an event loop
manually.
## Test Plan
CI
> uv run --with llama-stack python -m llama_stack.core.server.server
benchmarking/k8s-benchmark/stack_run_config.yaml
works.
> LLAMA_STACK_CONFIG=benchmarking/k8s-benchmark/stack_run_config.yaml uv
run uvicorn llama_stack.core.server.server:create_app --port 8321
--workers 4
works.
# What does this PR do?
currently `RemoteProviderSpec` has an `AdapterSpec` embedded in it.
Remove `AdapterSpec`, and put its leftover fields into
`RemoteProviderSpec`.
Additionally, many of the fields were duplicated between
`InlineProviderSpec` and `RemoteProviderSpec`. Move these to
`ProviderSpec` so they are shared.
Fixup the distro codegen to use `RemoteProviderSpec` directly rather
than `remote_provider_spec` which took an AdapterSpec and returned a
full provider spec
## Test Plan
existing distro tests should pass.
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Fixes this warning in llama stack build:
```bash
WARNING 2025-09-15 15:29:02,197 llama_stack.core.distribution:149 core: Failed to import module prompts: No module named
'llama_stack.providers.registry.prompts'"
```
## Test Plan
Test added
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Modified the code in registry.py.
The key changes are:
1. Removed the `return False` statement
2. Added a warning log message that includes the object type,
identifier, and provider_id for better debugging.
3. The method now continues with the registration process instead of
early returning.
---------
Co-authored-by: Omar Abdelwahab <omara@fb.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR provides functionality for users to unregister ScoringFn and
Benchmark resources for `scoring` and `eval` APIs.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Closes#3051
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
Updated integration and unit tests via CI workflow
# What does this PR do?
Adds a write worker queue for writes to inference store. This avoids
overwhelming request processing with slow inference writes.
## Test Plan
Benchmark:
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
## RPS from 21 -> 57
# What does this PR do?
- Use BackgroundLogger when logging metric events.
- Reuse event loop in BackgroundLogger
## Test Plan
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
### RPS from 57 -> 62
# What does this PR do?
Fix fireworks chat completion broken due to telemetry expecting
response.usage
Closes https://github.com/llamastack/llama-stack/issues/3391
## Test Plan
1. `uv run --with llama-stack llama stack build --distro starter
--image-type venv --run`
Try
```
curl -X POST http://0.0.0.0:8321/v1/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
```
{"id":"chatcmpl-ee922a08-0df0-4974-b0d3-b322113e8bc0","choices":[{"message":{"role":"assistant","content":"Hello! How can I assist you today?","name":null,"tool_calls":null},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","created":1757456375,"model":"fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct"}%
```
Without fix fails as mentioned in
https://github.com/llamastack/llama-stack/issues/3391
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
This PR adds support for OpenAI Prompts API.
Note, OpenAI does not explicitly expose the Prompts API but instead
makes it available in the Responses API and in the [Prompts
Dashboard](https://platform.openai.com/docs/guides/prompting#create-a-prompt).
I have added the following APIs:
- CREATE
- GET
- LIST
- UPDATE
- Set Default Version
The Set Default Version API is made available only in the Prompts
Dashboard and configures which prompt version is returned in the GET
(the latest version is the default).
Overall, the expected functionality in Responses will look like this:
```python
from openai import OpenAI
client = OpenAI()
response = client.responses.create(
prompt={
"id": "pmpt_68b0c29740048196bd3a6e6ac3c4d0e20ed9a13f0d15bf5e",
"version": "2",
"variables": {
"city": "San Francisco",
"age": 30,
}
}
)
```
### Resolves https://github.com/llamastack/llama-stack/issues/3276
## Test Plan
Unit tests added. Integration tests can be added after client
generation.
## Next Steps
1. Update Responses API to support Prompt API
2. I'll enhance the UI to implement the Prompt Dashboard.
3. Add cache for lower latency
---------
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
# What does this PR do?
Add Kubernetes authentication provider support
- Add KubernetesAuthProvider class for token validation using Kubernetes
SelfSubjectReview API
- Add KubernetesAuthProviderConfig with configurable API server URL, TLS
settings, and claims mapping
- Implement authentication via POST requests to
/apis/authentication.k8s.io/v1/selfsubjectreviews endpoint
- Add support for parsing Kubernetes SelfSubjectReview response format
to extract user information
- Add KUBERNETES provider type to AuthProviderType enum
- Update create_auth_provider factory function to handle 'kubernetes'
provider type
- Add comprehensive unit tests for KubernetesAuthProvider functionality
- Add documentation with configuration examples and usage instructions
The provider validates tokens by sending SelfSubjectReview requests to
the Kubernetes API server and extracts user information from the
userInfo structure in the response.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
What This Verifies:
Authentication header validation
Token validation with Kubernetes SelfSubjectReview and kubernetes server
API endpoint
Error handling for invalid tokens and HTTP errors
Request payload structure and headers
```
python -m pytest tests/unit/server/test_auth.py -k "kubernetes" -v
```
Signed-off-by: Akram Ben Aissi <akram.benaissi@gmail.com>
# What does this PR do?
This change migrates the VectorDB id generation to Vector Stores.
This is a breaking change for **_some users_** that may have application
code using the `vector_db_id` parameter in the request of the VectorDB
protocol instead of the `VectorDB.identifier` in the response.
By default we will now create a Vector Store every time we register a
VectorDB. The caveat with this approach is that this maps the
`vector_db_id` → `vector_store.name`. This is a reasonable tradeoff to
transition users towards OpenAI Vector Stores.
As an added benefit, registering VectorDBs will result in them appearing
in the VectorStores admin UI.
### Why?
This PR makes the `POST` API call to `/v1/vector-dbs` swap the
`vector_db_id` parameter in the **request body** into the VectorStore's
name field and sets the `vector_db_id` to the generated vector store id
(e.g., `vs_038247dd-4bbb-4dbb-a6be-d5ecfd46cfdb`).
That means that users would have to do something like follows in their
application code:
```python
res = client.vector_dbs.register(
vector_db_id='my-vector-db-id',
embedding_model='ollama/all-minilm:l6-v2',
embedding_dimension=384,
)
vector_db_id = res.identifier
```
And then the rest of their code would behave, including `VectorIO`'s
insert protocol using `vector_db_id` in the request.
An alternative implementation would be to just delete the `vector_db_id`
parameter in `VectorDB` but the end result would still require users
having to write `vector_db_id = res.identifier` since
`VectorStores.create()` generates the ID for you.
So this approach felt the easiest way to migrate users towards
VectorStores (subsequent PRs will be added to trigger `files.create()`
and `vector_stores.files.create()`).
## Test Plan
Unit tests and integration tests have been added.
Signed-off-by: Francisco Javier Arceo <farceo@redhat.com>
What does this PR do?
Fixes error handling when MCP server connections fail. Instead of
returning generic 500 errors, now provides
descriptive error messages with proper HTTP status codes.
Closes#3107
Test Plan
Before fix:
curl -X GET
"http://localhost:8321/v1/tool-runtime/list-tools?tool_group_id=bad-mcp-server"
Returns: {"detail": "Internal server error: An unexpected error
occurred."} (500)
After fix:
curl -X GET
"http://localhost:8321/v1/tool-runtime/list-tools?tool_group_id=bad-mcp-server"
Returns: {"error": {"detail": "Failed to connect to MCP server at
http://localhost:9999/sse: Connection
refused"}} (502)
Tests:
- Added unit test for ConnectionError → 502 translation
- Manually tested with unreachable MCP servers (connection refused)
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR is eliminating hardcoded status codes: `409` CONFLICT and `404`
NOT_FOUND in `server.py` using `httpx` built-in constants. This
implementation will follow the existing structure to improve
readability, extensibility and developer experience. This is already was
implemented in #3131
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
`./scripts/unit-tests.sh`
# What does this PR do?
Sometimes the stream don't have chunks with finish_reason, e.g. canceled
stream, which throws a pydantic error as OpenAIChoice.finish_reason: str
## Test Plan
observe no more such error when benchmarking
Generated with CC:
Replace cryptic KeyError with clear, actionable error message that
shows:
- Which API the failing provider belongs to
- The provider ID and type that's failing
- Which dependency is missing
- Clear instructions on how to fix the issue
## Test plan
Use a run config with Agents API and no safety provider
Before: KeyError: <Api.safety: 'safety'>
After: Failed to resolve 'agents' provider 'meta-reference' of type
'inline::meta-reference': required dependency 'safety' is not available.
Please add a 'safety' provider to your configuration or check if the
provider is properly configured.
# What does this PR do?
BFCL scoring function is not supported, removing it.
Also minor fixes as the llama stack run is broken for open-benchmark for
test plan verification
1. Correct the model paths for supported models
2. Fix another issue as there is no `provider_id` for DatasetInput but
logger assumes it exists.
```
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 332, in construct_stack
await register_resources(run_config, impls)
File "/Users/swapna942/llama-stack/llama_stack/core/stack.py", line 108, in register_resources
logger.debug(f"registering {rsrc.capitalize()} {obj} for provider {obj.provider_id}")
^^^^^^^^^^^^^^^
File "/Users/swapna942/llama-stack/.venv/lib/python3.13/site-packages/pydantic/main.py", line 991, in __getattr__
raise AttributeError(f'{type(self).__name__!r} object has no attribute {item!r}')
AttributeError: 'DatasetInput' object has no attribute 'provider_id'
```
## Test Plan
```llama stack build --distro open-benchmark --image-type venv``` and run the server succeeds
Issue Link: https://github.com/llamastack/llama-stack/issues/3282
# What does this PR do?
During env var replacement, we're implicitly converting all config types
to their apparent types (e.g., "true" to True, "123" to 123). This may
be arguably useful for when doing an env var substitution, as those are
always strings, but we should definitely avoid touching config values
that have explicit types and are uninvolved in env var substitution.
## Test Plan
Unit
The `trl` dependency brings in `accelerate` which brings in nvidia
dependencies for torch. We cannot have that in the starter distro. As
such, no CPU-only post-training for the huggingface provider.
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR renames categories of llama_stack loggers.
This PR aligns logging categories as per the package name, as well as
reviews from initial
https://github.com/meta-llama/llama-stack/pull/2868. This is a follow up
to #3061.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
Replaces https://github.com/meta-llama/llama-stack/pull/2868
Part of https://github.com/meta-llama/llama-stack/issues/2865
cc @leseb @rhuss
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
This PR removes `init()` from `LlamaStackAsLibrary`
Currently client.initialize() had to be invoked by user.
To improve dev experience and to avoid runtime errors, this PR init
LlamaStackAsLibrary implicitly upon using the client.
It prevents also multiple init of the same client, while maintaining
backward ccompatibility.
This PR does the following
- Automatic Initialization: Constructor calls initialize_impl()
automatically.
- Client is fully initialized after __init__ completes.
- Prevents consecutive initialization after the client has been
successfully initialized.
- initialize() method still exists but is now a no-op.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
fixes https://github.com/meta-llama/llama-stack/issues/2946
---------
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Adds flexible CORS (Cross-Origin Resource Sharing) configuration support
to the FastAPI
server with both local development and explicit configuration modes:
- **Local development mode**: `cors: true` enables localhost-only access
with regex
pattern `https?://localhost:\d+`
- **Explicit configuration mode**: Specific origins configuration with
credential support
and validation
- Prevents insecure combinations (wildcards with credentials)
- FastAPI CORSMiddleware integration via `model_dump()`
Addresses the need for configurable CORS policies to support web
frontends and
cross-origin API access while maintaining security.
Closes#2119
## Test Plan
1. Ran Unit Tests.
2. Manual tests: FastAPI middleware integration with actual HTTP
requests
- Local development mode localhost access validation
- Explicit configuration mode origins validation
- Preflight OPTIONS request handling
Some screenshots of manual tests.
<img width="1920" height="927" alt="image"
src="https://github.com/user-attachments/assets/79322338-40c7-45c9-a9ea-e3e8d8e2f849"
/>
<img width="1911" height="1037" alt="image"
src="https://github.com/user-attachments/assets/1683524e-b0c9-48c9-a0a5-782e949cde01"
/>
cc: @leseb @rhuss @franciscojavierarceo
# What does this PR do?
This PR adds a step in pre-commit to enforce using `llama_stack` logger.
Currently, various parts of the code base uses different loggers. As a
custom `llama_stack` logger exist and used in the codebase, it is better
to standardize its utilization.
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
I started this PR trying to unbreak a newly broken test
`test_agent_name`. This test was broken all along but did not show up
because during testing we were pulling the "non-updated" llama stack
client. See this comment:
https://github.com/llamastack/llama-stack/pull/3119#discussion_r2270988205
While fixing this, I encountered a large amount of badness in our CI
workflow definitions.
- We weren't passing `LLAMA_STACK_DIR` or `LLAMA_STACK_CLIENT_DIR`
overrides to `llama stack build` at all in some cases.
- Even when we did, we used `uv run` liberally. The first thing `uv run`
does is "syncs" the project environment. This means, it is going to undo
any mutations we might have done ourselves. But we make many mutations
in our CI runners to these environments. The most important of which is
why `llama stack build` where we install distro dependencies. As a
result, when you tried to run the integration tests, you would see old,
strange versions.
## Test Plan
Re-record using:
```
sh scripts/integration-tests.sh --stack-config ci-tests \
--provider ollama --test-pattern test_agent_name --inference-mode record
```
Then re-run with `--inference-mode replay`. But:
Eventually, this test turned out to be quite flaky for telemetry
reasons. I haven't investigated it for now and just disabled it sadly
since we have a release to push out.
# What does this PR do?
Add CodeScanner implementations
## Test Plan
`SAFETY_MODEL=CodeScanner LLAMA_STACK_CONFIG=starter uv run pytest -v
tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
This PR need to land after this
https://github.com/meta-llama/llama-stack/pull/3098
# What does this PR do?
<!-- Provide a short summary of what this PR does and why. Link to
relevant issues if applicable. -->
I noticed somehow
[build_conda_env.sh](https://github.com/llamastack/llama-stack/blob/main/llama_stack/core/build_conda_env.sh)
exists in main branch. We need to kill it to be consistent with
[#2969](https://github.com/llamastack/llama-stack/pull/2969)
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
<!-- Describe the tests you ran to verify your changes with result
summaries. *Provide clear instructions so the plan can be easily
re-executed.* -->
# What does this PR do?
To be compliant with model policies for LLAMA, just return the
categories as is from provider, we will lose the OAI compat in
moderations api response.
<!-- If resolving an issue, uncomment and update the line below -->
<!-- Closes #[issue-number] -->
## Test Plan
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`