# What does this PR do?
remove unused chat_completion implementations
vLLM features ported:
- require `max_tokens` to be set, falling back to the config value
- set `tool_choice` to `none` if no tools are provided
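As a rough illustration, a hedged sketch of this request-preparation behavior (the function and parameter names here are illustrative, not the actual provider code):
```python
# Hypothetical sketch of the behavior described above; names are illustrative.
def prepare_vllm_params(params: dict, config_max_tokens: int) -> dict:
    # vLLM requires max_tokens: fall back to the configured default when unset.
    if params.get("max_tokens") is None:
        params["max_tokens"] = config_max_tokens
    # If the caller supplied no tools, explicitly disable tool selection.
    if not params.get("tools"):
        params["tool_choice"] = "none"
    return params
```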
## Test Plan
CI
This is a sweeping change to clean up some gunk around our "Tool"
definitions.
First, we had two types, `Tool` and `ToolDef`. The former was a
"Resource" type for the registry, but we stopped registering tools
inside the Registry long ago (and only register ToolGroups). The
latter was for specifying tools for the Agents API. This PR removes the
former and adds an optional `toolgroup_id` field to the latter.
Secondly, as pointed out by @bbrowning in
https://github.com/llamastack/llama-stack/pull/3003#issuecomment-3245270132,
we were doing a lossy conversion from the full JSON schema in the MCP
tool specification into our ToolDefinition before sending it to the model.
There is no need to do this -- we aren't doing any execution ourselves
but merely passing it to the chat completions API, which supports full
schemas. By doing this conversion (and doing it poorly), we ran into
limitations like not supporting array items, not resolving $refs,
etc.
To fix this, we replaced the `parameters` field with `{ input_schema,
output_schema }`, both of which can be full-blown JSON schemas.
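For illustration, a hedged sketch of what a tool definition might look like under the new shape (field names follow the description above; the exact constructor and types may differ from the actual code):
```python
# Illustrative only: a ToolDef-style definition carrying full JSON Schemas.
tool_def = {
    "name": "get_weather",
    "description": "Fetch the current weather for a list of locations.",
    "toolgroup_id": "mcp::weather",   # optional, per the new field
    "input_schema": {                 # full JSON Schema; arrays and $refs are allowed
        "type": "object",
        "properties": {
            "locations": {
                "type": "array",
                "items": {"type": "string"},
            },
        },
        "required": ["locations"],
    },
    "output_schema": {
        "type": "object",
        "properties": {"temperature_c": {"type": "number"}},
    },
}
```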
Finally, some types in our llama-related chat format conversion needed
cleanup; we took this opportunity to tidy them up.
This PR is a substantial breaking change to the API. However, given our
window for introducing breaking changes, this suits us just fine. I will
be landing a concurrent `llama-stack-client` change as well since API
shapes are changing.
# What does this PR do?
APIs removed:
- POST /v1/batch-inference/completion
- POST /v1/batch-inference/chat-completion
- POST /v1/inference/batch-completion
- POST /v1/inference/batch-chat-completion
Note:
- batch-completion & batch-chat-completion were only implemented for
inference=inline::meta-reference
- the batch-inference endpoints were never implemented
# What does this PR do?
Adds a write worker queue for writes to the inference store. This avoids
overwhelming request processing with slow inference-store writes.
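A minimal sketch of the idea, assuming an asyncio queue drained by a background worker so request handling never blocks on store writes; the actual queue and store interfaces in the codebase may differ:
```python
import asyncio

# Illustrative sketch: enqueue inference-store writes and drain them in the background.
# Must be constructed from within a running event loop.
class WriteQueue:
    def __init__(self, store, max_size: int = 10_000):
        self._store = store
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_size)
        self._worker = asyncio.create_task(self._run())

    async def enqueue(self, record) -> None:
        # Request path: returns as soon as the record is queued.
        await self._queue.put(record)

    async def _run(self) -> None:
        while True:
            record = await self._queue.get()
            try:
                # The slow write happens off the request path.
                await self._store.write(record)
            finally:
                self._queue.task_done()
```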
## Test Plan
Benchmark:
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
## RPS from 21 -> 57
# What does this PR do?
- Use BackgroundLogger when logging metric events.
- Reuse event loop in BackgroundLogger
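Roughly, the pattern looks like the sketch below: a single long-lived event loop in a background thread consuming queued events. Names are illustrative rather than the actual BackgroundLogger implementation:
```python
import asyncio
import queue
import threading

# Illustrative sketch: one background thread with a single, reused event loop.
class BackgroundLogger:
    def __init__(self):
        self._events: queue.Queue = queue.Queue()
        self._loop = asyncio.new_event_loop()  # created once and reused for every event
        threading.Thread(target=self._run, daemon=True).start()

    def log_event(self, event) -> None:
        # Callers never block on the actual logging I/O.
        self._events.put(event)

    def _run(self) -> None:
        asyncio.set_event_loop(self._loop)
        while True:
            event = self._events.get()
            self._loop.run_until_complete(self._emit(event))

    async def _emit(self, event) -> None:
        ...  # write to the configured sinks (console, OTLP, etc.)
```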
## Test Plan
```
cd /docs/source/distributions/k8s-benchmark
# start mock server
python openai-mock-server.py --port 8000
# start stack server
LLAMA_STACK_LOGGING="all=WARNING" uv run --with llama-stack python -m llama_stack.core.server.server docs/source/distributions/k8s-benchmark/stack_run_config.yaml
# run benchmark script
uv run python3 benchmark.py --duration 120 --concurrent 50 --base-url=http://localhost:8321/v1/openai/v1 --model=vllm-inference/meta-llama/Llama-3.2-3B-Instruct
```
### RPS from 57 -> 62
# What does this PR do?
Fix Fireworks chat completion, which was broken because telemetry expected
`response.usage` to be present.
Closes https://github.com/llamastack/llama-stack/issues/3391
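The failure mode is telemetry assuming `response.usage` is always populated. A hedged sketch of the kind of guard that addresses it (not necessarily the exact change in this PR; `record_token_metrics` is a hypothetical helper):
```python
# Illustrative: only emit token metrics when the provider actually returned usage.
usage = getattr(response, "usage", None)
if usage is not None:
    record_token_metrics(                      # hypothetical helper name
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        total_tokens=usage.total_tokens,
    )
```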
## Test Plan
1. `uv run --with llama-stack llama stack build --distro starter
--image-type venv --run`
Try:
```
curl -X POST http://0.0.0.0:8321/v1/openai/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
```
{"id":"chatcmpl-ee922a08-0df0-4974-b0d3-b322113e8bc0","choices":[{"message":{"role":"assistant","content":"Hello! How can I assist you today?","name":null,"tool_calls":null},"finish_reason":"stop","index":0,"logprobs":null}],"object":"chat.completion","created":1757456375,"model":"fireworks/accounts/fireworks/models/llama-v3p1-8b-instruct"}%
```
Without the fix, this fails as described in
https://github.com/llamastack/llama-stack/issues/3391
Co-authored-by: Francisco Arceo <arceofrancisco@gmail.com>
# What does this PR do?
Sometimes the stream doesn't include a chunk with `finish_reason` (e.g. a
canceled stream), which raises a Pydantic error because
`OpenAIChoice.finish_reason` is typed as `str`.
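One plausible way to tolerate such chunks, sketched below under that assumption (the actual fix may differ), is to make the field optional on the Pydantic model:
```python
from pydantic import BaseModel

# Illustrative sketch: allow choices without a finish_reason (e.g. canceled streams).
class OpenAIChoice(BaseModel):
    index: int
    finish_reason: str | None = None  # previously a required str
```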
## Test Plan
Observed that the error no longer occurs when benchmarking.
# What does this PR do?
This PR renames the categories of llama_stack loggers, aligning logging
categories with package names and addressing reviews from the initial
https://github.com/meta-llama/llama-stack/pull/2868. This is a follow-up
to #3061.
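For context, loggers in the tree are obtained with a category, and after this change the category is expected to mirror the package the logger lives in. A hedged sketch; the helper name, module path, and category string are assumptions for illustration:
```python
# Illustrative: category now mirrors the package the logger lives in.
from llama_stack.log import get_logger  # helper name/module assumed for illustration

logger = get_logger(name=__name__, category="inference")
logger.info("starting inference request")
```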
Replaces https://github.com/meta-llama/llama-stack/pull/2868
Part of https://github.com/meta-llama/llama-stack/issues/2865
cc @leseb @rhuss
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
Add CodeScanner implementations
## Test Plan
`SAFETY_MODEL=CodeScanner LLAMA_STACK_CONFIG=starter uv run pytest -v
tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
This PR needs to land after
https://github.com/meta-llama/llama-stack/pull/3098
# What does this PR do?
To be compliant with Llama model policies, just return the categories as-is
from the provider; as a result, we lose OpenAI compatibility in the
moderations API response.
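Concretely, the response now carries the provider's own category labels (Llama Guard hazard codes such as S1, "Violent Crimes") instead of remapping them onto OpenAI's fixed category names. A hedged illustration of the shape, not the exact response model:
```python
# Illustrative only: categories come straight from Llama Guard rather than OpenAI's taxonomy.
moderation_result = {
    "flagged": True,
    "categories": {"S1": True},  # Llama Guard code (S1 = Violent Crimes)
    # previously this would have been remapped to OpenAI names like "violence"
}
```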
## Test Plan
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
Some fixes to MCP tests, and a bunch of fixes for Vector providers.
I also enabled a bunch of Vector IO tests to be used with
`LlamaStackLibraryClient`
## Test Plan
Run Responses tests with llama stack library client:
```
pytest -s -v tests/integration/non_ci/responses/ --stack-config=server:starter \
--text-model openai/gpt-4o \
--embedding-model=sentence-transformers/all-MiniLM-L6-v2 \
-k "client_with_models"
```
Do the same with `-k openai_client`
The rest should be taken care of by CI.
# What does this PR do?
This PR adds an OpenAI-compatible moderations API, currently implemented
only for the Llama Guard safety provider.
Image support, expansion to other safety providers, and deprecation of
run_shield are next steps.
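A hedged usage sketch against the new endpoint via the OpenAI client; the base URL and model identifier below are assumptions, mirroring the test setups used elsewhere in this document:
```python
from openai import OpenAI

# Illustrative: point the OpenAI client at the Llama Stack server's OpenAI-compatible routes.
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

result = client.moderations.create(
    model="llama-guard3:8b",  # safety model identifier; assumed for illustration
    input="How can I hack into someone's computer?",
)
print(result.results[0].flagged)
```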
## Test Plan
Added two new tests with safe and unsafe text prompt examples for the new
OpenAI-compatible moderations API.
`SAFETY_MODEL=llama-guard3:8b LLAMA_STACK_CONFIG=starter uv run pytest
-v tests/integration/safety/test_safety.py
--text-model=llama3.2:3b-instruct-fp16
--embedding-model=all-MiniLM-L6-v2 --safety-shield=ollama`
(I ran into an issue with the previous PR
https://github.com/meta-llama/llama-stack/pull/2994 while updating and
accidentally closed it, so I reopened this one.)
# What does this PR do?
I found a few issues while adding new metrics for various APIs:
currently, metrics are only propagated in `chat_completion` and
`completion`. Since most providers use the `openai_..` routes as the
default in `llama-stack-client inference chat-completion`, metrics are
currently not working as expected.
In order to get them working, the following had to be done:
1. get the completion as usual
2. use the new `openai_` versions of the metric-gathering functions, which
use `.usage` from the `OpenAI..` response types to gather the metrics
that are already populated
3. define a `stream_generator` which counts the tokens and computes the
metrics (only for stream=True); a sketch follows after the note below
4. add the metrics to the response
NOTE: I could not add metrics to `openai_completion` where stream=True
because that ONLY returns an `OpenAICompletion`, not an AsyncGenerator
that we can manipulate.
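A hedged sketch of the streaming case from step 3; the token counting here is deliberately simplified, and `log_token_metrics` is a hypothetical helper rather than the actual function:
```python
# Illustrative sketch: wrap the provider stream, count tokens, then attach metrics at the end.
async def stream_generator(stream):
    completion_tokens = 0
    async for chunk in stream:
        for choice in chunk.choices:
            if choice.delta and choice.delta.content:
                completion_tokens += 1  # simplified: one "token" per content delta
        yield chunk
    # After the stream is exhausted, emit the computed metrics.
    await log_token_metrics(completion_tokens=completion_tokens)  # hypothetical helper
```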
Acquire the lock, and add the event to the span, as the other `_log_...`
methods do.
Some new output:
`llama-stack-client inference chat-completion --message hi`
<img width="2416" height="425" alt="Screenshot 2025-07-16 at 8 28 20 AM"
src="https://github.com/user-attachments/assets/ccdf1643-a184-4ddd-9641-d426c4d51326"
/>
and in the client:
<img width="763" height="319" alt="Screenshot 2025-07-16 at 8 28 32 AM"
src="https://github.com/user-attachments/assets/6bceb811-5201-47e9-9e16-8130f0d60007"
/>
These were not previously being recorded, nor were they printed to the
server, due to improper console sink handling.
---------
Signed-off-by: Charlie Doern <cdoern@redhat.com>
# What does this PR do?
Extend the Shields Protocol and implement the capability to unregister
previously registered shields, plus a CLI for shields management.
Closes #2581
## Test Plan
First off, test the API for shields:
1. Install and start Ollama:
`ollama serve`
2. Pull Llama Guard Model in Ollama:
`ollama pull llama-guard3:8b`
3. Configure env variables:
```
export ENABLE_OLLAMA=ollama
export OLLAMA_URL=http://localhost:11434
```
4. Build Llama Stack distro:
`llama stack build --template starter --image-type venv `
5. Start Llama Stack server:
`llama stack run starter --port 8321`
6. Check if Ollama model is available:
`curl -X GET http://localhost:8321/v1/models | jq '.data[] |
select(.provider_id=="ollama")'`
7. Register a new Shield using Ollama provider:
```
curl -X POST http://localhost:8321/v1/shields \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"provider_id": "llama-guard",
"provider_shield_id": "ollama/llama-guard3:8b",
"params": {}
}'
```
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
8. Check if shield was registered:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"identifier":"test-shield","provider_resource_id":"ollama/llama-guard3:8b","provider_id":"llama-guard","type":"shield","owner":{"principal":"","attributes":{}},"params":{}}%
`
9. Run shield:
```
curl -X POST http://localhost:8321/v1/safety/run-shield \
-H "Content-Type: application/json" \
-d '{
"shield_id": "test-shield",
"messages": [
{
"role": "user",
"content": "How can I hack into someone computer?"
}
],
"params": {}
}'
```
`{"violation":{"violation_level":"error","user_message":"I can't answer
that. Can I help with something
else?","metadata":{"violation_type":"S2"}}}% `
10. Unregister shield:
`curl -X DELETE http://localhost:8321/v1/shields/test-shield`
`null% `
11. Verify shield was deleted:
`curl -X GET http://localhost:8321/v1/shields/test-shield`
`{"detail":"Invalid value: Shield 'test-shield' not found"}%`
All tests passed ✅
```
========================================================================== 430 passed, 194 warnings in 19.54s ==========================================================================
/Users/iamiller/GitHub/llama-stack/.venv/lib/python3.12/site-packages/litellm/llms/custom_httpx/async_client_cleanup.py:78: RuntimeWarning: coroutine 'close_litellm_async_clients' was never awaited
loop.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
Wrote HTML report to htmlcov-3.12/index.html
```