Prevent sensitive information from being logged in telemetry output by
assigning the SecretStr type to sensitive fields. API keys and passwords
from the KV store are now covered. All providers have been converted.
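As a minimal sketch of the pattern (the config class and field names below are illustrative, not a specific provider's config):
```
from pydantic import BaseModel, Field, SecretStr


class ExampleProviderConfig(BaseModel):
    url: str = "https://api.example.com/v1"
    # SecretStr renders as '**********' in repr()/str(), so dumping the config
    # into logs or telemetry does not leak the raw value.
    api_key: SecretStr | None = Field(default=None)


config = ExampleProviderConfig(api_key=SecretStr("super-secret"))
print(config)                             # api_key=SecretStr('**********')
print(config.api_key.get_secret_value())  # explicit opt-in to read the raw value
```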
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
APIs removed:
- POST /v1/batch-inference/completion
- POST /v1/batch-inference/chat-completion
- POST /v1/inference/batch-completion
- POST /v1/inference/batch-chat-completion
Note:
- batch-completion & batch-chat-completion were only implemented for
inference=inline::meta-reference
- the batch-inference endpoints were never implemented
# What does this PR do?
- remove auto-download of ollama embedding models
- add embedding model metadata to dynamic listing w/ a unit test
- add support and tests for allowed_models
- remove inference provider models.py files where dynamic listing is
enabled
- store embedding metadata in an embedding_model_metadata field on
inference providers (see the sketch after this list)
- make model_entries optional on ModelRegistryHelper and
LiteLLMOpenAIMixin
- make OpenAIMixin a ModelRegistryHelper
- skip the base64 embedding test for remote::ollama, which always
returns floats
- only use the OpenAI client for ollama model listing
- remove the unused build_model_entry function
- remove the unused get_huggingface_repo function
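A rough sketch of the embedding-metadata idea (the class shape and field values below are illustrative assumptions, not the exact implementation):
```
# Hypothetical adapter: instead of a static models.py file, embedding metadata
# lives on the adapter and is merged into the dynamically listed models.
class ExampleInferenceAdapter:
    # model id -> metadata needed to register the model as an embedding model
    embedding_model_metadata: dict[str, dict[str, int]] = {
        "all-minilm:l6-v2": {"embedding_dimension": 384, "context_length": 512},
    }

    async def list_models(self, remote_ids: list[str]) -> list[dict]:
        models = []
        for model_id in remote_ids:
            metadata = self.embedding_model_metadata.get(model_id, {})
            model_type = "embedding" if metadata else "llm"
            models.append({"identifier": model_id, "model_type": model_type, "metadata": metadata})
        return models
```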
## Test Plan
ci w/ new tests
# What does this PR do?
This PR renames the categories of llama_stack loggers, aligning the
logging categories with package names and addressing review feedback on
the initial https://github.com/meta-llama/llama-stack/pull/2868. This is
a follow-up to #3061.
Replaces https://github.com/meta-llama/llama-stack/pull/2868
Part of https://github.com/meta-llama/llama-stack/issues/2865
cc @leseb @rhuss
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
# What does this PR do?
This PR adds a pre-commit step to enforce use of the `llama_stack` logger.
Currently, various parts of the code base use different loggers. Since a
custom `llama_stack` logger already exists and is used in the codebase, it
is better to standardize on it.
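For reference, a minimal sketch of the pattern the check is meant to enforce (the category value is just an example; see `llama_stack.log` for the exact signature):
```
from llama_stack.log import get_logger

logger = get_logger(name=__name__, category="inference")

logger.info("loading provider")  # instead of logging.getLogger(__name__) or print()
```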
Signed-off-by: Mustafa Elbehery <melbeher@redhat.com>
Co-authored-by: Matthew Farrellee <matt@cs.wisc.edu>
# What does this PR do?
- Add base_url field to OpenAIConfig with default
"https://api.openai.com/v1"
- Update sample_run_config to support OPENAI_BASE_URL environment
variable
- Modify get_base_url() to return configured base_url instead of
hardcoded value
- Add comprehensive test suite covering:
- Default base URL behavior
- Custom base URL from config
- Environment variable override
- Config precedence over environment variables
- Client initialization with configured URL
- Model availability checks using configured URL
This enables users to configure custom OpenAI-compatible API endpoints
via environment variables or configuration files.
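A minimal sketch of the configuration shape described above (the class below is a simplified stand-in for the real OpenAIConfig):
```
import os

from pydantic import BaseModel, Field


class OpenAIConfigSketch(BaseModel):
    api_key: str | None = None
    # Defaults to the public endpoint; can point at any OpenAI-compatible server.
    base_url: str = Field(default="https://api.openai.com/v1")


# Environment override, mirroring the OPENAI_BASE_URL support in sample_run_config.
config = OpenAIConfigSketch(base_url=os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"))
print(config.base_url)
```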
Closes #2910
## Test Plan
run unit tests
# What does this PR do?
openai/models.py has backward compat entries for litellm model names.
the starter template includes these in the list of registered models.
the inclusion results in duplicate model registrations.
the backward compat is no longer necessary.
## Test Plan
ci
# What does this PR do?
add an `OpenAIMixin` for use by inference providers whose remote
endpoints support an OpenAI-compatible API (a condensed sketch follows
the list below).
use is demonstrated by refactoring
- OpenAIInferenceAdapter
- NVIDIAInferenceAdapter (adds embedding support)
- LlamaCompatInferenceAdapter
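a condensed sketch of the mixin's shape (method names are simplified assumptions, not the full interface):
```
from openai import AsyncOpenAI


class OpenAIMixinSketch:
    """Adapters supply credentials and endpoint; the mixin supplies the client."""

    def get_api_key(self) -> str:   # implemented by each adapter
        raise NotImplementedError

    def get_base_url(self) -> str:  # implemented by each adapter
        raise NotImplementedError

    @property
    def client(self) -> AsyncOpenAI:
        return AsyncOpenAI(api_key=self.get_api_key(), base_url=self.get_base_url())

    async def openai_chat_completion(self, model: str, messages: list[dict], **kwargs):
        return await self.client.chat.completions.create(model=model, messages=messages, **kwargs)
```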
## Test Plan
existing unit and integration tests
# What does this PR do?
Some of our inference providers support passthrough authentication via
`x-llamastack-provider-data` header values. This fixes those providers
so they no longer cache their clients to the backend providers (mostly
OpenAI client instances), meaning the client connecting to Llama Stack
must provide those auth values on each and every request.
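The essence of the change, as a sketch (class and helper names are illustrative):
```
from openai import AsyncOpenAI


class ExampleAdapter:
    """Before: the AsyncOpenAI client was built once and cached.
    After: it is built per access, so a per-request key from provider data is respected."""

    def get_api_key(self) -> str:
        # Hypothetical helper: resolve the key from the request's provider data
        # (x-llamastack-provider-data header), falling back to the configured key.
        ...

    @property
    def client(self) -> AsyncOpenAI:
        # No caching: construct a fresh client with whatever key this request carries.
        return AsyncOpenAI(api_key=self.get_api_key())
```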
## Test Plan
I added some unit tests to ensure we're not caching clients across
requests for all the fixed providers in this PR.
```
uv run pytest -sv tests/unit/providers/inference/test_inference_client_caching.py
```
I also ran some of our OpenAI compatible API integration tests for each
of the changed providers, just to ensure they still work. Note that
these providers don't actually pass all these tests (for unrelated
reasons due to quirks of the Groq and Together SaaS services), but
enough of the tests passed to confirm the clients are still working as
intended.
### Together
```
ENABLE_TOGETHER="together" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "together/meta-llama/Llama-3.1-8B-Instruct"
```
### OpenAI
```
ENABLE_OPENAI="openai" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "openai/gpt-4o-mini"
```
### Groq
```
ENABLE_GROQ="groq" \
uv run llama stack run llama_stack/templates/starter/run.yaml
LLAMA_STACK_CONFIG=http://localhost:8321 \
uv run pytest -sv \
tests/integration/inference/test_openai_completion.py \
--text-model "groq/meta-llama/Llama-3.1-8B-Instruct"
```
---------
Signed-off-by: Ben Browning <bbrownin@redhat.com>
# What does this PR do?
We are now testing the safety capability with the starter image. This
includes a few changes:
* Enable the safety integration test
* Relax the shield model requirements from llama-guard to make it work
with llama-guard3:8b coming from Ollama
* Expose a shield for each inference provider in the starter distro. The
shield will only be registered if the provider is enabled.
Closes: https://github.com/meta-llama/llama-stack/issues/2528
Signed-off-by: Sébastien Han <seb@redhat.com>
# What does this PR do?
* Given that our API packages use `import *` in `__init__.py`, we don't
need to write `from llama_stack.apis.models.models import ...` but simply
`from llama_stack.apis.models import ...` (see the illustration below).
The decision to use `import *` is debatable and should probably be
revisited at some point.
* Remove the unneeded Ruff F401 rule
* Consolidate the Ruff F403 rule in the pyproject
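Illustration of the import change (the `Model` symbol is just an example):
```
# Before: importing from the inner module directly
# from llama_stack.apis.models.models import Model

# After: the package re-exports its symbols via `import *` in __init__.py
from llama_stack.apis.models import Model
```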
Signed-off-by: Sébastien Han <seb@redhat.com>
For code completion, apps need "fill in the middle" capabilities.
Added an optional `suffix` parameter to `openai_completion` to enable this.
Updated the ollama provider to showcase it.
### Test Plan
```
pytest -sv --stack-config="inference=ollama" tests/integration/inference/test_openai_completion.py --text-model qwen2.5-coder:1.5b -k test_openai_completion_non_streaming_suffix
```
### OpenAI Sample script
```
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8321/v1/openai/v1")
response = client.completions.create(
model="qwen2.5-coder:1.5b",
prompt="The capital of ",
suffix="is Paris.",
max_tokens=10,
)
print(response.choices[0].text)
```
### Output
```
France is ____.
To answer this question, we
```
# What does this PR do?
Adds a new OpenAI-compatible embeddings endpoint:
`/openai/v1/embeddings`
Added implementations for the OpenAI, LiteLLM and SentenceTransformer providers.
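For example, the endpoint can be exercised with the standard OpenAI client pointed at a running Llama Stack server (model name and port are examples):
```
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8321/v1/openai/v1", api_key="none")

response = client.embeddings.create(
    model="all-MiniLM-L6-v2",
    input=["Hello, world!"],
)
print(len(response.data[0].embedding))
```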
## Test Plan
```
LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004
```
# What does this PR do?
Since https://github.com/meta-llama/llama-stack/pull/2193 switched to
the openai sdk, we need to strip the 'openai/' prefix from the model_id
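The fix boils down to something like the following (a sketch, not the exact code):
```
# The routed model id looks like "openai/gpt-4o-mini"; the openai sdk expects
# just "gpt-4o-mini", so strip the provider prefix before making the call.
def provider_model_id(model_id: str) -> str:
    return model_id.removeprefix("openai/")


assert provider_model_id("openai/gpt-4o-mini") == "gpt-4o-mini"
```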
## Test Plan
start server with openai provider and send a chat completion call
# What does this PR do?
fixes #2121
this implementation splits responsibility between the litellm and openai
libraries (a sketch of the split follows the table below) -
| Inference Method | Implementation Source |
|----------------------------|--------------------------|
| completion | LiteLLMOpenAIMixin |
| chat_completion | LiteLLMOpenAIMixin |
| embedding | LiteLLMOpenAIMixin |
| batch_completion | LiteLLMOpenAIMixin |
| batch_chat_completion | LiteLLMOpenAIMixin |
| openai_completion | AsyncOpenAI |
| openai_chat_completion | AsyncOpenAI |
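a sketch of the split (class name and mixin wiring simplified; not the actual adapter code):
```
from openai import AsyncOpenAI


class OpenAIAdapterSketch:  # the real adapter also mixes in LiteLLMOpenAIMixin
    """completion/chat_completion/embedding/batch_* come from LiteLLMOpenAIMixin;
    the openai_* methods go straight through the AsyncOpenAI client instead."""

    async def openai_chat_completion(self, model: str, messages: list[dict], **kwargs):
        client = AsyncOpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment
        return await client.chat.completions.create(model=model, messages=messages, **kwargs)
```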
## Test Plan
smoke test with -
```
$ OPENAI_API_KEY=$LLAMA_API_KEY OPENAI_BASE_URL=https://api.llama.com/compat/v1 llama stack build --image-type conda --image-name openai --providers inference=remote::openai --run
$ llama-stack-client models register Llama-4-Scout-17B-16E-Instruct-FP8
$ curl "http://localhost:8321/v1/openai/v1/chat/completions" -H "Content-Type: application/json" \
  -d '{
"model": "Llama-4-Scout-17B-16E-Instruct-FP8",
"messages": [
{"role": "user", "content": "Hello Llama! Can you give me a quick intro?"}
]
}'
{"id":"AmPwrrkc5JgVjejPdIPrpT2","choices":[{"finish_reason":"stop","index":0,"logprobs":{"content":null,"refusal":null},"message":{"content":"Hello! I'm Llama, a Meta-designed model that adapts to your conversational style. Whether you need quick answers, deep dives into ideas, or just want to vent, joke, or brainstorm—I'm here for it. What’s on your mind?","refusal":"","role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"id":"AmPwrrkc5JgVjejPdIPrpT2"}}],"created":1747410061,"model":"Llama-4-Scout-17B-16E-Instruct-FP8","object":"chat.completions","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":54,"prompt_tokens":22,"total_tokens":76,"completion_tokens_details":null,"prompt_tokens_details":null}}
```
and run full test suite.
note: the openai provider exposes the litellm-specific model names to
the user. this change is compatible with that. the litellm names should
be deprecated.
# What does this PR do?
The goal of this PR is code base modernization.
Schema reflection code needed a minor adjustment to handle UnionTypes
and collections.abc.AsyncIterator. (Both are preferred for latest Python
releases.)
Note to reviewers: almost all changes here are automatically generated
by pyupgrade. Some additional unused imports were cleaned up. The only
change worth noting can be found under `docs/openapi_generator` and
`llama_stack/strong_typing/schema.py`, where reflection code was updated
to deal with "newer" types.
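Typical of the mechanical rewrites (an illustrative snippet, not taken from the diff):
```
# Before (older typing idioms):
#   from typing import AsyncIterator, Optional, Union
#   async def stream(x: Optional[int]) -> AsyncIterator[Union[str, bytes]]: ...

# After (what pyupgrade emits for newer Python releases):
from collections.abc import AsyncIterator


async def stream(x: int | None) -> AsyncIterator[str | bytes]:
    yield str(x)
```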
Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
Groq has never supported raw completions anyhow, so this makes it easier
to switch it to LiteLLM. Our entire test suite passes.
I also updated all the openai-compat providers so they work with API
keys passed via headers (`provider_data`).
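For example, a per-request key can ride along in the provider-data header instead of living in the server config (the `groq_api_key` field name is an assumption for illustration; each provider documents the keys it reads):
```
import json

import requests

headers = {
    "Content-Type": "application/json",
    # Per-request credentials are passed as JSON in this header.
    "x-llamastack-provider-data": json.dumps({"groq_api_key": "gsk-..."}),
}
body = {
    "model": "groq/llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello!"}],
}
resp = requests.post("http://localhost:8321/v1/openai/v1/chat/completions", headers=headers, json=body)
print(resp.json())
```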
## Test Plan
```bash
LLAMA_STACK_CONFIG=groq \
pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--inference-model=groq/llama-3.3-70b-versatile --vision-inference-model=""
```
Also tested (openai, anthropic, gemini) providers. No regressions.
# What does this PR do?
This PR introduces more non-llama model support to llama stack.
Providers introduced: openai, anthropic and gemini. All of these
providers use essentially the same piece of code -- the implementation
works via the `litellm` library.
We expose only specific models for the providers we enable, making sure
they all work well and pass tests. This setup (instead of automatically
enabling _all_ providers and models allowed by LiteLLM) also ensures we
can perform any needed prompt tuning on a per-model basis (just as we do
for llama models).
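Conceptually, each provider ships an explicit, curated allow-list of models rather than everything LiteLLM knows about (the mapping below is illustrative only):
```
# Curated per-provider model ids users can register; anything outside the list
# is not exposed, which keeps the tested/tuned surface small.
CURATED_MODELS: dict[str, list[str]] = {
    "openai": ["openai/gpt-4o"],
    "anthropic": ["anthropic/claude-3-5-sonnet-latest"],
    "gemini": ["gemini/gemini-1.5-flash"],
}
```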
## Test Plan
```bash
#!/bin/bash
args=("$@")
for model in openai/gpt-4o anthropic/claude-3-5-sonnet-latest gemini/gemini-1.5-flash; do
LLAMA_STACK_CONFIG=dev pytest -s -v tests/client-sdk/inference/test_text_inference.py \
--embedding-model=all-MiniLM-L6-v2 \
--vision-inference-model="" \
--inference-model=$model "${args[@]}"
done
```