mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-07-19 11:20:03 +00:00
13 commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
|
b57db11bed
|
feat: create dynamic model registration for OpenAI and Llama compat remote inference providers (#2745)
Some checks failed
Vector IO Integration Tests / test-matrix (3.12, inline::faiss) (push) Failing after 5s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 6s
Vector IO Integration Tests / test-matrix (3.13, inline::milvus) (push) Failing after 4s
Python Package Build Test / build (3.13) (push) Failing after 2s
Test Llama Stack Build / generate-matrix (push) Successful in 6s
Vector IO Integration Tests / test-matrix (3.13, inline::sqlite-vec) (push) Failing after 7s
Vector IO Integration Tests / test-matrix (3.12, inline::sqlite-vec) (push) Failing after 9s
Update ReadTheDocs / update-readthedocs (push) Failing after 3s
Test Llama Stack Build / build-single-provider (push) Failing after 7s
Integration Tests / discover-tests (push) Successful in 13s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 13s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 12s
Vector IO Integration Tests / test-matrix (3.12, remote::pgvector) (push) Failing after 17s
Vector IO Integration Tests / test-matrix (3.13, remote::pgvector) (push) Failing after 15s
Integration Tests / test-matrix (push) Failing after 5s
Unit Tests / unit-tests (3.12) (push) Failing after 12s
Vector IO Integration Tests / test-matrix (3.12, remote::chromadb) (push) Failing after 19s
Vector IO Integration Tests / test-matrix (3.12, inline::milvus) (push) Failing after 19s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 22s
Vector IO Integration Tests / test-matrix (3.13, remote::chromadb) (push) Failing after 17s
Test External Providers / test-external-providers (venv) (push) Failing after 17s
Test Llama Stack Build / build (push) Failing after 14s
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 35s
Python Package Build Test / build (3.12) (push) Failing after 51s
Vector IO Integration Tests / test-matrix (3.13, inline::faiss) (push) Failing after 57s
Unit Tests / unit-tests (3.13) (push) Failing after 53s
Pre-commit / pre-commit (push) Successful in 1m42s
# What does this PR do? <!-- Provide a short summary of what this PR does and why. Link to relevant issues if applicable. --> The purpose of this task is to create a solution that can automatically detect when new models are added, deprecated, or removed by OpenAI and Llama API providers, and automatically update the list of supported models in LLamaStack. This feature is vitally important in order to avoid missing new models and editing the entries manually hence I created automation allowing users to dynamically register: - any models from OpenAI provider available at [https://api.openai.com/v1/models](https://api.openai.com/v1/models) that are not in [https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/openai/models.py](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/openai/models.py) - any models from Llama API provider available at [https://api.llama.com/v1/models](https://api.llama.com/v1/models) that are not in [https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/llama_openai_compat/models.py](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/llama_openai_compat/models.py) <!-- If resolving an issue, uncomment and update the line below --> <!-- Closes #[issue-number] --> Closes #2504 this PR is dependant on #2710 ## Test Plan <!-- Describe the tests you ran to verify your changes with result summaries. *Provide clear instructions so the plan can be easily re-executed.* --> 1. Create venv at root llamastack directory: `uv venv .venv --python 3.12 --seed` 2. Activate venv: `source .venv/bin/activate` 3. `uv pip install -e .` 4. Create OpenAI distro modifying run.yaml 5. Build distro: `llama stack build --template starter --image-type venv` 6. Then run LlamaStack, but before navigate to templates/starter folder: `llama stack run run.yaml --image-type venv OPENAI_API_KEY=<YOUR_KEY> ENABLE_OPENAI=openai` 7. Then try to register dummy llm that doesn't exist in OpenAI provider: ` llama-stack-client models register ianm/ianllm --provider-model-id=ianllm --provider-id=openai ` You should receive this output - combined list of static config + fetched available models from OpenAI: <img width="1380" height="474" alt="Screenshot 2025-07-14 at 12 48 50" src="https://github.com/user-attachments/assets/d26aad18-6b15-49ee-9c49-b01b2d33f883" /> 8. Then register real llm from OpenAI: llama-stack-client models register openai/gpt-4-turbo-preview --provider-model-id=gpt-4-turbo-preview --provider-id=openai <img width="1253" height="613" alt="Screenshot 2025-07-14 at 13 43 02" src="https://github.com/user-attachments/assets/60a5c9b1-3468-4eb9-9e92-cd7d21de3ca0" /> <img width="1288" height="655" alt="Screenshot 2025-07-14 at 13 43 11" src="https://github.com/user-attachments/assets/c1e48871-0e24-4bd9-a0b8-8c95552a51ee" /> We correctly fetched all available models from OpenAI As for Llama API, as a non-US person I don't have access to Llama API Key but I joined wait list. The implementation for Llama is the same as for OpenAI since Llama is openai compatible. So, the response from GET endpoint has the same structure as OpenAI https://llama.developer.meta.com/docs/api/models |
||
|
51d9fd4808
|
fix: Don't cache clients for passthrough auth providers (#2728)
Some checks failed
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 43s
Unit Tests / unit-tests (3.12) (push) Failing after 45s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 2s
Vector IO Integration Tests / test-matrix (3.12, inline::milvus) (push) Failing after 4s
Integration Tests / discover-tests (push) Successful in 6s
Vector IO Integration Tests / test-matrix (3.12, remote::pgvector) (push) Failing after 7s
Pre-commit / pre-commit (push) Successful in 2m8s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 4s
Test Llama Stack Build / generate-matrix (push) Successful in 5s
Vector IO Integration Tests / test-matrix (3.13, remote::pgvector) (push) Failing after 7s
Vector IO Integration Tests / test-matrix (3.12, inline::sqlite-vec) (push) Failing after 9s
Vector IO Integration Tests / test-matrix (3.13, inline::faiss) (push) Failing after 9s
Vector IO Integration Tests / test-matrix (3.12, inline::faiss) (push) Failing after 11s
SqlStore Integration Tests / test-postgres (3.13) (push) Failing after 12s
Test Llama Stack Build / build-single-provider (push) Failing after 7s
Python Package Build Test / build (3.13) (push) Failing after 5s
Python Package Build Test / build (3.12) (push) Failing after 7s
Unit Tests / unit-tests (3.13) (push) Failing after 6s
Vector IO Integration Tests / test-matrix (3.12, remote::chromadb) (push) Failing after 13s
Test External Providers / test-external-providers (venv) (push) Failing after 7s
Vector IO Integration Tests / test-matrix (3.13, inline::milvus) (push) Failing after 11s
Vector IO Integration Tests / test-matrix (3.13, inline::sqlite-vec) (push) Failing after 12s
Update ReadTheDocs / update-readthedocs (push) Failing after 6s
Integration Tests / test-matrix (push) Failing after 6s
Test Llama Stack Build / build (push) Failing after 4s
Vector IO Integration Tests / test-matrix (3.13, remote::chromadb) (push) Failing after 12s
SqlStore Integration Tests / test-postgres (3.12) (push) Failing after 16s
# What does this PR do? Some of our inference providers support passthrough authentication via `x-llamastack-provider-data` header values. This fixes the providers that support passthrough auth to not cache their clients to the backend providers (mostly OpenAI client instances) so that the client connecting to Llama Stack has to provide those auth values on each and every request. ## Test Plan I added some unit tests to ensure we're not caching clients across requests for all the fixed providers in this PR. ``` uv run pytest -sv tests/unit/providers/inference/test_inference_client_caching.py ``` I also ran some of our OpenAI compatible API integration tests for each of the changed providers, just to ensure they still work. Note that these providers don't actually pass all these tests (for unrelated reasons due to quirks of the Groq and Together SaaS services), but enough of the tests passed to confirm the clients are still working as intended. ### Together ``` ENABLE_TOGETHER="together" \ uv run llama stack run llama_stack/templates/starter/run.yaml LLAMA_STACK_CONFIG=http://localhost:8321 \ uv run pytest -sv \ tests/integration/inference/test_openai_completion.py \ --text-model "together/meta-llama/Llama-3.1-8B-Instruct" ``` ### OpenAI ``` ENABLE_OPENAI="openai" \ uv run llama stack run llama_stack/templates/starter/run.yaml LLAMA_STACK_CONFIG=http://localhost:8321 \ uv run pytest -sv \ tests/integration/inference/test_openai_completion.py \ --text-model "openai/gpt-4o-mini" ``` ### Groq ``` ENABLE_GROQ="groq" \ uv run llama stack run llama_stack/templates/starter/run.yaml LLAMA_STACK_CONFIG=http://localhost:8321 \ uv run pytest -sv \ tests/integration/inference/test_openai_completion.py \ --text-model "groq/meta-llama/Llama-3.1-8B-Instruct" ``` --------- Signed-off-by: Ben Browning <bbrownin@redhat.com> |
||
|
9b7eecebcf
|
ci: test safety with starter (#2628)
Some checks failed
Integration Tests / test-matrix (server, 3.13, inspect) (push) Failing after 7s
Integration Tests / test-matrix (server, 3.13, providers) (push) Failing after 11s
Integration Tests / test-matrix (server, 3.13, vector_io) (push) Failing after 10s
Integration Tests / test-matrix (server, 3.13, scoring) (push) Failing after 15s
Vector IO Integration Tests / test-matrix (3.12, inline::faiss) (push) Failing after 15s
Vector IO Integration Tests / test-matrix (3.12, inline::milvus) (push) Failing after 13s
Vector IO Integration Tests / test-matrix (3.12, inline::sqlite-vec) (push) Failing after 11s
Vector IO Integration Tests / test-matrix (3.12, remote::chromadb) (push) Failing after 10s
Vector IO Integration Tests / test-matrix (3.13, inline::faiss) (push) Failing after 7s
Integration Tests / test-matrix (server, 3.13, safety) (push) Failing after 25s
Integration Tests / test-matrix (server, 3.13, post_training) (push) Failing after 27s
Integration Tests / test-matrix (server, 3.13, tool_runtime) (push) Failing after 23s
Vector IO Integration Tests / test-matrix (3.12, remote::pgvector) (push) Failing after 15s
Vector IO Integration Tests / test-matrix (3.13, inline::sqlite-vec) (push) Failing after 7s
Vector IO Integration Tests / test-matrix (3.13, inline::milvus) (push) Failing after 9s
Test Llama Stack Build / generate-matrix (push) Successful in 14s
Vector IO Integration Tests / test-matrix (3.13, remote::pgvector) (push) Failing after 16s
Test Llama Stack Build / build-single-provider (push) Failing after 14s
Integration Tests / test-matrix (server, 3.12, tool_runtime) (push) Failing after 1m7s
Update ReadTheDocs / update-readthedocs (push) Failing after 12s
Unit Tests / unit-tests (3.13) (push) Failing after 14s
Test Llama Stack Build / build-ubi9-container-distribution (push) Failing after 29s
Test External Providers / test-external-providers (venv) (push) Failing after 17s
Test Llama Stack Build / build (push) Failing after 13s
Unit Tests / unit-tests (3.12) (push) Failing after 15s
Vector IO Integration Tests / test-matrix (3.13, remote::chromadb) (push) Failing after 35s
Python Package Build Test / build (3.12) (push) Failing after 31s
Python Package Build Test / build (3.13) (push) Failing after 29s
Test Llama Stack Build / build-custom-container-distribution (push) Failing after 34s
Pre-commit / pre-commit (push) Successful in 1m24s
# What does this PR do? We are now testing the safety capability with the starter image. This includes a few changes: * Enable the safety integration test * Relax the shield model requirements from llama-guard to make it work with llama-guard3:8b coming from Ollama * Expose a shield for each inference provider in the starter distro. The shield will only be registered if the provider is enabled. Closes: https://github.com/meta-llama/llama-stack/issues/2528 Signed-off-by: Sébastien Han <seb@redhat.com> |
||
|
ac5fd57387
|
chore: remove nested imports (#2515)
# What does this PR do? * Given that our API packages use "import *" in `__init.py__` we don't need to do `from llama_stack.apis.models.models` but simply from llama_stack.apis.models. The decision to use `import *` is debatable and should probably be revisited at one point. * Remove unneeded Ruff F401 rule * Consolidate Ruff F403 rule in the pyprojectfrom llama_stack.apis.models.models Signed-off-by: Sébastien Han <seb@redhat.com> |
||
|
985d0b156c
|
feat: Add suffix to openai_completions (#2449)
Some checks failed
Integration Tests / test-matrix (library, 3.10, inspect) (push) Failing after 9s
Integration Tests / test-matrix (http, 3.11, providers) (push) Failing after 5s
Integration Tests / test-matrix (library, 3.10, providers) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.10, post_training) (push) Failing after 10s
Integration Tests / test-matrix (library, 3.10, scoring) (push) Failing after 13s
Integration Tests / test-matrix (library, 3.11, agents) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, inference) (push) Failing after 6s
Integration Tests / test-matrix (library, 3.11, datasets) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.11, inspect) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, post_training) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, providers) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.11, scoring) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.11, vector_io) (push) Failing after 8s
Integration Tests / test-matrix (library, 3.11, tool_runtime) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, agents) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, datasets) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, inspect) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, inference) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, providers) (push) Failing after 7s
Integration Tests / test-matrix (library, 3.12, post_training) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, scoring) (push) Failing after 11s
Integration Tests / test-matrix (library, 3.12, tool_runtime) (push) Failing after 9s
Test External Providers / test-external-providers (venv) (push) Failing after 9s
Integration Tests / test-matrix (library, 3.12, vector_io) (push) Failing after 14s
Unit Tests / unit-tests (3.10) (push) Failing after 19s
Unit Tests / unit-tests (3.11) (push) Failing after 20s
Unit Tests / unit-tests (3.12) (push) Failing after 18s
Unit Tests / unit-tests (3.13) (push) Failing after 16s
Update ReadTheDocs / update-readthedocs (push) Failing after 8s
Pre-commit / pre-commit (push) Successful in 58s
For code completion apps need "fill in the middle" capabilities. Added option of `suffix` to `openai_completion` to enable this. Updated ollama provider to showcase the same. ### Test Plan ``` pytest -sv --stack-config="inference=ollama" tests/integration/inference/test_openai_completion.py --text-model qwen2.5-coder:1.5b -k test_openai_completion_non_streaming_suffix ``` ### OpenAI Sample script ``` from openai import OpenAI client = OpenAI(base_url="http://localhost:8321/v1/openai/v1") response = client.completions.create( model="qwen2.5-coder:1.5b", prompt="The capital of ", suffix="is Paris.", max_tokens=10, ) print(response.choices[0].text) ``` ### Output ``` France is ____. To answer this question, we ``` |
||
|
b21050935e
|
feat: New OpenAI compat embeddings API (#2314)
Some checks failed
Integration Tests / test-matrix (http, agents) (push) Failing after 9s
Integration Tests / test-matrix (http, scoring) (push) Failing after 9s
Integration Tests / test-matrix (library, inference) (push) Failing after 9s
Integration Tests / test-matrix (library, inspect) (push) Failing after 9s
Integration Tests / test-matrix (library, post_training) (push) Failing after 15s
Integration Tests / test-matrix (library, providers) (push) Failing after 14s
Integration Auth Tests / test-matrix (oauth2_token) (push) Failing after 43s
Integration Tests / test-matrix (library, scoring) (push) Failing after 8s
Integration Tests / test-matrix (http, inference) (push) Failing after 46s
Integration Tests / test-matrix (library, tool_runtime) (push) Failing after 8s
Integration Tests / test-matrix (library, agents) (push) Failing after 44s
Integration Tests / test-matrix (http, inspect) (push) Failing after 47s
Integration Tests / test-matrix (http, providers) (push) Failing after 45s
Integration Tests / test-matrix (library, datasets) (push) Failing after 45s
Integration Tests / test-matrix (http, post_training) (push) Failing after 46s
Integration Tests / test-matrix (http, tool_runtime) (push) Failing after 47s
Integration Tests / test-matrix (http, datasets) (push) Failing after 49s
Test External Providers / test-external-providers (venv) (push) Failing after 6s
Update ReadTheDocs / update-readthedocs (push) Failing after 6s
Unit Tests / unit-tests (3.12) (push) Failing after 7s
Unit Tests / unit-tests (3.10) (push) Failing after 8s
Unit Tests / unit-tests (3.11) (push) Failing after 8s
Unit Tests / unit-tests (3.13) (push) Failing after 7s
Pre-commit / pre-commit (push) Successful in 1m12s
# What does this PR do? Adds a new endpoint that is compatible with OpenAI for embeddings api. `/openai/v1/embeddings` Added providers for OpenAI, LiteLLM and SentenceTransformer. ## Test Plan ``` LLAMA_STACK_CONFIG=http://localhost:8321 pytest -sv tests/integration/inference/test_openai_embeddings.py --embedding-model all-MiniLM-L6-v2,text-embedding-3-small,gemini/text-embedding-004 ``` |
||
|
8feb1827c8
|
fix: openai provider model id (#2229)
# What does this PR do? Since https://github.com/meta-llama/llama-stack/pull/2193 switched to openai sdk, we need to strip 'openai/' from the model_id ## Test Plan start server with openai provider and send a chat completion call |
||
|
64f8d4c3ad
|
feat: use openai-python for openai inference provider (#2193)
# What does this PR do? fixes #2121 this implementation splits reponsibility between litellm and openai libraries - | Inference Method | Implementation Source | |----------------------------|--------------------------| | completion | LiteLLMOpenAIMixin | | chat_completion | LiteLLMOpenAIMixin | | embedding | LiteLLMOpenAIMixin | | batch_completion | LiteLLMOpenAIMixin | | batch_chat_completion | LiteLLMOpenAIMixin | | openai_completion | AsyncOpenAI | | openai_chat_completion | AsyncOpenAI | ## Test Plan smoke test with - ``` $ OPENAI_API_KEY=$LLAMA_API_KEY OPENAI_BASE_URL=https://api.llama.com/compat/v1 llama stack build --image-type conda --image-name openai --providers inference=remote::openai --run $ llama-stack-client models register Llama-4-Scout-17B-16E-Instruct-FP8 $ curl "http://localhost:8321/v1/openai/v1/chat/completions" -H "Content-Type: application/json" \ -d '{ "model": "Llama-4-Scout-17B-16E-Instruct-FP8", "messages": [ {"role": "user", "content": "Hello Llama! Can you give me a quick intro?"} ] }' {"id":"AmPwrrkc5JgVjejPdIPrpT2","choices":[{"finish_reason":"stop","index":0,"logprobs":{"content":null,"refusal":null},"message":{"content":"Hello! I'm Llama, a Meta-designed model that adapts to your conversational style. Whether you need quick answers, deep dives into ideas, or just want to vent, joke, or brainstorm—I'm here for it. What’s on your mind?","refusal":"","role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"id":"AmPwrrkc5JgVjejPdIPrpT2"}}],"created":1747410061,"model":"Llama-4-Scout-17B-16E-Instruct-FP8","object":"chat.completions","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":54,"prompt_tokens":22,"total_tokens":76,"completion_tokens_details":null,"prompt_tokens_details":null}} ``` and run full test suite. |
||
|
aa5bef8e05
|
feat: expand set of known openai models, allow using openai canonical model names (#2164)
note: the openai provider exposes the litellm specific model names to the user. this change is compatible with that. the litellm names should be deprecated. |
||
|
9e6561a1ec
|
chore: enable pyupgrade fixes (#1806)
# What does this PR do? The goal of this PR is code base modernization. Schema reflection code needed a minor adjustment to handle UnionTypes and collections.abc.AsyncIterator. (Both are preferred for latest Python releases.) Note to reviewers: almost all changes here are automatically generated by pyupgrade. Some additional unused imports were cleaned up. The only change worth of note can be found under `docs/openapi_generator` and `llama_stack/strong_typing/schema.py` where reflection code was updated to deal with "newer" types. Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com> |
||
|
928a39d17b
|
feat(providers): Groq now uses LiteLLM openai-compat (#1303)
Groq has never supported raw completions anyhow. So this makes it easier to switch it to LiteLLM. All our test suite passes. I also updated all the openai-compat providers so they work with api keys passed from headers. `provider_data` ## Test Plan ```bash LLAMA_STACK_CONFIG=groq \ pytest -s -v tests/client-sdk/inference/test_text_inference.py \ --inference-model=groq/llama-3.3-70b-versatile --vision-inference-model="" ``` Also tested (openai, anthropic, gemini) providers. No regressions. |
||
|
4cf95475e5 |
fix: make vision and embedding tests pass with openai, anthropic and gemini
NOTE - Anthropic embeddings do not work due to LiteLLM not supporting them. |
||
|
63e6acd0c3
|
feat: add (openai, anthropic, gemini) providers via litellm (#1267)
# What does this PR do? This PR introduces more non-llama model support to llama stack. Providers introduced: openai, anthropic and gemini. All of these providers use essentially the same piece of code -- the implementation works via the `litellm` library. We will expose only specific models for providers we enable making sure they all work well and pass tests. This setup (instead of automatically enabling _all_ providers and models allowed by LiteLLM) ensures we can also perform any needed prompt tuning on a per-model basis as needed (just like we do it for llama models.) ## Test Plan ```bash #!/bin/bash args=("$@") for model in openai/gpt-4o anthropic/claude-3-5-sonnet-latest gemini/gemini-1.5-flash; do LLAMA_STACK_CONFIG=dev pytest -s -v tests/client-sdk/inference/test_text_inference.py \ --embedding-model=all-MiniLM-L6-v2 \ --vision-inference-model="" \ --inference-model=$model "${args[@]}" done ``` |