# What does this PR do? This stubs in some OpenAI server-side compatibility with three new endpoints: /v1/openai/v1/models /v1/openai/v1/completions /v1/openai/v1/chat/completions This gives common inference apps using OpenAI clients the ability to talk to Llama Stack using an endpoint like http://localhost:8321/v1/openai/v1 . The two "v1" instances in there isn't awesome, but the thinking is that Llama Stack's API is v1 and then our OpenAI compatibility layer is compatible with OpenAI V1. And, some OpenAI clients implicitly assume the URL ends with "v1", so this gives maximum compatibility. The openai models endpoint is implemented in the routing layer, and just returns all the models Llama Stack knows about. The following providers should be working with the new OpenAI completions and chat/completions API: * remote::anthropic (untested) * remote::cerebras-openai-compat (untested) * remote::fireworks (tested) * remote::fireworks-openai-compat (untested) * remote::gemini (untested) * remote::groq-openai-compat (untested) * remote::nvidia (tested) * remote::ollama (tested) * remote::openai (untested) * remote::passthrough (untested) * remote::sambanova-openai-compat (untested) * remote::together (tested) * remote::together-openai-compat (untested) * remote::vllm (tested) The goal to support this for every inference provider - proxying directly to the provider's OpenAI endpoint for OpenAI-compatible providers. For providers that don't have an OpenAI-compatible API, we'll add a mixin to translate incoming OpenAI requests to Llama Stack inference requests and translate the Llama Stack inference responses to OpenAI responses. This is related to #1817 but is a bit larger in scope than just chat completions, as I have real use-cases that need the older completions API as well. ## Test Plan ### vLLM ``` VLLM_URL="http://localhost:8000/v1" INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" llama stack build --template remote-vllm --image-type venv --run LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "meta-llama/Llama-3.2-3B-Instruct" ``` ### ollama ``` INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" llama stack build --template ollama --image-type venv --run LLAMA_STACK_CONFIG=http://localhost:8321 INFERENCE_MODEL="llama3.2:3b-instruct-q8_0" python -m pytest -v tests/integration/inference/test_openai_completion.py --text-model "llama3.2:3b-instruct-q8_0" ``` ## Documentation Run a Llama Stack distribution that uses one of the providers mentioned in the list above. Then, use your favorite OpenAI client to send completion or chat completion requests with the base_url set to http://localhost:8321/v1/openai/v1 . Replace "localhost:8321" with the host and port of your Llama Stack server, if different. --------- Signed-off-by: Ben Browning <bbrownin@redhat.com> |
||
---|---|---|
.. | ||
agents | ||
datasets | ||
eval | ||
fixtures | ||
inference | ||
inspect | ||
post_training | ||
providers | ||
safety | ||
scoring | ||
telemetry | ||
test_cases | ||
tool_runtime | ||
tools | ||
vector_io | ||
__init__.py | ||
conftest.py | ||
metadata.py | ||
README.md | ||
report.py |
Llama Stack Integration Tests
We use pytest
for parameterizing and running tests. You can see all options with:
cd tests/integration
# this will show a long list of options, look for "Custom options:"
pytest --help
Here are the most important options:
--stack-config
: specify the stack config to use. You have three ways to point to a stack:- a URL which points to a Llama Stack distribution server
- a template (e.g.,
fireworks
,together
) or a path to a run.yaml file - a comma-separated list of api=provider pairs, e.g.
inference=fireworks,safety=llama-guard,agents=meta-reference
. This is most useful for testing a single API surface.
--env
: set environment variables, e.g. --env KEY=value. this is a utility option to set environment variables required by various providers.
Model parameters can be influenced by the following options:
--text-model
: comma-separated list of text models.--vision-model
: comma-separated list of vision models.--embedding-model
: comma-separated list of embedding models.--safety-shield
: comma-separated list of safety shields.--judge-model
: comma-separated list of judge models.--embedding-dimension
: output dimensionality of the embedding model to use for testing. Default: 384
Each of these are comma-separated lists and can be used to generate multiple parameter combinations. Note that tests will be skipped if no model is specified.
Experimental, under development, options:
--record-responses
: record new API responses instead of using cached ones--report
: path where the test report should be written, e.g. --report=/path/to/report.md
Examples
Run all text inference tests with the together
distribution:
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=together \
--text-model=meta-llama/Llama-3.1-8B-Instruct
Run all text inference tests with the together
distribution and meta-llama/Llama-3.1-8B-Instruct
:
pytest -s -v tests/integration/inference/test_text_inference.py \
--stack-config=together \
--text-model=meta-llama/Llama-3.1-8B-Instruct
Running all inference tests for a number of models:
TEXT_MODELS=meta-llama/Llama-3.1-8B-Instruct,meta-llama/Llama-3.1-70B-Instruct
VISION_MODELS=meta-llama/Llama-3.2-11B-Vision-Instruct
EMBEDDING_MODELS=all-MiniLM-L6-v2
export TOGETHER_API_KEY=<together_api_key>
pytest -s -v tests/integration/inference/ \
--stack-config=together \
--text-model=$TEXT_MODELS \
--vision-model=$VISION_MODELS \
--embedding-model=$EMBEDDING_MODELS
Same thing but instead of using the distribution, use an adhoc stack with just one provider (fireworks
for inference):
export FIREWORKS_API_KEY=<fireworks_api_key>
pytest -s -v tests/integration/inference/ \
--stack-config=inference=fireworks \
--text-model=$TEXT_MODELS \
--vision-model=$VISION_MODELS \
--embedding-model=$EMBEDDING_MODELS
Running Vector IO tests for a number of embedding models:
EMBEDDING_MODELS=all-MiniLM-L6-v2
pytest -s -v tests/integration/vector_io/ \
--stack-config=inference=sentence-transformers,vector_io=sqlite-vec \
--embedding-model=$EMBEDDING_MODELS