# What does this PR do?

Addresses issue #679 - Adds support for the `response_format` field for chat completions and completions so users can get their outputs in JSON.

## Test Plan

<details>
<summary>Integration tests</summary>

`pytest llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output -k ollama -s -v`

```
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-ollama] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-ollama] PASSED

================================== 2 passed, 18 deselected, 3 warnings in 41.41s ==================================
```

</details>

<details>
<summary>Manual Tests</summary>

```
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export OLLAMA_INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=5000

ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m

llama stack build --template ollama --image-type conda
llama stack run ./run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://localhost:11434
```

```python
import json
import os

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

prompt = """
Create a step by step plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.
You have 3 different operations you can perform. You can create a file, update a file, or delete a file.
Limit your step by step plan to only these operations per step.
Don't create more than 10 steps.
Please ensure there's a README.md file in the root of the codebase that describes the codebase and how to run it.
Please ensure there's a requirements.txt file in the root of the codebase that describes the dependencies of the codebase.
"""

response = client.inference.chat_completion(
    model_id=MODEL_ID,
    messages=[
        {"role": "user", "content": prompt},
    ],
    sampling_params={
        "max_tokens": 200000,
    },
    response_format={
        "type": "json_schema",
        "json_schema": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "title": "Plan",
            "description": "A plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.",
            "type": "object",
            "properties": {
                "steps": {
                    "type": "array",
                    "items": {"type": "string"},
                }
            },
            "required": ["steps"],
            "additionalProperties": False,
        },
    },
    stream=True,
)

content = ""
for chunk in response:
    if chunk.event.delta:
        print(chunk.event.delta, end="", flush=True)
        content += chunk.event.delta

try:
    plan = json.loads(content)
    print(plan)
except Exception as e:
    print(f"Error parsing plan into JSON: {e}")
    plan = {"steps": []}
```

Outputs:

```json
{
  "steps": [
    "Update the requirements.txt file to include the updated dependencies specified in the peer's feedback, including the Google Cloud Translation API key.",
    "Update the app.py file to address the code smells and incorporate the suggested improvements, such as handling errors and exceptions, initializing the Translator object correctly, adding input validation, using type hints and docstrings, and removing unnecessary logging statements.",
    "Create a README.md file that describes the codebase and how to run it.",
    "Ensure the README.md file is up-to-date and accurate.",
    "Update the requirements.txt file to reflect any additional dependencies specified by the peer's feedback.",
    "Add documentation for each function in the app.py file using docstrings.",
    "Implement logging statements throughout the app.py file to monitor application execution.",
    "Test the API endpoint to ensure it correctly translates text from English to French and handles errors properly.",
    "Refactor the code to follow PEP 8 style guidelines and ensure consistency in naming conventions, indentation, and spacing.",
    "Create a new folder for logs and add a logging configuration file (e.g., logconfig.json) that specifies the logging level and output destination.",
    "Deploy the web server on a production environment (e.g., AWS Elastic Beanstalk or Google Cloud Platform) to make it accessible to external users."
  ]
}
```

</details>

## Sources

- Ollama api docs: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
- Ollama structured output docs: https://github.com/ollama/ollama/blob/main/docs/api.md#request-structured-outputs

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# Testing Llama Stack Providers
The Llama Stack is designed as a collection of Lego blocks -- various APIs -- which are composable and can be used to quickly and reliably build an app. We need a testing setup which is relatively flexible to enable easy combinations of these providers.
We use `pytest` and all of its dynamism to enable the features needed. Specifically:

- We use `pytest_addoption` to add CLI options allowing you to override providers, models, etc.
- We use `pytest_generate_tests` to dynamically parametrize our tests. This allows us to support a default set of (providers, models, etc.) combinations but retain the flexibility to override them via the CLI if needed.
- We use `pytest_configure` to make sure we dynamically add appropriate marks based on the fixtures we make.
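The real wiring lives in the per-API `conftest.py` and `fixtures.py` files; the following is only a rough, hypothetical sketch of how these three hooks fit together (option names and defaults here are illustrative, not the actual implementation):

```python
# conftest.py -- illustrative sketch only; the real files under
# llama_stack/providers/tests/<api>/ are more involved.
import pytest

# Hypothetical defaults; the real suite defines its own combinations.
DEFAULT_PROVIDERS = ["meta_reference", "ollama"]
DEFAULT_MODELS = ["llama_8b", "llama_3b"]


def pytest_addoption(parser):
    # Expose CLI overrides such as --providers / --inference-model.
    parser.addoption("--providers", default="", help="api1=fixture1,api2=fixture2 overrides")
    parser.addoption("--inference-model", default=None, help="override the default inference models")


def _params(values):
    # Attach a mark named after each value so `-m "ollama and llama_3b"` can select it.
    return [pytest.param(v, marks=getattr(pytest.mark, v)) for v in values]


def pytest_generate_tests(metafunc):
    # Parametrize every test that requests a `provider` or `model` fixture.
    if "provider" in metafunc.fixturenames:
        metafunc.parametrize("provider", _params(DEFAULT_PROVIDERS))
    if "model" in metafunc.fixturenames:
        override = metafunc.config.getoption("--inference-model")
        metafunc.parametrize("model", [override] if override else _params(DEFAULT_MODELS))


def pytest_configure(config):
    # Declare the marks up front so pytest does not warn about unknown markers.
    for name in DEFAULT_PROVIDERS + DEFAULT_MODELS:
        config.addinivalue_line("markers", f"{name}: auto-generated mark for {name}")
```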
## Common options
All tests support a `--providers` option which can be a string of the form `api1=provider_fixture1,api2=provider_fixture2`. So, when testing safety (which needs both the inference and safety APIs) you can use `--providers inference=together,safety=meta_reference` to use these fixtures in concert.
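For example, a full safety run that mixes Together inference with the reference safety implementation might look like this (the test file path is an assumption; point it at whichever suite you are running):

```bash
pytest -s -v llama_stack/providers/tests/safety/test_safety.py \
  --providers inference=together,safety=meta_reference \
  --env TOGETHER_API_KEY=<...>
```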
Depending on the API, there are custom options enabled. For example, `inference` tests allow for an `--inference-model` override, etc.
By default, we disable warnings and enable short tracebacks. You can override them using pytest's flags as appropriate.
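For instance, to get long tracebacks and stricter warning handling for a single run (assuming the suite's defaults are set in a way these standard pytest flags can override), you could pass:

```bash
pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py \
  --tb=long \
  -W error::DeprecationWarning
```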
Some providers need special API keys or other configuration options to work. You can check out the individual fixtures (located in `tests/<api>/fixtures.py`) for what these keys are. These can be specified using the `--env` CLI option. You can also have them present in the environment (exporting in your shell) or put them in the `.env` file in the directory from which you run the test. For example, to use the Together fixture you can pass `--env TOGETHER_API_KEY=<...>`.
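For example, the same Together key can be supplied in any of these ways (the key value is a placeholder):

```bash
# 1. On the pytest command line
pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py \
  -m "together and llama_3b" \
  --env TOGETHER_API_KEY=<...>

# 2. Exported in your shell before running pytest
export TOGETHER_API_KEY=<...>

# 3. In a .env file in the directory you run pytest from
echo "TOGETHER_API_KEY=<...>" >> .env
```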
## Inference
We have the following orthogonal parametrizations (pytest "marks") for inference tests:
- providers: (meta_reference, together, fireworks, ollama)
- models: (llama_8b, llama_3b)
If you want to run a test with the llama_8b model on Fireworks, you can use:

```bash
pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py \
  -m "fireworks and llama_8b" \
  --env FIREWORKS_API_KEY=<...>
```
You can make the mark expression more complex, for example to run both llama_8b and llama_3b on Fireworks, but only llama_3b with Ollama:

```bash
pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py \
  -m "fireworks or (ollama and llama_3b)" \
  --env FIREWORKS_API_KEY=<...>
```
Finally, you can override the model completely by doing:

```bash
pytest -s -v llama_stack/providers/tests/inference/test_text_inference.py \
  -m fireworks \
  --inference-model "meta-llama/Llama3.1-70B-Instruct" \
  --env FIREWORKS_API_KEY=<...>
```
## Agents
The Agents API composes three other APIs underneath:
- Inference
- Safety
- Memory
Given that each of these has several fixtures, the set of combinations is large. We provide a default set of combinations (see `tests/agents/conftest.py`) with easy-to-use "marks":

- `meta_reference` -- uses all the `meta_reference` fixtures for the dependent APIs
- `together` -- uses Together for inference, and `meta_reference` for the rest
- `ollama` -- uses Ollama for inference, and `meta_reference` for the rest
An example test with Together:

```bash
pytest -s -m together llama_stack/providers/tests/agents/test_agents.py \
  --env TOGETHER_API_KEY=<...>
```
If you want to override the inference model or safety model used, you can use the `--inference-model` or `--safety-shield` CLI options as appropriate.
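For example, a possible invocation combining both overrides with the Together mark might look like this (the shield identifier below is a placeholder; substitute whatever shield your safety fixture expects):

```bash
pytest -s -m together llama_stack/providers/tests/agents/test_agents.py \
  --inference-model "meta-llama/Llama3.1-70B-Instruct" \
  --safety-shield <shield_identifier> \
  --env TOGETHER_API_KEY=<...>
```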
If you want to test a remotely hosted stack, you can use `-m remote` as follows:

```bash
pytest -s -m remote llama_stack/providers/tests/agents/test_agents.py \
  --env REMOTE_STACK_URL=<...>
```