# What does this PR do?

Addresses issue #679 - adds support for the `response_format` field for chat completions and completions so users can get their outputs in JSON.

## Test Plan

<details>
<summary>Integration tests</summary>

`pytest llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output -k ollama -s -v`

```python
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_8b-ollama] PASSED
llama_stack/providers/tests/inference/test_text_inference.py::TestInference::test_structured_output[llama_3b-ollama] PASSED

================================== 2 passed, 18 deselected, 3 warnings in 41.41s ==================================
```

</details>

<details>
<summary>Manual Tests</summary>

```
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export OLLAMA_INFERENCE_MODEL=llama3.2:3b-instruct-fp16
export LLAMA_STACK_PORT=5000

ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m

llama stack build --template ollama --image-type conda
llama stack run ./run.yaml \
  --port $LLAMA_STACK_PORT \
  --env INFERENCE_MODEL=$INFERENCE_MODEL \
  --env OLLAMA_URL=http://localhost:11434
```

```python
import json
import os

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

prompt = f"""
Create a step by step plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.
You have 3 different operations you can perform. You can create a file, update a file, or delete a file.
Limit your step by step plan to only these operations per step.
Don't create more than 10 steps.

Please ensure there's a README.md file in the root of the codebase that describes the codebase and how to run it.
Please ensure there's a requirements.txt file in the root of the codebase that describes the dependencies of the codebase.
"""

response = client.inference.chat_completion(
    model_id=MODEL_ID,
    messages=[
        {"role": "user", "content": prompt},
    ],
    sampling_params={
        "max_tokens": 200000,
    },
    response_format={
        "type": "json_schema",
        "json_schema": {
            "$schema": "http://json-schema.org/draft-07/schema#",
            "title": "Plan",
            "description": "A plan to complete the task of creating a codebase that is a web server that has an API endpoint that translates text from English to French.",
            "type": "object",
            "properties": {
                "steps": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    }
                }
            },
            "required": ["steps"],
            "additionalProperties": False,
        }
    },
    stream=True,
)

content = ""
for chunk in response:
    if chunk.event.delta:
        print(chunk.event.delta, end="", flush=True)
        content += chunk.event.delta

try:
    plan = json.loads(content)
    print(plan)
except Exception as e:
    print(f"Error parsing plan into JSON: {e}")
    plan = {"steps": []}
```

Outputs:

```json
{
  "steps": [
    "Update the requirements.txt file to include the updated dependencies specified in the peer's feedback, including the Google Cloud Translation API key.",
    "Update the app.py file to address the code smells and incorporate the suggested improvements, such as handling errors and exceptions, initializing the Translator object correctly, adding input validation, using type hints and docstrings, and removing unnecessary logging statements.",
    "Create a README.md file that describes the codebase and how to run it.",
    "Ensure the README.md file is up-to-date and accurate.",
    "Update the requirements.txt file to reflect any additional dependencies specified by the peer's feedback.",
    "Add documentation for each function in the app.py file using docstrings.",
    "Implement logging statements throughout the app.py file to monitor application execution.",
    "Test the API endpoint to ensure it correctly translates text from English to French and handles errors properly.",
    "Refactor the code to follow PEP 8 style guidelines and ensure consistency in naming conventions, indentation, and spacing.",
    "Create a new folder for logs and add a logging configuration file (e.g., logconfig.json) that specifies the logging level and output destination.",
    "Deploy the web server on a production environment (e.g., AWS Elastic Beanstalk or Google Cloud Platform) to make it accessible to external users."
  ]
}
```

</details>

## Sources

- Ollama api docs: https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-completion
- Ollama structured output docs: https://github.com/ollama/ollama/blob/main/docs/api.md#request-structured-outputs

## Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Ran pre-commit to handle lint / formatting issues.
- [x] Read the [contributor guideline](https://github.com/meta-llama/llama-stack/blob/main/CONTRIBUTING.md), Pull Request section?
- [ ] Updated relevant documentation.
- [x] Wrote necessary unit or integration tests.
# Llama Stack
Quick Start | Documentation | Zero-to-Hero Guide
Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.
Our goal is to provide pre-packaged implementations which can be operated in a variety of deployment environments: developers start iterating with Desktops or their mobile devices and can seamlessly transition to on-prem or public cloud deployments. At every point in this transition, the same set of APIs and the same developer experience is available.
⚠️ Note: The Stack APIs are rapidly improving but are still very much a work in progress. We invite feedback as well as direct contributions.
## APIs
We have working implementations of the following APIs today:
- Inference
- Safety
- Memory
- Agents
- Eval
- Telemetry
Alongside these APIs, we also provide related APIs for operating on associated resources (see Concepts):
- Models
- Shields
- Memory Banks
- Eval Tasks
- Datasets
- Scoring Functions
We are also working on the following APIs which will be released soon:
- Post Training
- Synthetic Data Generation
- Reward Scoring
Each API is itself a collection of REST endpoints.
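As a concrete illustration, here is a minimal sketch that touches both a resource API (Models) and a core API (Inference). It assumes a Llama Stack server running locally on port 5000, the `llama-stack-client` Python package, and a registered Llama 3.2 model; adjust these to your distribution.

```python
from llama_stack_client import LlamaStackClient

# Point the client at a locally running Llama Stack server (assumed port).
client = LlamaStackClient(base_url="http://localhost:5000")

# Models is a resource API: list what the running distribution has registered.
for model in client.models.list():
    print(model.identifier)

# Inference is a core API: run a chat completion against one of those models.
response = client.inference.chat_completion(
    model_id="meta-llama/Llama-3.2-3B-Instruct",  # assumption: use a model listed above
    messages=[{"role": "user", "content": "Say hello in French."}],
)
print(response.completion_message.content)
```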
## Philosophy

### Service-oriented design

Unlike other frameworks, Llama Stack is built with a service-oriented, REST-API-first approach. Such a design not only allows seamless transitions from local to remote deployments, but also forces the design to be more declarative. We believe this restriction can result in a much simpler, more robust developer experience. It necessarily trades off against expressivity, but if we get the APIs right, it can lead to a very powerful platform.
### Composability
We expect the set of APIs we design to be composable. An Agent abstractly depends on { Inference, Memory, Safety } APIs but does not care about the actual implementation details. Safety itself may require model inference and hence can depend on the Inference API.
### Turnkey one-stop solutions
We expect to provide turnkey solutions for popular deployment scenarios. It should be easy to deploy a Llama Stack server on AWS or on a private data center. Either of these should allow a developer to get started with powerful agentic apps, model evaluations or fine-tuning services in a matter of minutes. They should all result in the same uniform observability and developer experience.
### Focus on Llama models

As a Meta-initiated project, we have started by explicitly focusing on Meta's Llama series of models. Supporting the broad set of open models is no easy task, and we want to start with models we understand best.
### Supporting the Ecosystem
There is a vibrant ecosystem of Providers which provide efficient inference or scalable vector stores or powerful observability solutions. We want to make sure it is easy for developers to pick and choose the best implementations for their use cases. We also want to make sure it is easy for new Providers to onboard and participate in the ecosystem.
Additionally, we have designed every element of the Stack such that APIs as well as Resources (like Models) can be federated.
## Supported Llama Stack Implementations

### API Providers
API Provider Builder | Environments | Agents | Inference | Memory | Safety | Telemetry |
---|---|---|---|---|---|---|
Meta Reference | Single Node | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Cerebras | Hosted | | ✔️ | | | |
Fireworks | Hosted | ✔️ | ✔️ | ✔️ | | |
AWS Bedrock | Hosted | | ✔️ | | ✔️ | |
Together | Hosted | ✔️ | ✔️ | | ✔️ | |
Ollama | Single Node | | ✔️ | | | |
TGI | Hosted and Single Node | | ✔️ | | | |
NVIDIA NIM | Hosted and Single Node | | ✔️ | | | |
Chroma | Single Node | | | ✔️ | | |
PG Vector | Single Node | | | ✔️ | | |
PyTorch ExecuTorch | On-device iOS | ✔️ | ✔️ | | | |
vLLM | Hosted and Single Node | | ✔️ | | | |
### Distributions
Distribution | Llama Stack Docker | Start This Distribution |
---|---|---|
Meta Reference | llamastack/distribution-meta-reference-gpu | Guide |
Meta Reference Quantized | llamastack/distribution-meta-reference-quantized-gpu | Guide |
Cerebras | llamastack/distribution-cerebras | Guide |
Ollama | llamastack/distribution-ollama | Guide |
TGI | llamastack/distribution-tgi | Guide |
Together | llamastack/distribution-together | Guide |
Fireworks | llamastack/distribution-fireworks | Guide |
vLLM | llamastack/distribution-remote-vllm | Guide |
## Installation
You have two ways to install this repository:

- **Install as a package**: You can install the repository directly from PyPI by running the following command:

  ```bash
  pip install llama-stack
  ```

- **Install from source**: If you prefer to install from the source code, make sure you have conda installed. Then, follow these steps:

  ```bash
  mkdir -p ~/local
  cd ~/local
  git clone git@github.com:meta-llama/llama-stack.git
  conda create -n stack python=3.10
  conda activate stack
  cd llama-stack
  pip install -e .
  ```
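Either way, a quick sanity check that the install worked (a trivial sketch; run it in the environment where you installed `llama-stack`):

```python
# Verify the installation: the package should import cleanly and report its version.
from importlib.metadata import version

import llama_stack

print(llama_stack.__file__)    # where the package was installed
print(version("llama-stack"))  # installed version
```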
## Documentation
Please check out our Documentation page for more details.

- CLI reference
  - Guide to using the `llama` CLI to work with Llama models (download, study prompts) and to building/starting a Llama Stack distribution.
- Getting Started
  - Quick guide to start a Llama Stack server.
  - Jupyter notebook to walk through how to use simple text and vision inference `llama_stack_client` APIs.
  - The complete Llama Stack lesson Colab notebook of the new Llama 3.2 course on Deeplearning.ai.
  - A Zero-to-Hero Guide that guides you through all the key components of Llama Stack with code samples.
- Contributing
  - Adding a new API Provider: a walkthrough of how to add a new API provider.
## Llama Stack Client SDKs
Language | Client SDK | Package |
---|---|---|
Python | llama-stack-client-python | |
Swift | llama-stack-client-swift | |
Node | llama-stack-client-node | |
Kotlin | llama-stack-client-kotlin | |
Check out our client SDKs for connecting to a Llama Stack server in your preferred language. You can choose from Python, Node, Swift, and Kotlin to quickly build your applications.
You can find more example scripts with client SDKs to talk with the Llama Stack server in our llama-stack-apps repo.