Merge conflicts

This commit is contained in:
Chantal D Gama Rose 2024-12-03 19:00:27 -08:00
commit 5b027d2de5
198 changed files with 6140 additions and 3477 deletions

View file

@ -1,14 +0,0 @@
# API Providers
A Provider is what makes the API real -- they provide the actual implementation backing the API.
As an example, for Inference, we could have the implementation be backed by open source libraries like `[ torch | vLLM | TensorRT ]` as possible options.
A provider can also be just a pointer to a remote REST service -- for example, cloud providers or dedicated inference providers could serve these APIs.
```{toctree}
:maxdepth: 1
new_api_provider
memory_api
```

View file

@ -0,0 +1,15 @@
# Building Applications
```{admonition} Work in Progress
:class: warning
## What can you do with the Stack?
- Agents
- what is a turn? session?
- inference
- memory / RAG; pre-ingesting content or attaching content in a turn
- how does tool calling work
- can you do evaluation?
```

View file

@ -0,0 +1,64 @@
# Core Concepts
Given Llama Stack's service-oriented philosophy, a few concepts and workflows arise which may not feel completely natural in the LLM landscape, especially if you are coming with a background in other frameworks.
## APIs
A Llama Stack API is described as a collection of REST endpoints. We currently support the following APIs:
- **Inference**: run inference with a LLM
- **Safety**: apply safety policies to the output at a Systems (not only model) level
- **Agents**: run multi-step agentic workflows with LLMs with tool usage, memory (RAG), etc.
- **Memory**: store and retrieve data for RAG, chat history, etc.
- **DatasetIO**: interface with datasets and data loaders
- **Scoring**: evaluate outputs of the system
- **Eval**: generate outputs (via Inference or Agents) and perform scoring
- **Telemetry**: collect telemetry data from the system
We are working on adding a few more APIs to complete the application lifecycle. These will include:
- **Batch Inference**: run inference on a dataset of inputs
- **Batch Agents**: run agents on a dataset of inputs
- **Post Training**: fine-tune a Llama model
- **Synthetic Data Generation**: generate synthetic data for model development
## API Providers
The goal of Llama Stack is to build an ecosystem where users can easily swap out different implementations for the same API. Obvious examples for these include
- LLM inference providers (e.g., Fireworks, Together, AWS Bedrock, etc.),
- Vector databases (e.g., ChromaDB, Weaviate, Qdrant, etc.),
- Safety providers (e.g., Meta's Llama Guard, AWS Bedrock Guardrails, etc.)
Providers come in two flavors:
- **Remote**: the provider runs as a separate service external to the Llama Stack codebase. Llama Stack contains a small amount of adapter code.
- **Inline**: the provider is fully specified and implemented within the Llama Stack codebase. It may be a simple wrapper around an existing library, or a full fledged implementation within Llama Stack.
## Resources
Some of these APIs are associated with a set of **Resources**. Here is the mapping of APIs to resources:
- **Inference**, **Eval** and **Post Training** are associated with `Model` resources.
- **Safety** is associated with `Shield` resources.
- **Memory** is associated with `Memory Bank` resources.
- **DatasetIO** is associated with `Dataset` resources.
- **Scoring** is associated with `ScoringFunction` resources.
- **Eval** is associated with `Model` and `EvalTask` resources.
Furthermore, we allow these resources to be **federated** across multiple providers. For example, you may have some Llama models served by Fireworks while others are served by AWS Bedrock. Regardless, they will all work seamlessly with the same uniform Inference API provided by Llama Stack.
```{admonition} Registering Resources
:class: tip
Given this architecture, it is necessary for the Stack to know which provider to use for a given resource. This means you need to explicitly _register_ resources (including models) before you can use them with the associated APIs.
```
## Distributions
While there is a lot of flexibility to mix-and-match providers, often users will work with a specific set of providers (hardware support, contractual obligations, etc.) We therefore need to provide a _convenient shorthand_ for such collections. We call this shorthand a **Llama Stack Distribution** or a **Distro**. One can think of it as specific pre-packaged versions of the Llama Stack. Here are some examples:
**Remotely Hosted Distro**: These are the simplest to consume from a user perspective. You can simply obtain the API key for these providers, point to a URL and have _all_ Llama Stack APIs working out of the box. Currently, [Fireworks](https://fireworks.ai/) and [Together](https://together.xyz/) provide such easy-to-consume Llama Stack distributions.
**Locally Hosted Distro**: You may want to run Llama Stack on your own hardware. Typically though, you still need to use Inference via an external service. You can use providers like HuggingFace TGI, Cerebras, Fireworks, Together, etc. for this purpose. Or you may have access to GPUs and can run a [vLLM](https://github.com/vllm-project/vllm) instance. If you "just" have a regular desktop machine, you can use [Ollama](https://ollama.com/) for inference. To provide convenient quick access to these options, we provide a number of such pre-configured locally-hosted Distros.
**On-device Distro**: Finally, you may want to run Llama Stack directly on an edge device (mobile phone or a tablet.) We provide Distros for iOS and Android (coming soon.)

View file

@ -12,6 +12,8 @@
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
from docutils import nodes
project = "llama-stack"
copyright = "2024, Meta"
author = "Meta"
@ -25,10 +27,12 @@ extensions = [
"sphinx_copybutton",
"sphinx_tabs.tabs",
"sphinx_design",
"sphinxcontrib.redoc",
]
myst_enable_extensions = ["colon_fence"]
html_theme = "sphinx_rtd_theme"
html_use_relative_paths = True
# html_theme = "sphinx_pdj_theme"
# html_theme_path = [sphinx_pdj_theme.get_html_theme_path()]
@ -57,6 +61,10 @@ myst_enable_extensions = [
"tasklist",
]
myst_substitutions = {
"docker_hub": "https://hub.docker.com/repository/docker/llamastack",
}
# Copy button settings
copybutton_prompt_text = "$ " # for bash prompts
copybutton_prompt_is_regexp = True
@ -79,6 +87,43 @@ html_theme_options = {
}
html_static_path = ["../_static"]
html_logo = "../_static/llama-stack-logo.png"
# html_logo = "../_static/llama-stack-logo.png"
html_style = "../_static/css/my_theme.css"
redoc = [
{
"name": "Llama Stack API",
"page": "references/api_reference/index",
"spec": "../resources/llama-stack-spec.yaml",
"opts": {
"suppress-warnings": True,
# "expand-responses": ["200", "201"],
},
"embed": True,
},
]
redoc_uri = "https://cdn.redoc.ly/redoc/latest/bundles/redoc.standalone.js"
def setup(app):
def dockerhub_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
url = f"https://hub.docker.com/r/llamastack/{text}"
node = nodes.reference(rawtext, text, refuri=url, **options)
return [node], []
def repopath_role(name, rawtext, text, lineno, inliner, options={}, content=[]):
parts = text.split("::")
if len(parts) == 2:
link_text = parts[0]
url_path = parts[1]
else:
link_text = text
url_path = text
url = f"https://github.com/meta-llama/llama-stack/tree/main/{url_path}"
node = nodes.reference(rawtext, link_text, refuri=url, **options)
return [node], []
app.add_role("dockerhub", dockerhub_role)
app.add_role("repopath", repopath_role)

View file

@ -0,0 +1,9 @@
# Contributing to Llama Stack
```{toctree}
:maxdepth: 1
new_api_provider
memory_api
```

View file

@ -1,20 +1,19 @@
# Developer Guide: Adding a New API Provider
# Adding a New API Provider
This guide contains references to walk you through adding a new API provider.
### Adding a new API provider
1. First, decide which API your provider falls into (e.g. Inference, Safety, Agents, Memory).
2. Decide whether your provider is a remote provider, or inline implmentation. A remote provider is a provider that makes a remote request to an service. An inline provider is a provider where implementation is executed locally. Checkout the examples, and follow the structure to add your own API provider. Please find the following code pointers:
- [Remote Adapters](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/remote)
- [Inline Providers](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline)
- {repopath}`Remote Providers::llama_stack/providers/remote`
- {repopath}`Inline Providers::llama_stack/providers/inline`
3. [Build a Llama Stack distribution](https://llama-stack.readthedocs.io/en/latest/distribution_dev/building_distro.html) with your API provider.
3. [Build a Llama Stack distribution](https://llama-stack.readthedocs.io/en/latest/distributions/building_distro.html) with your API provider.
4. Test your code!
### Testing your newly added API providers
## Testing your newly added API providers
1. Start with an _integration test_ for your provider. That means we will instantiate the real provider, pass it real configuration and if it is a remote service, we will actually hit the remote service. We **strongly** discourage mocking for these tests at the provider level. Llama Stack is first and foremost about integration so we need to make sure stuff works end-to-end. See [llama_stack/providers/tests/inference/test_inference.py](../llama_stack/providers/tests/inference/test_inference.py) for an example.
1. Start with an _integration test_ for your provider. That means we will instantiate the real provider, pass it real configuration and if it is a remote service, we will actually hit the remote service. We **strongly** discourage mocking for these tests at the provider level. Llama Stack is first and foremost about integration so we need to make sure stuff works end-to-end. See {repopath}`llama_stack/providers/tests/inference/test_text_inference.py` for an example.
2. In addition, if you want to unit test functionality within your provider, feel free to do so. You can find some tests in `tests/` but they aren't well supported so far.
@ -22,5 +21,6 @@ This guide contains references to walk you through adding a new API provider.
You can find more complex client scripts [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main) repo. Note down which scripts works and do not work with your distribution.
### Submit your PR
## Submit your PR
After you have fully tested your newly added API provider, submit a PR with the attached test plan. You must have a Test Plan in the summary section of your PR.

View file

@ -0,0 +1,123 @@
# Evaluations
The Llama Stack Evaluation flow allows you to run evaluations on your GenAI application datasets or pre-registered benchmarks.
We introduce a set of APIs in Llama Stack for supporting running evaluations of LLM applications.
- `/datasetio` + `/datasets` API
- `/scoring` + `/scoring_functions` API
- `/eval` + `/eval_tasks` API
This guide goes over the sets of APIs and developer experience flow of using Llama Stack to run evaluations for different use cases.
## Evaluation Concepts
The Evaluation APIs are associated with a set of Resources as shown in the following diagram. Please visit the Resources section in our [Core Concepts](../concepts/index.md) guide for better high-level understanding.
![Eval Concepts](./resources/eval-concept.png)
- **DatasetIO**: defines interface with datasets and data loaders.
- Associated with `Dataset` resource.
- **Scoring**: evaluate outputs of the system.
- Associated with `ScoringFunction` resource. We provide a suite of out-of-the box scoring functions and also the ability for you to add custom evaluators. These scoring functions are the core part of defining an evaluation task to output evaluation metrics.
- **Eval**: generate outputs (via Inference or Agents) and perform scoring.
- Associated with `EvalTask` resource.
## Running Evaluations
Use the following decision tree to decide how to use LlamaStack Evaluation flow.
![Eval Flow](./resources/eval-flow.png)
```{admonition} Note on Benchmark v.s. Application Evaluation
:class: tip
- **Benchmark Evaluation** is a well-defined eval-task consisting of `dataset` and `scoring_function`. The generation (inference or agent) will be done as part of evaluation.
- **Application Evaluation** assumes users already have app inputs & generated outputs. Evaluation will purely focus on scoring the generated outputs via scoring functions (e.g. LLM-as-judge).
```
The following examples give the quick steps to start running evaluations using the llama-stack-client CLI.
#### Benchmark Evaluation CLI
Usage: There are 2 inputs necessary for running a benchmark eval
- `eval-task-id`: the identifier associated with the eval task. Each `EvalTask` is parametrized by
- `dataset_id`: the identifier associated with the dataset.
- `List[scoring_function_id]`: list of scoring function identifiers.
- `eval-task-config`: specifies the configuration of the model / agent to evaluate on.
```
llama-stack-client eval run_benchmark <eval-task-id> \
--eval-task-config ~/eval_task_config.json \
--visualize
```
#### Application Evaluation CLI
Usage: For running application evals, you will already have available datasets in hand from your application. You will need to specify:
- `scoring-fn-id`: List of ScoringFunction identifiers you wish to use to run on your application.
- `Dataset` used for evaluation:
- (1) `--dataset-path`: path to local file system containing datasets to run evaluation on
- (2) `--dataset-id`: pre-registered dataset in Llama Stack
- (Optional) `--scoring-params-config`: optionally parameterize scoring functions with custom params (e.g. `judge_prompt`, `judge_model`, `parsing_regexes`).
```
llama-stack-client eval run_scoring <scoring_fn_id_1> <scoring_fn_id_2> ... <scoring_fn_id_n>
--dataset-path <path-to-local-dataset> \
--output-dir ./
```
#### Defining EvalTaskConfig
The `EvalTaskConfig` are user specified config to define:
1. `EvalCandidate` to run generation on:
- `ModelCandidate`: The model will be used for generation through LlamaStack /inference API.
- `AgentCandidate`: The agentic system specified by AgentConfig will be used for generation through LlamaStack /agents API.
2. Optionally scoring function params to allow customization of scoring function behaviour. This is useful to parameterize generic scoring functions such as LLMAsJudge with custom `judge_model` / `judge_prompt`.
**Example Benchmark EvalTaskConfig**
```json
{
"type": "benchmark",
"eval_candidate": {
"type": "model",
"model": "Llama3.2-3B-Instruct",
"sampling_params": {
"strategy": "greedy",
"temperature": 0,
"top_p": 0.95,
"top_k": 0,
"max_tokens": 0,
"repetition_penalty": 1.0
}
}
}
```
**Example Application EvalTaskConfig**
```json
{
"type": "app",
"eval_candidate": {
"type": "model",
"model": "Llama3.1-405B-Instruct",
"sampling_params": {
"strategy": "greedy",
"temperature": 0,
"top_p": 0.95,
"top_k": 0,
"max_tokens": 0,
"repetition_penalty": 1.0
}
},
"scoring_params": {
"llm-as-judge::llm_as_judge_base": {
"type": "llm_as_judge",
"judge_model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt_template": "Your job is to look at a question, a gold target ........",
"judge_score_regexes": [
"(A|B|C)"
]
}
}
}
```

View file

@ -0,0 +1,9 @@
# Cookbooks
- [Evaluations Flow](evals.md)
```{toctree}
:maxdepth: 2
:hidden:
evals.md
```

Binary file not shown.

After

Width:  |  Height:  |  Size: 68 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 249 KiB

View file

@ -1,20 +0,0 @@
# Developer Guide
```{toctree}
:hidden:
:maxdepth: 1
building_distro
```
## Key Concepts
### API Provider
A Provider is what makes the API real -- they provide the actual implementation backing the API.
As an example, for Inference, we could have the implementation be backed by open source libraries like `[ torch | vLLM | TensorRT ]` as possible options.
A provider can also be just a pointer to a remote REST service -- for example, cloud providers or dedicated inference providers could serve these APIs.
### Distribution
A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers -- some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well always using the same uniform set of APIs for developing Generative AI applications.

View file

@ -1,15 +1,22 @@
# Developer Guide: Assemble a Llama Stack Distribution
# Build your own Distribution
This guide will walk you through the steps to get started with building a Llama Stack distributiom from scratch with your choice of API providers. Please see the [Getting Started Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html) if you just want the basic steps to start a Llama Stack distribution.
This guide will walk you through the steps to get started with building a Llama Stack distribution from scratch with your choice of API providers.
## Step 1. Build
### Llama Stack Build Options
## Llama Stack Build
In order to build your own distribution, we recommend you clone the `llama-stack` repository.
```
git clone git@github.com:meta-llama/llama-stack.git
cd llama-stack
pip install -e .
llama stack build -h
```
We will start build our distribution (in the form of a Conda environment, or Docker image). In this step, we will specify:
- `name`: the name for our distribution (e.g. `my-stack`)
- `image_type`: our build image type (`conda | docker`)
@ -240,7 +247,7 @@ After this step is successful, you should be able to find the built docker image
::::
## Step 2. Run
## Running your Stack server
Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack build` step.
```
@ -250,11 +257,6 @@ llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-
```
$ llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
Loaded model...
Serving API datasets
GET /datasets/get
GET /datasets/list
POST /datasets/register
Serving API inspect
GET /health
GET /providers/list
@ -263,41 +265,7 @@ Serving API inference
POST /inference/chat_completion
POST /inference/completion
POST /inference/embeddings
Serving API scoring_functions
GET /scoring_functions/get
GET /scoring_functions/list
POST /scoring_functions/register
Serving API scoring
POST /scoring/score
POST /scoring/score_batch
Serving API memory_banks
GET /memory_banks/get
GET /memory_banks/list
POST /memory_banks/register
Serving API memory
POST /memory/insert
POST /memory/query
Serving API safety
POST /safety/run_shield
Serving API eval
POST /eval/evaluate
POST /eval/evaluate_batch
POST /eval/job/cancel
GET /eval/job/result
GET /eval/job/status
Serving API shields
GET /shields/get
GET /shields/list
POST /shields/register
Serving API datasetio
GET /datasetio/get_rows_paginated
Serving API telemetry
GET /telemetry/get_trace
POST /telemetry/log_event
Serving API models
GET /models/get
GET /models/list
POST /models/register
...
Serving API agents
POST /agents/create
POST /agents/session/create
@ -316,8 +284,6 @@ INFO: Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit
INFO: 2401:db00:35c:2d2b:face:0:c9:0:54678 - "GET /models/list HTTP/1.1" 200 OK
```
> [!IMPORTANT]
> The "local" distribution inference server currently only supports CUDA. It will not work on Apple Silicon machines.
### Troubleshooting
> [!TIP]
> You might need to use the flag `--disable-ipv6` to Disable IPv6 support
If you encounter any issues, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file an new issue.

View file

@ -0,0 +1,164 @@
# Configuring a Stack
The Llama Stack runtime configuration is specified as a YAML file. Here is a simplied version of an example configuration file for the Ollama distribution:
```{dropdown} Sample Configuration File
```yaml
version: 2
conda_env: ollama
apis:
- agents
- inference
- memory
- safety
- telemetry
providers:
inference:
- provider_id: ollama
provider_type: remote::ollama
config:
url: ${env.OLLAMA_URL:http://localhost:11434}
memory:
- provider_id: faiss
provider_type: inline::faiss
config:
kvstore:
type: sqlite
namespace: null
db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/faiss_store.db
safety:
- provider_id: llama-guard
provider_type: inline::llama-guard
config: {}
agents:
- provider_id: meta-reference
provider_type: inline::meta-reference
config:
persistence_store:
type: sqlite
namespace: null
db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/agents_store.db
telemetry:
- provider_id: meta-reference
provider_type: inline::meta-reference
config: {}
metadata_store:
namespace: null
type: sqlite
db_path: ${env.SQLITE_STORE_DIR:~/.llama/distributions/ollama}/registry.db
models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL}
provider_id: ollama
provider_model_id: null
shields: []
```
Let's break this down into the different sections. The first section specifies the set of APIs that the stack server will serve:
```yaml
apis:
- agents
- inference
- memory
- safety
- telemetry
```
## Providers
Next up is the most critical part: the set of providers that the stack will use to serve the above APIs. Consider the `inference` API:
```yaml
providers:
inference:
- provider_id: ollama
provider_type: remote::ollama
config:
url: ${env.OLLAMA_URL:http://localhost:11434}
```
A few things to note:
- A _provider instance_ is identified with an (identifier, type, configuration) tuple. The identifier is a string you can choose freely.
- You can instantiate any number of provider instances of the same type.
- The configuration dictionary is provider-specific. Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
## Resources
Finally, let's look at the `models` section:
```yaml
models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL}
provider_id: ollama
provider_model_id: null
```
A Model is an instance of a "Resource" (see [Concepts](../concepts/index)) and is associated with a specific inference provider (in this case, the provider with identifier `ollama`). This is an instance of a "pre-registered" model. While we always encourage the clients to always register models before using them, some Stack servers may come up a list of "already known and available" models.
What's with the `provider_model_id` field? This is an identifier for the model inside the provider's model catalog. Contrast it with `model_id` which is the identifier for the same model for Llama Stack's purposes. For example, you may want to name "llama3.2:vision-11b" as "image_captioning_model" when you use it in your Stack interactions. When omitted, the server will set `provider_model_id` to be the same as `model_id`.
## Extending to handle Safety
Configuring Safety can be a little involved so it is instructive to go through an example.
The Safety API works with the associated Resource called a `Shield`. Providers can support various kinds of Shields. Good examples include the [Llama Guard](https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/) system-safety models, or [Bedrock Guardrails](https://aws.amazon.com/bedrock/guardrails/).
To configure a Bedrock Shield, you would need to add:
- A Safety API provider instance with type `remote::bedrock`
- A Shield resource served by this provider.
```yaml
...
providers:
safety:
- provider_id: bedrock
provider_type: remote::bedrock
config:
aws_access_key_id: ${env.AWS_ACCESS_KEY_ID}
aws_secret_access_key: ${env.AWS_SECRET_ACCESS_KEY}
...
shields:
- provider_id: bedrock
params:
guardrailVersion: ${env.GUARDRAIL_VERSION}
provider_shield_id: ${env.GUARDRAIL_ID}
...
```
The situation is more involved if the Shield needs _Inference_ of an associated model. This is the case with Llama Guard. In that case, you would need to add:
- A Safety API provider instance with type `inline::llama-guard`
- An Inference API provider instance for serving the model.
- A Model resource associated with this provider.
- A Shield resource served by the Safety provider.
The yaml configuration for this setup, assuming you were using vLLM as your inference server, would look like:
```yaml
...
providers:
safety:
- provider_id: llama-guard
provider_type: inline::llama-guard
config: {}
inference:
# this vLLM server serves the "normal" inference model (e.g., llama3.2:3b)
- provider_id: vllm-0
provider_type: remote::vllm
config:
url: ${env.VLLM_URL:http://localhost:8000}
# this vLLM server serves the llama-guard model (e.g., llama-guard:3b)
- provider_id: vllm-1
provider_type: remote::vllm
config:
url: ${env.SAFETY_VLLM_URL:http://localhost:8001}
...
models:
- metadata: {}
model_id: ${env.INFERENCE_MODEL}
provider_id: vllm-0
provider_model_id: null
- metadata: {}
model_id: ${env.SAFETY_MODEL}
provider_id: vllm-1
provider_model_id: null
shields:
- provider_id: llama-guard
shield_id: ${env.SAFETY_MODEL} # Llama Guard shields are identified by the corresponding LlamaGuard model
provider_shield_id: null
...
```

View file

@ -0,0 +1,36 @@
# Using Llama Stack as a Library
If you are planning to use an external service for Inference (even Ollama or TGI counts as external), it is often easier to use Llama Stack as a library. This avoids the overhead of setting up a server. For [example](https://github.com/meta-llama/llama-stack-client-python/blob/main/src/llama_stack_client/lib/direct/test.py):
```python
from llama_stack_client.lib.direct.direct import LlamaStackDirectClient
client = await LlamaStackDirectClient.from_template('ollama')
await client.initialize()
```
This will parse your config and set up any inline implementations and remote clients needed for your implementation.
Then, you can access the APIs like `models` and `inference` on the client and call their methods directly:
```python
response = await client.models.list()
print(response)
```
```python
response = await client.inference.chat_completion(
messages=[UserMessage(content="What is the capital of France?", role="user")],
model="Llama3.1-8B-Instruct",
stream=False,
)
print("\nChat completion response:")
print(response)
```
If you've created a [custom distribution](https://llama-stack.readthedocs.io/en/latest/distributions/building_distro.html), you can also use the run.yaml configuration file directly:
```python
client = await LlamaStackDirectClient.from_config(config_path)
await client.initialize()
```

View file

@ -0,0 +1,40 @@
# Starting a Llama Stack
```{toctree}
:maxdepth: 3
:hidden:
importing_as_library
building_distro
configuration
```
<!-- self_hosted_distro/index -->
<!-- remote_hosted_distro/index -->
<!-- ondevice_distro/index -->
You can instantiate a Llama Stack in one of the following ways:
- **As a Library**: this is the simplest, especially if you are using an external inference service. See [Using Llama Stack as a Library](importing_as_library)
- **Docker**: we provide a number of pre-built Docker containers so you can start a Llama Stack server instantly. You can also build your own custom Docker container.
- **Conda**: finally, you can build a custom Llama Stack server using `llama stack build` containing the exact combination of providers you wish. We have provided various templates to make getting started easier.
Which templates / distributions to choose depends on the hardware you have for running LLM inference.
- **Do you have access to a machine with powerful GPUs?**
If so, we suggest:
- {dockerhub}`distribution-remote-vllm` ([Guide](self_hosted_distro/remote-vllm))
- {dockerhub}`distribution-meta-reference-gpu` ([Guide](self_hosted_distro/meta-reference-gpu))
- {dockerhub}`distribution-tgi` ([Guide](self_hosted_distro/tgi))
- **Are you running on a "regular" desktop machine?**
If so, we suggest:
- {dockerhub}`distribution-ollama` ([Guide](self_hosted_distro/ollama))
- **Do you have an API key for a remote inference provider like Fireworks, Together, etc.?** If so, we suggest:
- {dockerhub}`distribution-together` ([Guide](remote_hosted_distro/index))
- {dockerhub}`distribution-fireworks` ([Guide](remote_hosted_distro/index))
- **Do you want to run Llama Stack inference on your iOS / Android device** If so, we suggest:
- [iOS SDK](ondevice_distro/ios_sdk)
- Android (coming soon)
You can also build your own [custom distribution](building_distro).

View file

@ -1,3 +1,6 @@
---
orphan: true
---
# iOS SDK
We offer both remote and on-device use of Llama Stack in Swift via two components:
@ -5,7 +8,7 @@ We offer both remote and on-device use of Llama Stack in Swift via two component
1. [llama-stack-client-swift](https://github.com/meta-llama/llama-stack-client-swift/)
2. [LocalInferenceImpl](https://github.com/meta-llama/llama-stack/tree/main/llama_stack/providers/inline/ios/inference)
```{image} ../../../../_static/remote_or_local.gif
```{image} ../../../_static/remote_or_local.gif
:alt: Seamlessly switching between local, on-device inference and remote hosted inference
:width: 412px
:align: center

View file

@ -1,4 +1,7 @@
# Remote-Hosted Distribution
---
orphan: true
---
# Remote-Hosted Distributions
Remote-Hosted distributions are available endpoints serving Llama Stack API that you can directly connect to.

View file

@ -0,0 +1,67 @@
---
orphan: true
---
# Bedrock Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-bedrock` distribution consists of the following provider configurations:
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| inference | `remote::bedrock` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `remote::bedrock` |
| telemetry | `inline::meta-reference` |
### Environment Variables
The following environment variables can be configured:
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
### Prerequisite: API Keys
Make sure you have access to a AWS Bedrock API Key. You can get one by visiting [AWS Bedrock](https://aws.amazon.com/bedrock/).
## Running Llama Stack with AWS Bedrock
You can do this via Conda (build code) or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
```bash
LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
llamastack/distribution-bedrock \
--port $LLAMA_STACK_PORT \
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
--env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN
```
### Via Conda
```bash
llama stack build --template bedrock --image-type conda
llama stack run ./run.yaml \
--port $LLAMA_STACK_PORT \
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
--env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN
```

View file

@ -1,5 +1,15 @@
---
orphan: true
---
# Dell-TGI Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-tgi` distribution consists of the following provider configurations.

View file

@ -1,5 +1,15 @@
---
orphan: true
---
# Fireworks Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-fireworks` distribution consists of the following provider configurations.
| API | Provider(s) |
@ -51,9 +61,7 @@ LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-fireworks \
--yaml-config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env FIREWORKS_API_KEY=$FIREWORKS_API_KEY
```
@ -63,6 +71,6 @@ docker run \
```bash
llama stack build --template fireworks --image-type conda
llama stack run ./run.yaml \
--port 5001 \
--port $LLAMA_STACK_PORT \
--env FIREWORKS_API_KEY=$FIREWORKS_API_KEY
```

View file

@ -1,5 +1,15 @@
---
orphan: true
---
# Meta Reference Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-meta-reference-gpu` distribution consists of the following provider configurations:
| API | Provider(s) |
@ -26,7 +36,7 @@ The following environment variables can be configured:
## Prerequisite: Downloading Models
Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) here to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/references/llama_cli_reference/download_models.html) here to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
```
$ ls ~/.llama/checkpoints
@ -47,9 +57,7 @@ LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-meta-reference-gpu \
/root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
@ -60,9 +68,7 @@ If you are using Llama Stack Safety / Shield APIs, use:
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run-with-safety.yaml:/root/my-run.yaml \
llamastack/distribution-meta-reference-gpu \
/root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
@ -74,7 +80,7 @@ Make sure you have done `pip install llama-stack` and have the Llama Stack CLI a
```bash
llama stack build --template meta-reference-gpu --image-type conda
llama stack run ./run.yaml \
llama stack run distributions/meta-reference-gpu/run.yaml \
--port 5001 \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
@ -82,7 +88,7 @@ llama stack run ./run.yaml \
If you are using Llama Stack Safety / Shield APIs, use:
```bash
llama stack run ./run-with-safety.yaml \
llama stack run distributions/meta-reference-gpu/run-with-safety.yaml \
--port 5001 \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B

View file

@ -0,0 +1,95 @@
---
orphan: true
---
# Meta Reference Quantized Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations:
| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| inference | `inline::meta-reference-quantized` |
| memory | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
| safety | `inline::llama-guard` |
| telemetry | `inline::meta-reference` |
The only difference vs. the `meta-reference-gpu` distribution is that it has support for more efficient inference -- with fp8, int4 quantization, etc.
Note that you need access to nvidia GPUs to run this distribution. This distribution is not compatible with CPU-only machines or machines with AMD GPUs.
### Environment Variables
The following environment variables can be configured:
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `5001`)
- `INFERENCE_MODEL`: Inference model loaded into the Meta Reference server (default: `meta-llama/Llama-3.2-3B-Instruct`)
- `INFERENCE_CHECKPOINT_DIR`: Directory containing the Meta Reference model checkpoint (default: `null`)
## Prerequisite: Downloading Models
Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/references/llama_cli_reference/download_models.html) here to download the models. Run `llama model list` to see the available models to download, and `llama model download` to download the checkpoints.
```
$ ls ~/.llama/checkpoints
Llama3.1-8B Llama3.2-11B-Vision-Instruct Llama3.2-1B-Instruct Llama3.2-90B-Vision-Instruct Llama-Guard-3-8B
Llama3.1-8B-Instruct Llama3.2-1B Llama3.2-3B-Instruct Llama-Guard-3-1B Prompt-Guard-86M
```
## Running the Distribution
You can do this via Conda (build code) or Docker which has a pre-built image.
### Via Docker
This method allows you to get started quickly without having to build the distribution code.
```bash
LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
llamastack/distribution-meta-reference-quantized-gpu \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
If you are using Llama Stack Safety / Shield APIs, use:
```bash
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
llamastack/distribution-meta-reference-quantized-gpu \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```
### Via Conda
Make sure you have done `pip install llama-stack` and have the Llama Stack CLI available.
```bash
llama stack build --template meta-reference-quantized-gpu --image-type conda
llama stack run distributions/meta-reference-quantized-gpu/run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
If you are using Llama Stack Safety / Shield APIs, use:
```bash
llama stack run distributions/meta-reference-quantized-gpu/run-with-safety.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
--env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
```

View file

@ -47,7 +47,7 @@ docker run \
llamastack/distribution-nvidia \
--yaml-config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env FIREWORKS_API_KEY=$FIREWORKS_API_KEY
--env NVIDIA_API_KEY=$NVIDIA_API_KEY
```
### Via Conda
@ -56,5 +56,5 @@ docker run \
llama stack build --template fireworks --image-type conda
llama stack run ./run.yaml \
--port 5001 \
--env FIREWORKS_API_KEY=$FIREWORKS_API_KEY
--env NVIDIA_API_KEY=$NVIDIA_API_KEY
```

View file

@ -1,5 +1,15 @@
---
orphan: true
---
# Ollama Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-ollama` distribution consists of the following provider configurations.
| API | Provider(s) |
@ -59,9 +69,7 @@ docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-ollama \
--yaml-config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
@ -110,9 +118,9 @@ llama stack run ./run-with-safety.yaml \
### (Optional) Update Model Serving Configuration
> [!NOTE]
> Please check the [OLLAMA_SUPPORTED_MODELS](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers.remote/inference/ollama/ollama.py) for the supported Ollama models.
```{note}
Please check the [model_aliases](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/ollama/ollama.py#L45) variable for supported Ollama models.
```
To serve a new model with `ollama`
```bash

View file

@ -1,4 +1,13 @@
---
orphan: true
---
# Remote vLLM Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-remote-vllm` distribution consists of the following provider configurations:

View file

@ -1,5 +1,16 @@
---
orphan: true
---
# TGI Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
| API | Provider(s) |
@ -78,9 +89,7 @@ LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-tgi \
--yaml-config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env TGI_URL=http://host.docker.internal:$INFERENCE_PORT
@ -109,18 +118,18 @@ Make sure you have done `pip install llama-stack` and have the Llama Stack CLI a
```bash
llama stack build --template tgi --image-type conda
llama stack run ./run.yaml
--port 5001
--env INFERENCE_MODEL=$INFERENCE_MODEL
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env TGI_URL=http://127.0.0.1:$INFERENCE_PORT
```
If you are using Llama Stack Safety / Shield APIs, use:
```bash
llama stack run ./run-with-safety.yaml
--port 5001
--env INFERENCE_MODEL=$INFERENCE_MODEL
--env TGI_URL=http://127.0.0.1:$INFERENCE_PORT
--env SAFETY_MODEL=$SAFETY_MODEL
llama stack run ./run-with-safety.yaml \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env TGI_URL=http://127.0.0.1:$INFERENCE_PORT \
--env SAFETY_MODEL=$SAFETY_MODEL \
--env TGI_SAFETY_URL=http://127.0.0.1:$SAFETY_PORT
```

View file

@ -1,4 +1,14 @@
# Fireworks Distribution
---
orphan: true
---
# Together Distribution
```{toctree}
:maxdepth: 2
:hidden:
self
```
The `llamastack/distribution-together` distribution consists of the following provider configurations.
@ -50,9 +60,7 @@ LLAMA_STACK_PORT=5001
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ./run.yaml:/root/my-run.yaml \
llamastack/distribution-together \
--yaml-config /root/my-run.yaml \
--port $LLAMA_STACK_PORT \
--env TOGETHER_API_KEY=$TOGETHER_API_KEY
```
@ -62,6 +70,6 @@ docker run \
```bash
llama stack build --template together --image-type conda
llama stack run ./run.yaml \
--port 5001 \
--port $LLAMA_STACK_PORT \
--env TOGETHER_API_KEY=$TOGETHER_API_KEY
```

View file

@ -1,41 +0,0 @@
# Llama Stack Developer Cookbook
Based on your developer needs, below are references to guides to help you get started.
### Hosted Llama Stack Endpoint
* Developer Need: I want to connect to a Llama Stack endpoint to build my applications.
* Effort: 1min
* Guide:
- Checkout our [DeepLearning course](https://www.deeplearning.ai/short-courses/introducing-multimodal-llama-3-2) on building with Llama Stack apps on pre-hosted Llama Stack endpoint.
### Local meta-reference Llama Stack Server
* Developer Need: I want to start a local Llama Stack server with my GPU using meta-reference implementations.
* Effort: 5min
* Guide:
- Please see our [meta-reference-gpu](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/meta-reference-gpu.html) on starting up a meta-reference Llama Stack server.
### Llama Stack Server with Remote Providers
* Developer need: I want a Llama Stack distribution with a remote provider.
* Effort: 10min
* Guide
- Please see our [Distributions Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/index.html) on starting up distributions with remote providers.
### On-Device (iOS) Llama Stack
* Developer Need: I want to use Llama Stack on-Device
* Effort: 1.5hr
* Guide:
- Please see our [iOS Llama Stack SDK](./ios_sdk.md) implementations
### Assemble your own Llama Stack Distribution
* Developer Need: I want to assemble my own distribution with API providers to my likings
* Effort: 30min
* Guide
- Please see our [Building Distribution](./building_distro.md) guide for assembling your own Llama Stack distribution with your choice of API providers.
### Adding a New API Provider
* Developer Need: I want to add a new API provider to Llama Stack.
* Effort: 3hr
* Guide
- Please see our [Adding a New API Provider](https://llama-stack.readthedocs.io/en/latest/api_providers/new_api_provider.html) guide for adding a new API provider.

View file

@ -1,9 +0,0 @@
# On-Device Distribution
On-device distributions are Llama Stack distributions that run locally on your iOS / Android device.
```{toctree}
:maxdepth: 1
ios_sdk
```

View file

@ -1,58 +0,0 @@
# Bedrock Distribution
### Connect to a Llama Stack Bedrock Endpoint
- You may connect to Amazon Bedrock APIs for running LLM inference
The `llamastack/distribution-bedrock` distribution consists of the following provider configurations.
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |--------------- |---------------- |---------------- |---------------- |---------------- |
| **Provider(s)** | remote::bedrock | meta-reference | meta-reference | remote::bedrock | meta-reference |
### Docker: Start the Distribution (Single Node CPU)
> [!NOTE]
> This assumes you have valid AWS credentials configured with access to Amazon Bedrock.
```
$ cd distributions/bedrock && docker compose up
```
Make sure in your `run.yaml` file, your inference provider is pointing to the correct AWS configuration. E.g.
```
inference:
- provider_id: bedrock0
provider_type: remote::bedrock
config:
aws_access_key_id: <AWS_ACCESS_KEY_ID>
aws_secret_access_key: <AWS_SECRET_ACCESS_KEY>
aws_session_token: <AWS_SESSION_TOKEN>
region_name: <AWS_REGION>
```
### Conda llama stack run (Single Node CPU)
```bash
llama stack build --template bedrock --image-type conda
# -- modify run.yaml with valid AWS credentials
llama stack run ./run.yaml
```
### (Optional) Update Model Serving Configuration
Use `llama-stack-client models list` to check the available models served by Amazon Bedrock.
```
$ llama-stack-client models list
+------------------------------+------------------------------+---------------+------------+
| identifier | llama_model | provider_id | metadata |
+==============================+==============================+===============+============+
| Llama3.1-8B-Instruct | meta.llama3-1-8b-instruct-v1:0 | bedrock0 | {} |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-70B-Instruct | meta.llama3-1-70b-instruct-v1:0 | bedrock0 | {} |
+------------------------------+------------------------------+---------------+------------+
| Llama3.1-405B-Instruct | meta.llama3-1-405b-instruct-v1:0 | bedrock0 | {} |
+------------------------------+------------------------------+---------------+------------+
```

View file

@ -1,28 +0,0 @@
# Self-Hosted Distribution
We offer deployable distributions where you can host your own Llama Stack server using local inference.
| **Distribution** | **Llama Stack Docker** | Start This Distribution | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|:----------------: |:------------------------------------------: |:-----------------------: |:------------------: |:------------------: |:------------------: |:------------------: |:------------------: |
| Meta Reference | [llamastack/distribution-meta-reference-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-gpu/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html) | meta-reference | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Meta Reference Quantized | [llamastack/distribution-meta-reference-quantized-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-quantized-gpu/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-quantized-gpu.html) | meta-reference-quantized | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Ollama | [llamastack/distribution-ollama](https://hub.docker.com/repository/docker/llamastack/distribution-ollama/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html) | remote::ollama | meta-reference | remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| TGI | [llamastack/distribution-tgi](https://hub.docker.com/repository/docker/llamastack/distribution-tgi/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/tgi.html) | remote::tgi | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Together | [llamastack/distribution-together](https://hub.docker.com/repository/docker/llamastack/distribution-together/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/together.html) | remote::together | meta-reference | remote::weaviate | meta-reference | meta-reference |
| Fireworks | [llamastack/distribution-fireworks](https://hub.docker.com/repository/docker/llamastack/distribution-fireworks/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/fireworks.html) | remote::fireworks | meta-reference | remote::weaviate | meta-reference | meta-reference |
| Bedrock | [llamastack/distribution-bedrock](https://hub.docker.com/repository/docker/llamastack/distribution-bedrock/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/bedrock.html) | remote::bedrock | meta-reference | remote::weaviate | meta-reference | meta-reference |
```{toctree}
:maxdepth: 1
meta-reference-gpu
meta-reference-quantized-gpu
ollama
tgi
dell-tgi
together
fireworks
remote-vllm
bedrock
```

View file

@ -1,54 +0,0 @@
# Meta Reference Quantized Distribution
The `llamastack/distribution-meta-reference-quantized-gpu` distribution consists of the following provider configurations.
| **API** | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|----------------- |------------------------ |---------------- |-------------------------------------------------- |---------------- |---------------- |
| **Provider(s)** | meta-reference-quantized | meta-reference | meta-reference, remote::pgvector, remote::chroma | meta-reference | meta-reference |
The only difference vs. the `meta-reference-gpu` distribution is that it has support for more efficient inference -- with fp8, int4 quantization, etc.
### Step 0. Prerequisite - Downloading Models
Please make sure you have llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) here to download the models.
```
$ ls ~/.llama/checkpoints
Llama3.2-3B-Instruct:int4-qlora-eo8
```
### Step 1. Start the Distribution
#### (Option 1) Start with Docker
```
$ cd distributions/meta-reference-quantized-gpu && docker compose up
```
> [!NOTE]
> This assumes you have access to GPU to start a local server with access to your GPU.
> [!NOTE]
> `~/.llama` should be the path containing downloaded weights of Llama models.
This will download and start running a pre-built docker container. Alternatively, you may use the following commands:
```
docker run -it -p 5000:5000 -v ~/.llama:/root/.llama -v ./run.yaml:/root/my-run.yaml --gpus=all distribution-meta-reference-quantized-gpu --yaml_config /root/my-run.yaml
```
#### (Option 2) Start with Conda
1. Install the `llama` CLI. See [CLI Reference](https://llama-stack.readthedocs.io/en/latest/cli_reference/index.html)
2. Build the `meta-reference-quantized-gpu` distribution
```
$ llama stack build --template meta-reference-quantized-gpu --image-type conda
```
3. Start running distribution
```
$ cd distributions/meta-reference-quantized-gpu
$ llama stack run ./run.yaml
```

View file

@ -1,194 +1,155 @@
# Getting Started
# Quick Start
```{toctree}
:maxdepth: 2
:hidden:
In this guide, we'll through how you can use the Llama Stack client SDK to build a simple RAG agent.
distributions/self_hosted_distro/index
distributions/remote_hosted_distro/index
distributions/ondevice_distro/index
```
The most critical requirement for running the agent is running inference on the underlying Llama model. Depending on what hardware (GPUs) you have available, you have various options. We will use `Ollama` for this purpose as it is the easiest to get started with and yet robust.
At the end of the guide, you will have learned how to:
- get a Llama Stack server up and running
- set up an agent (with tool-calling and vector stores) that works with the above server
To see more example apps built using Llama Stack, see [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main).
## Step 1. Starting Up Llama Stack Server
### Decide Your Build Type
There are two ways to start a Llama Stack:
- **Docker**: we provide a number of pre-built Docker containers allowing you to get started instantly. If you are focused on application development, we recommend this option.
- **Conda**: the `llama` CLI provides a simple set of commands to build, configure and run a Llama Stack server containing the exact combination of providers you wish. We have provided various templates to make getting started easier.
Both of these provide options to run model inference using our reference implementations, Ollama, TGI, vLLM or even remote providers like Fireworks, Together, Bedrock, etc.
### Decide Your Inference Provider
Running inference on the underlying Llama model is one of the most critical requirements. Depending on what hardware you have available, you have various options. Note that each option have different necessary prerequisites.
- **Do you have access to a machine with powerful GPUs?**
If so, we suggest:
- [distribution-meta-reference-gpu](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html)
- [distribution-tgi](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/tgi.html)
- **Are you running on a "regular" desktop machine?**
If so, we suggest:
- [distribution-ollama](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html)
- **Do you have an API key for a remote inference provider like Fireworks, Together, etc.?** If so, we suggest:
- [distribution-together](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/remote_hosted_distro/together.html)
- [distribution-fireworks](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/remote_hosted_distro/fireworks.html)
- **Do you want to run Llama Stack inference on your iOS / Android device** If so, we suggest:
- [iOS](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/ondevice_distro/ios_sdk.html)
- [Android](https://github.com/meta-llama/llama-stack-client-kotlin) (coming soon)
Please see our pages in detail for the types of distributions we offer:
1. [Self-Hosted Distribution](./distributions/self_hosted_distro/index.md): If you want to run Llama Stack inference on your local machine.
2. [Remote-Hosted Distribution](./distributions/remote_hosted_distro/index.md): If you want to connect to a remote hosted inference provider.
3. [On-device Distribution](./distributions/ondevice_distro/index.md): If you want to run Llama Stack inference on your iOS / Android device.
### Table of Contents
Once you have decided on the inference provider and distribution to use, use the following guides to get started.
##### 1.0 Prerequisite
```
$ git clone git@github.com:meta-llama/llama-stack.git
```
::::{tab-set}
:::{tab-item} meta-reference-gpu
##### System Requirements
Access to Single-Node GPU to start a local server.
##### Downloading Models
Please make sure you have Llama model checkpoints downloaded in `~/.llama` before proceeding. See [installation guide](https://llama-stack.readthedocs.io/en/latest/cli_reference/download_models.html) here to download the models.
```
$ ls ~/.llama/checkpoints
Llama3.1-8B Llama3.2-11B-Vision-Instruct Llama3.2-1B-Instruct Llama3.2-90B-Vision-Instruct Llama-Guard-3-8B
Llama3.1-8B-Instruct Llama3.2-1B Llama3.2-3B-Instruct Llama-Guard-3-1B Prompt-Guard-86M
```
:::
:::{tab-item} vLLM
##### System Requirements
Access to Single-Node GPU to start a vLLM server.
:::
:::{tab-item} tgi
##### System Requirements
Access to Single-Node GPU to start a TGI server.
:::
:::{tab-item} ollama
##### System Requirements
Access to Single-Node CPU/GPU able to run ollama.
:::
:::{tab-item} together
##### System Requirements
Access to Single-Node CPU with Together hosted endpoint via API_KEY from [together.ai](https://api.together.xyz/signin).
:::
:::{tab-item} fireworks
##### System Requirements
Access to Single-Node CPU with Fireworks hosted endpoint via API_KEY from [fireworks.ai](https://fireworks.ai/).
:::
::::
##### 1.1. Start the distribution
::::{tab-set}
:::{tab-item} meta-reference-gpu
- [Start Meta Reference GPU Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html)
:::
:::{tab-item} vLLM
- [Start vLLM Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/remote-vllm.html)
:::
:::{tab-item} tgi
- [Start TGI Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/tgi.html)
:::
:::{tab-item} ollama
- [Start Ollama Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html)
:::
:::{tab-item} together
- [Start Together Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/together.html)
:::
:::{tab-item} fireworks
- [Start Fireworks Distribution](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/fireworks.html)
:::
::::
##### Troubleshooting
- If you encounter any issues, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file an new issue.
- Use `--port <PORT>` flag to use a different port number. For docker run, update the `-p <PORT>:<PORT>` flag.
## Step 2. Run Llama Stack App
### Chat Completion Test
Once the server is set up, we can test it with a client to verify it's working correctly. The following command will send a chat completion request to the server's `/inference/chat_completion` API:
First, let's set up some environment variables that we will use in the rest of the guide. Note that if you open up a new terminal, you will need to set these again.
```bash
$ curl http://localhost:5000/alpha/inference/chat-completion \
-H "Content-Type: application/json" \
-d '{
"model_id": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
# ollama names this model differently, and we must use the ollama name when loading the model
export OLLAMA_INFERENCE_MODEL="llama3.2:3b-instruct-fp16"
export LLAMA_STACK_PORT=5001
```
### 1. Start Ollama
```bash
ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
```
By default, Ollama keeps the model loaded in memory for 5 minutes which can be too short. We set the `--keepalive` flag to 60 minutes to enspagents/agenure the model remains loaded for sometime.
### 2. Start the Llama Stack server
Llama Stack is based on a client-server architecture. It consists of a server which can be configured very flexibly so you can mix-and-match various providers for its individual API components -- beyond Inference, these include Memory, Agents, Telemetry, Evals and so forth.
```bash
docker run \
-it \
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
-v ~/.llama:/root/.llama \
llamastack/distribution-ollama \
--port $LLAMA_STACK_PORT \
--env INFERENCE_MODEL=$INFERENCE_MODEL \
--env OLLAMA_URL=http://host.docker.internal:11434
```
Configuration for this is available at `distributions/ollama/run.yaml`.
### 3. Use the Llama Stack client SDK
You can interact with the Llama Stack server using the `llama-stack-client` CLI or via the Python SDK.
```bash
pip install llama-stack-client
```
Let's use the `llama-stack-client` CLI to check the connectivity to the server.
```bash
llama-stack-client --endpoint http://localhost:$LLAMA_STACK_PORT models list
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ identifier ┃ provider_id ┃ provider_resource_id ┃ metadata ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ meta-llama/Llama-3.2-3B-Instruct │ ollama │ llama3.2:3b-instruct-fp16 │ │
└──────────────────────────────────┴─────────────┴───────────────────────────┴──────────┘
```
You can test basic Llama inference completion using the CLI too.
```bash
llama-stack-client --endpoint http://localhost:$LLAMA_STACK_PORT \
inference chat_completion \
--message "hello, what model are you?"
```
Here is a simple example to perform chat completions using Python instead of the CLI.
```python
import os
from llama_stack_client import LlamaStackClient
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
# List available models
models = client.models.list()
print(models)
response = client.inference.chat_completion(
model_id=os.environ["INFERENCE_MODEL"],
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Write me a 2 sentence poem about the moon"}
],
"sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
Output:
{'completion_message': {'role': 'assistant',
'content': 'The moon glows softly in the midnight sky, \nA beacon of wonder, as it catches the eye.',
'stop_reason': 'out_of_tokens',
'tool_calls': []},
'logprobs': null}
{"role": "user", "content": "Write a haiku about coding"}
]
)
print(response.completion_message.content)
```
### Run Agent App
### 4. Your first RAG agent
To run an agent app, check out examples demo scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo. To run a simple agent app:
Here is an example of a simple RAG agent that uses the Llama Stack client SDK.
```bash
$ git clone git@github.com:meta-llama/llama-stack-apps.git
$ cd llama-stack-apps
$ pip install -r requirements.txt
```python
import asyncio
import os
$ python -m examples.agents.client <host> <port>
from llama_stack_client import LlamaStackClient
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types import Attachment
from llama_stack_client.types.agent_create_params import AgentConfig
async def run_main():
urls = ["chat.rst", "llama3.rst", "datasets.rst", "lora_finetune.rst"]
attachments = [
Attachment(
content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
mime_type="text/plain",
)
for i, url in enumerate(urls)
]
client = LlamaStackClient(base_url=f"http://localhost:{os.environ['LLAMA_STACK_PORT']}")
agent_config = AgentConfig(
model=os.environ["INFERENCE_MODEL"],
instructions="You are a helpful assistant",
tools=[{"type": "memory"}], # enable Memory aka RAG
)
agent = Agent(client, agent_config)
session_id = agent.create_session("test-session")
print(f"Created session_id={session_id} for Agent({agent.agent_id})")
user_prompts = [
(
"I am attaching documentation for Torchtune. Help me answer questions I will ask next.",
attachments,
),
(
"What are the top 5 topics that were explained? Only list succinct bullet points.",
None,
),
]
for prompt, attachments in user_prompts:
response = agent.create_turn(
messages=[{"role": "user", "content": prompt}],
attachments=attachments,
session_id=session_id,
)
async for log in EventLogger().log(response):
log.print()
if __name__ == "__main__":
asyncio.run(run_main())
```
You will see outputs of the form --
```
User> I am planning a trip to Switzerland, what are the top 3 places to visit?
inference> Switzerland is a beautiful country with a rich history, stunning landscapes, and vibrant culture. Here are three must-visit places to add to your itinerary:
...
## Next Steps
User> What is so special about #1?
inference> Jungfraujoch, also known as the "Top of Europe," is a unique and special place for several reasons:
...
User> What other countries should I consider to club?
inference> Considering your interest in Switzerland, here are some neighboring countries that you may want to consider visiting:
```
- Learn more about Llama Stack [Concepts](../concepts/index.md)
- Learn how to [Build Llama Stacks](../distributions/index.md)
- See [References](../references/index.md) for more details about the llama CLI and Python SDK
- For example applications and more detailed tutorials, visit our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository.

View file

@ -1,50 +1,48 @@
# Llama Stack
Llama Stack defines and standardizes the building blocks needed to bring generative AI applications to market. It empowers developers building agentic applications by giving them options to operate in various environments (on-prem, cloud, single-node, on-device) while relying on a standard API interface and developer experience that's certified by Meta.
The Stack APIs are rapidly improving but still a work-in-progress. We invite feedback as well as direct contributions.
Llama Stack defines and standardizes the set of core building blocks needed to bring generative AI applications to market. These building blocks are presented in the form of interoperable APIs with a broad set of Service Providers providing their implementations.
```{image} ../_static/llama-stack.png
:alt: Llama Stack
:width: 600px
:align: center
:width: 400px
```
## APIs
Our goal is to provide pre-packaged implementations which can be operated in a variety of deployment environments: developers start iterating with Desktops or their mobile devices and can seamlessly transition to on-prem or public cloud deployments. At every point in this transition, the same set of APIs and the same developer experience is available.
The set of APIs in Llama Stack can be roughly split into two broad categories:
```{note}
The Stack APIs are rapidly improving but still a work-in-progress. We invite feedback as well as direct contributions.
```
- APIs focused on Application development
- Inference
- Safety
- Memory
- Agentic System
- Evaluation
## Philosophy
- APIs focused on Model development
- Evaluation
- Post Training
- Synthetic Data Generation
- Reward Scoring
### Service-oriented design
Each API is a collection of REST endpoints.
Unlike other frameworks, Llama Stack is built with a service-oriented, REST API-first approach. Such a design not only allows for seamless transitions from a local to remote deployments, but also forces the design to be more declarative. We believe this restriction can result in a much simpler, robust developer experience. This will necessarily trade-off against expressivity however if we get the APIs right, it can lead to a very powerful platform.
## API Providers
### Composability
A Provider is what makes the API real they provide the actual implementation backing the API.
We expect the set of APIs we design to be composable. An Agent abstractly depends on { Inference, Memory, Safety } APIs but does not care about the actual implementation details. Safety itself may require model inference and hence can depend on the Inference API.
As an example, for Inference, we could have the implementation be backed by open source libraries like [ torch | vLLM | TensorRT ] as possible options.
### Turnkey one-stop solutions
A provider can also be a relay to a remote REST service ex. cloud providers or dedicated inference providers that serve these APIs.
We expect to provide turnkey solutions for popular deployment scenarios. It should be easy to deploy a Llama Stack server on AWS or on a private data center. Either of these should allow a developer to get started with powerful agentic apps, model evaluations or fine-tuning services in a matter of minutes. They should all result in the same uniform observability and developer experience.
## Distribution
### Focus on Llama models
As a Meta initiated project, we have started by explicitly focusing on Meta's Llama series of models. Supporting the broad set of open models is no easy task and we want to start with models we understand best.
### Supporting the Ecosystem
There is a vibrant ecosystem of Providers which provide efficient inference or scalable vector stores or powerful observability solutions. We want to make sure it is easy for developers to pick and choose the best implementations for their use cases. We also want to make sure it is easy for new Providers to onboard and participate in the ecosystem.
Additionally, we have designed every element of the Stack such that APIs as well as Resources (like Models) can be federated.
A Distribution is where APIs and Providers are assembled together to provide a consistent whole to the end application developer. You can mix-and-match providers some could be backed by local code and some could be remote. As a hobbyist, you can serve a small model locally, but can choose a cloud provider for a large model. Regardless, the higher level APIs your app needs to work with don't need to change at all. You can even imagine moving across the server / mobile-device boundary as well always using the same uniform set of APIs for developing Generative AI applications.
## Supported Llama Stack Implementations
### API Providers
| **API Provider Builder** | **Environments** | **Agents** | **Inference** | **Memory** | **Safety** | **Telemetry** |
Llama Stack already has a number of "adapters" available for some popular Inference and Memory (Vector Store) providers. For other APIs (particularly Safety and Agents), we provide *reference implementations* you can use to get started. We expect this list to grow over time. We are slowly onboarding more providers to the ecosystem as we get more confidence in the APIs.
| **API Provider** | **Environments** | **Agents** | **Inference** | **Memory** | **Safety** | **Telemetry** |
| :----: | :----: | :----: | :----: | :----: | :----: | :----: |
| Meta Reference | Single Node | Y | Y | Y | Y | Y |
| Fireworks | Hosted | Y | Y | Y | | |
@ -53,21 +51,17 @@ A Distribution is where APIs and Providers are assembled together to provide a c
| Ollama | Single Node | | Y | | |
| TGI | Hosted and Single Node | | Y | | |
| Chroma | Single Node | | | Y | | |
| PG Vector | Single Node | | | Y | | |
| Postgres | Single Node | | | Y | | |
| PyTorch ExecuTorch | On-device iOS | Y | Y | | |
### Distributions
## Dive In
| **Distribution** | **Llama Stack Docker** | Start This Distribution | **Inference** | **Agents** | **Memory** | **Safety** | **Telemetry** |
|:----------------: |:------------------------------------------: |:-----------------------: |:------------------: |:------------------: |:------------------: |:------------------: |:------------------: |
| Meta Reference | [llamastack/distribution-meta-reference-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-gpu/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-gpu.html) | meta-reference | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Meta Reference Quantized | [llamastack/distribution-meta-reference-quantized-gpu](https://hub.docker.com/repository/docker/llamastack/distribution-meta-reference-quantized-gpu/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/meta-reference-quantized-gpu.html) | meta-reference-quantized | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Ollama | [llamastack/distribution-ollama](https://hub.docker.com/repository/docker/llamastack/distribution-ollama/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/ollama.html) | remote::ollama | meta-reference | remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| TGI | [llamastack/distribution-tgi](https://hub.docker.com/repository/docker/llamastack/distribution-tgi/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/self_hosted_distro/tgi.html) | remote::tgi | meta-reference | meta-reference; remote::pgvector; remote::chromadb | meta-reference | meta-reference |
| Together | [llamastack/distribution-together](https://hub.docker.com/repository/docker/llamastack/distribution-together/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/remote_hosted_distro/together.html) | remote::together | meta-reference | remote::weaviate | meta-reference | meta-reference |
| Fireworks | [llamastack/distribution-fireworks](https://hub.docker.com/repository/docker/llamastack/distribution-fireworks/general) | [Guide](https://llama-stack.readthedocs.io/en/latest/getting_started/distributions/remote_hosted_distro/fireworks.html) | remote::fireworks | meta-reference | remote::weaviate | meta-reference | meta-reference |
- Look at [Quick Start](getting_started/index) section to get started with Llama Stack.
- Learn more about [Llama Stack Concepts](concepts/index) to understand how different components fit together.
- Check out [Zero to Hero](https://github.com/meta-llama/llama-stack/tree/main/docs/zero_to_hero_guide) guide to learn in details about how to build your first agent.
- See how you can use [Llama Stack Distributions](distributions/index) to get started with popular inference and other service providers.
## Llama Stack Client SDK
We also provide a number of Client side SDKs to make it easier to connect to Llama Stack server in your preferred language.
| **Language** | **Client SDK** | **Package** |
| :----: | :----: | :----: |
@ -76,18 +70,17 @@ A Distribution is where APIs and Providers are assembled together to provide a c
| Node | [llama-stack-client-node](https://github.com/meta-llama/llama-stack-client-node) | [![NPM version](https://img.shields.io/npm/v/llama-stack-client.svg)](https://npmjs.org/package/llama-stack-client)
| Kotlin | [llama-stack-client-kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) | [![Maven version](https://img.shields.io/maven-central/v/com.llama.llamastack/llama-stack-client-kotlin)](https://central.sonatype.com/artifact/com.llama.llamastack/llama-stack-client-kotlin)
Check out our client SDKs for connecting to Llama Stack server in your preferred language, you can choose from [python](https://github.com/meta-llama/llama-stack-client-python), [node](https://github.com/meta-llama/llama-stack-client-node), [swift](https://github.com/meta-llama/llama-stack-client-swift), and [kotlin](https://github.com/meta-llama/llama-stack-client-kotlin) programming languages to quickly build your applications.
You can find more example scripts with client SDKs to talk with the Llama Stack server in our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repo.
```{toctree}
:hidden:
:maxdepth: 3
getting_started/index
cli_reference/index
cli_reference/download_models
api_providers/index
distribution_dev/index
concepts/index
distributions/index
building_applications/index
contributing/index
references/index
cookbooks/index
```

View file

@ -0,0 +1,7 @@
# API Reference
```{eval-rst}
.. sphinxcontrib-redoc:: ../resources/llama-stack-spec.yaml
:page-title: API Reference
:expand-responses: all
```

View file

@ -0,0 +1,17 @@
# References
- [API Reference](api_reference/index) for the Llama Stack API specification
- [Python SDK Reference](python_sdk_reference/index)
- [Llama CLI](llama_cli_reference/index) for building and running your Llama Stack server
- [Llama Stack Client CLI](llama_stack_client_cli_reference) for interacting with your Llama Stack server
```{toctree}
:maxdepth: 1
:hidden:
api_reference/index
python_sdk_reference/index
llama_cli_reference/index
llama_stack_client_cli_reference
llama_cli_reference/download_models
```

View file

@ -1,4 +1,4 @@
# CLI Reference
# llama (server-side) CLI Reference
The `llama` CLI tool helps you setup and use the Llama Stack. It should be available on your path after installing the `llama-stack` package.
@ -29,7 +29,7 @@ You have two ways to install Llama Stack:
## `llama` subcommands
1. `download`: `llama` cli tools supports downloading the model from Meta or Hugging Face.
2. `model`: Lists available models and their properties.
3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](../distribution_dev/building_distro.md).
3. `stack`: Allows you to build and run a Llama Stack server. You can read more about this [here](../../distributions/building_distro).
### Sample Usage
@ -119,7 +119,7 @@ You should see a table like this:
To download models, you can use the llama download command.
#### Downloading from [Meta](https://llama.meta.com/llama-downloads/)
### Downloading from [Meta](https://llama.meta.com/llama-downloads/)
Here is an example download command to get the 3B-Instruct/11B-Vision-Instruct model. You will need META_URL which can be obtained from [here](https://llama.meta.com/docs/getting_the_models/meta/)
@ -137,7 +137,7 @@ llama download --source meta --model-id Prompt-Guard-86M --meta-url META_URL
llama download --source meta --model-id Llama-Guard-3-1B --meta-url META_URL
```
#### Downloading from [Hugging Face](https://huggingface.co/meta-llama)
### Downloading from [Hugging Face](https://huggingface.co/meta-llama)
Essentially, the same commands above work, just replace `--source meta` with `--source huggingface`.
@ -228,7 +228,7 @@ You can even run `llama model prompt-format` see all of the templates and their
```
llama model prompt-format -m Llama3.2-3B-Instruct
```
![alt text](../../resources/prompt-format.png)
![alt text](../../../resources/prompt-format.png)

View file

@ -0,0 +1,223 @@
# llama (client-side) CLI Reference
The `llama-stack-client` CLI allows you to query information about the distribution.
## Basic Commands
### `llama-stack-client`
```bash
$ llama-stack-client -h
usage: llama-stack-client [-h] {models,memory_banks,shields} ...
Welcome to the LlamaStackClient CLI
options:
-h, --help show this help message and exit
subcommands:
{models,memory_banks,shields}
```
### `llama-stack-client configure`
```bash
$ llama-stack-client configure
> Enter the host name of the Llama Stack distribution server: localhost
> Enter the port number of the Llama Stack distribution server: 5000
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:5000
```
### `llama-stack-client providers list`
```bash
$ llama-stack-client providers list
```
```
+-----------+----------------+-----------------+
| API | Provider ID | Provider Type |
+===========+================+=================+
| scoring | meta0 | meta-reference |
+-----------+----------------+-----------------+
| datasetio | meta0 | meta-reference |
+-----------+----------------+-----------------+
| inference | tgi0 | remote::tgi |
+-----------+----------------+-----------------+
| memory | meta-reference | meta-reference |
+-----------+----------------+-----------------+
| agents | meta-reference | meta-reference |
+-----------+----------------+-----------------+
| telemetry | meta-reference | meta-reference |
+-----------+----------------+-----------------+
| safety | meta-reference | meta-reference |
+-----------+----------------+-----------------+
```
## Model Management
### `llama-stack-client models list`
```bash
$ llama-stack-client models list
```
```
+----------------------+----------------------+---------------+----------------------------------------------------------+
| identifier | llama_model | provider_id | metadata |
+======================+======================+===============+==========================================================+
| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | tgi0 | {'huggingface_repo': 'meta-llama/Llama-3.1-8B-Instruct'} |
+----------------------+----------------------+---------------+----------------------------------------------------------+
```
### `llama-stack-client models get`
```bash
$ llama-stack-client models get Llama3.1-8B-Instruct
```
```
+----------------------+----------------------+----------------------------------------------------------+---------------+
| identifier | llama_model | metadata | provider_id |
+======================+======================+==========================================================+===============+
| Llama3.1-8B-Instruct | Llama3.1-8B-Instruct | {'huggingface_repo': 'meta-llama/Llama-3.1-8B-Instruct'} | tgi0 |
+----------------------+----------------------+----------------------------------------------------------+---------------+
```
```bash
$ llama-stack-client models get Random-Model
Model RandomModel is not found at distribution endpoint host:port. Please ensure endpoint is serving specified model.
```
### `llama-stack-client models register`
```bash
$ llama-stack-client models register <model_id> [--provider-id <provider_id>] [--provider-model-id <provider_model_id>] [--metadata <metadata>]
```
### `llama-stack-client models update`
```bash
$ llama-stack-client models update <model_id> [--provider-id <provider_id>] [--provider-model-id <provider_model_id>] [--metadata <metadata>]
```
### `llama-stack-client models delete`
```bash
$ llama-stack-client models delete <model_id>
```
## Memory Bank Management
### `llama-stack-client memory_banks list`
```bash
$ llama-stack-client memory_banks list
```
```
+--------------+----------------+--------+-------------------+------------------------+--------------------------+
| identifier | provider_id | type | embedding_model | chunk_size_in_tokens | overlap_size_in_tokens |
+==============+================+========+===================+========================+==========================+
| test_bank | meta-reference | vector | all-MiniLM-L6-v2 | 512 | 64 |
+--------------+----------------+--------+-------------------+------------------------+--------------------------+
```
### `llama-stack-client memory_banks register`
```bash
$ llama-stack-client memory_banks register <memory-bank-id> --type <type> [--provider-id <provider-id>] [--provider-memory-bank-id <provider-memory-bank-id>] [--chunk-size <chunk-size>] [--embedding-model <embedding-model>] [--overlap-size <overlap-size>]
```
Options:
- `--type`: Required. Type of memory bank. Choices: "vector", "keyvalue", "keyword", "graph"
- `--provider-id`: Optional. Provider ID for the memory bank
- `--provider-memory-bank-id`: Optional. Provider's memory bank ID
- `--chunk-size`: Optional. Chunk size in tokens (for vector type). Default: 512
- `--embedding-model`: Optional. Embedding model (for vector type). Default: "all-MiniLM-L6-v2"
- `--overlap-size`: Optional. Overlap size in tokens (for vector type). Default: 64
### `llama-stack-client memory_banks unregister`
```bash
$ llama-stack-client memory_banks unregister <memory-bank-id>
```
## Shield Management
### `llama-stack-client shields list`
```bash
$ llama-stack-client shields list
```
```
+--------------+----------+----------------+-------------+
| identifier | params | provider_id | type |
+==============+==========+================+=============+
| llama_guard | {} | meta-reference | llama_guard |
+--------------+----------+----------------+-------------+
```
### `llama-stack-client shields register`
```bash
$ llama-stack-client shields register --shield-id <shield-id> [--provider-id <provider-id>] [--provider-shield-id <provider-shield-id>] [--params <params>]
```
Options:
- `--shield-id`: Required. ID of the shield
- `--provider-id`: Optional. Provider ID for the shield
- `--provider-shield-id`: Optional. Provider's shield ID
- `--params`: Optional. JSON configuration parameters for the shield
## Eval Task Management
### `llama-stack-client eval_tasks list`
```bash
$ llama-stack-client eval_tasks list
```
### `llama-stack-client eval_tasks register`
```bash
$ llama-stack-client eval_tasks register --eval-task-id <eval-task-id> --dataset-id <dataset-id> --scoring-functions <function1> [<function2> ...] [--provider-id <provider-id>] [--provider-eval-task-id <provider-eval-task-id>] [--metadata <metadata>]
```
Options:
- `--eval-task-id`: Required. ID of the eval task
- `--dataset-id`: Required. ID of the dataset to evaluate
- `--scoring-functions`: Required. One or more scoring functions to use for evaluation
- `--provider-id`: Optional. Provider ID for the eval task
- `--provider-eval-task-id`: Optional. Provider's eval task ID
- `--metadata`: Optional. Metadata for the eval task in JSON format
## Eval execution
### `llama-stack-client eval run-benchmark`
```bash
$ llama-stack-client eval run-benchmark <eval-task-id1> [<eval-task-id2> ...] --eval-task-config <config-file> --output-dir <output-dir> [--num-examples <num>] [--visualize]
```
Options:
- `--eval-task-config`: Required. Path to the eval task config file in JSON format
- `--output-dir`: Required. Path to the directory where evaluation results will be saved
- `--num-examples`: Optional. Number of examples to evaluate (useful for debugging)
- `--visualize`: Optional flag. If set, visualizes evaluation results after completion
Example eval_task_config.json:
```json
{
"type": "benchmark",
"eval_candidate": {
"type": "model",
"model": "Llama3.1-405B-Instruct",
"sampling_params": {
"strategy": "greedy",
"temperature": 0,
"top_p": 0.95,
"top_k": 0,
"max_tokens": 0,
"repetition_penalty": 1.0
}
}
}
```
### `llama-stack-client eval run-scoring`
```bash
$ llama-stack-client eval run-scoring <eval-task-id> --eval-task-config <config-file> --output-dir <output-dir> [--num-examples <num>] [--visualize]
```
Options:
- `--eval-task-config`: Required. Path to the eval task config file in JSON format
- `--output-dir`: Required. Path to the directory where scoring results will be saved
- `--num-examples`: Optional. Number of examples to evaluate (useful for debugging)
- `--visualize`: Optional flag. If set, visualizes scoring results after completion

View file

@ -0,0 +1,348 @@
# Python SDK Reference
## Shared Types
```python
from llama_stack_client.types import (
Attachment,
BatchCompletion,
CompletionMessage,
SamplingParams,
SystemMessage,
ToolCall,
ToolResponseMessage,
UserMessage,
)
```
## Telemetry
Types:
```python
from llama_stack_client.types import TelemetryGetTraceResponse
```
Methods:
- <code title="get /telemetry/get_trace">client.telemetry.<a href="./src/llama_stack_client/resources/telemetry.py">get_trace</a>(\*\*<a href="src/llama_stack_client/types/telemetry_get_trace_params.py">params</a>) -> <a href="./src/llama_stack_client/types/telemetry_get_trace_response.py">TelemetryGetTraceResponse</a></code>
- <code title="post /telemetry/log_event">client.telemetry.<a href="./src/llama_stack_client/resources/telemetry.py">log</a>(\*\*<a href="src/llama_stack_client/types/telemetry_log_params.py">params</a>) -> None</code>
## Agents
Types:
```python
from llama_stack_client.types import (
InferenceStep,
MemoryRetrievalStep,
RestAPIExecutionConfig,
ShieldCallStep,
ToolExecutionStep,
ToolParamDefinition,
AgentCreateResponse,
)
```
Methods:
- <code title="post /agents/create">client.agents.<a href="./src/llama_stack_client/resources/agents/agents.py">create</a>(\*\*<a href="src/llama_stack_client/types/agent_create_params.py">params</a>) -> <a href="./src/llama_stack_client/types/agent_create_response.py">AgentCreateResponse</a></code>
- <code title="post /agents/delete">client.agents.<a href="./src/llama_stack_client/resources/agents/agents.py">delete</a>(\*\*<a href="src/llama_stack_client/types/agent_delete_params.py">params</a>) -> None</code>
### Sessions
Types:
```python
from llama_stack_client.types.agents import Session, SessionCreateResponse
```
Methods:
- <code title="post /agents/session/create">client.agents.sessions.<a href="./src/llama_stack_client/resources/agents/sessions.py">create</a>(\*\*<a href="src/llama_stack_client/types/agents/session_create_params.py">params</a>) -> <a href="./src/llama_stack_client/types/agents/session_create_response.py">SessionCreateResponse</a></code>
- <code title="post /agents/session/get">client.agents.sessions.<a href="./src/llama_stack_client/resources/agents/sessions.py">retrieve</a>(\*\*<a href="src/llama_stack_client/types/agents/session_retrieve_params.py">params</a>) -> <a href="./src/llama_stack_client/types/agents/session.py">Session</a></code>
- <code title="post /agents/session/delete">client.agents.sessions.<a href="./src/llama_stack_client/resources/agents/sessions.py">delete</a>(\*\*<a href="src/llama_stack_client/types/agents/session_delete_params.py">params</a>) -> None</code>
### Steps
Types:
```python
from llama_stack_client.types.agents import AgentsStep
```
Methods:
- <code title="get /agents/step/get">client.agents.steps.<a href="./src/llama_stack_client/resources/agents/steps.py">retrieve</a>(\*\*<a href="src/llama_stack_client/types/agents/step_retrieve_params.py">params</a>) -> <a href="./src/llama_stack_client/types/agents/agents_step.py">AgentsStep</a></code>
### Turns
Types:
```python
from llama_stack_client.types.agents import AgentsTurnStreamChunk, Turn, TurnStreamEvent
```
Methods:
- <code title="post /agents/turn/create">client.agents.turns.<a href="./src/llama_stack_client/resources/agents/turns.py">create</a>(\*\*<a href="src/llama_stack_client/types/agents/turn_create_params.py">params</a>) -> <a href="./src/llama_stack_client/types/agents/agents_turn_stream_chunk.py">AgentsTurnStreamChunk</a></code>
- <code title="get /agents/turn/get">client.agents.turns.<a href="./src/llama_stack_client/resources/agents/turns.py">retrieve</a>(\*\*<a href="src/llama_stack_client/types/agents/turn_retrieve_params.py">params</a>) -> <a href="./src/llama_stack_client/types/agents/turn.py">Turn</a></code>
## Datasets
Types:
```python
from llama_stack_client.types import TrainEvalDataset
```
Methods:
- <code title="post /datasets/create">client.datasets.<a href="./src/llama_stack_client/resources/datasets.py">create</a>(\*\*<a href="src/llama_stack_client/types/dataset_create_params.py">params</a>) -> None</code>
- <code title="post /datasets/delete">client.datasets.<a href="./src/llama_stack_client/resources/datasets.py">delete</a>(\*\*<a href="src/llama_stack_client/types/dataset_delete_params.py">params</a>) -> None</code>
- <code title="get /datasets/get">client.datasets.<a href="./src/llama_stack_client/resources/datasets.py">get</a>(\*\*<a href="src/llama_stack_client/types/dataset_get_params.py">params</a>) -> <a href="./src/llama_stack_client/types/train_eval_dataset.py">TrainEvalDataset</a></code>
## Evaluate
Types:
```python
from llama_stack_client.types import EvaluationJob
```
### Jobs
Types:
```python
from llama_stack_client.types.evaluate import (
EvaluationJobArtifacts,
EvaluationJobLogStream,
EvaluationJobStatus,
)
```
Methods:
- <code title="get /evaluate/jobs">client.evaluate.jobs.<a href="./src/llama_stack_client/resources/evaluate/jobs/jobs.py">list</a>() -> <a href="./src/llama_stack_client/types/evaluation_job.py">EvaluationJob</a></code>
- <code title="post /evaluate/job/cancel">client.evaluate.jobs.<a href="./src/llama_stack_client/resources/evaluate/jobs/jobs.py">cancel</a>(\*\*<a href="src/llama_stack_client/types/evaluate/job_cancel_params.py">params</a>) -> None</code>
#### Artifacts
Methods:
- <code title="get /evaluate/job/artifacts">client.evaluate.jobs.artifacts.<a href="./src/llama_stack_client/resources/evaluate/jobs/artifacts.py">list</a>(\*\*<a href="src/llama_stack_client/types/evaluate/jobs/artifact_list_params.py">params</a>) -> <a href="./src/llama_stack_client/types/evaluate/evaluation_job_artifacts.py">EvaluationJobArtifacts</a></code>
#### Logs
Methods:
- <code title="get /evaluate/job/logs">client.evaluate.jobs.logs.<a href="./src/llama_stack_client/resources/evaluate/jobs/logs.py">list</a>(\*\*<a href="src/llama_stack_client/types/evaluate/jobs/log_list_params.py">params</a>) -> <a href="./src/llama_stack_client/types/evaluate/evaluation_job_log_stream.py">EvaluationJobLogStream</a></code>
#### Status
Methods:
- <code title="get /evaluate/job/status">client.evaluate.jobs.status.<a href="./src/llama_stack_client/resources/evaluate/jobs/status.py">list</a>(\*\*<a href="src/llama_stack_client/types/evaluate/jobs/status_list_params.py">params</a>) -> <a href="./src/llama_stack_client/types/evaluate/evaluation_job_status.py">EvaluationJobStatus</a></code>
### QuestionAnswering
Methods:
- <code title="post /evaluate/question_answering/">client.evaluate.question_answering.<a href="./src/llama_stack_client/resources/evaluate/question_answering.py">create</a>(\*\*<a href="src/llama_stack_client/types/evaluate/question_answering_create_params.py">params</a>) -> <a href="./src/llama_stack_client/types/evaluation_job.py">EvaluationJob</a></code>
## Evaluations
Methods:
- <code title="post /evaluate/summarization/">client.evaluations.<a href="./src/llama_stack_client/resources/evaluations.py">summarization</a>(\*\*<a href="src/llama_stack_client/types/evaluation_summarization_params.py">params</a>) -> <a href="./src/llama_stack_client/types/evaluation_job.py">EvaluationJob</a></code>
- <code title="post /evaluate/text_generation/">client.evaluations.<a href="./src/llama_stack_client/resources/evaluations.py">text_generation</a>(\*\*<a href="src/llama_stack_client/types/evaluation_text_generation_params.py">params</a>) -> <a href="./src/llama_stack_client/types/evaluation_job.py">EvaluationJob</a></code>
## Inference
Types:
```python
from llama_stack_client.types import (
ChatCompletionStreamChunk,
CompletionStreamChunk,
TokenLogProbs,
InferenceChatCompletionResponse,
InferenceCompletionResponse,
)
```
Methods:
- <code title="post /inference/chat_completion">client.inference.<a href="./src/llama_stack_client/resources/inference/inference.py">chat_completion</a>(\*\*<a href="src/llama_stack_client/types/inference_chat_completion_params.py">params</a>) -> <a href="./src/llama_stack_client/types/inference_chat_completion_response.py">InferenceChatCompletionResponse</a></code>
- <code title="post /inference/completion">client.inference.<a href="./src/llama_stack_client/resources/inference/inference.py">completion</a>(\*\*<a href="src/llama_stack_client/types/inference_completion_params.py">params</a>) -> <a href="./src/llama_stack_client/types/inference_completion_response.py">InferenceCompletionResponse</a></code>
### Embeddings
Types:
```python
from llama_stack_client.types.inference import Embeddings
```
Methods:
- <code title="post /inference/embeddings">client.inference.embeddings.<a href="./src/llama_stack_client/resources/inference/embeddings.py">create</a>(\*\*<a href="src/llama_stack_client/types/inference/embedding_create_params.py">params</a>) -> <a href="./src/llama_stack_client/types/inference/embeddings.py">Embeddings</a></code>
## Safety
Types:
```python
from llama_stack_client.types import RunSheidResponse
```
Methods:
- <code title="post /safety/run_shield">client.safety.<a href="./src/llama_stack_client/resources/safety.py">run_shield</a>(\*\*<a href="src/llama_stack_client/types/safety_run_shield_params.py">params</a>) -> <a href="./src/llama_stack_client/types/run_sheid_response.py">RunSheidResponse</a></code>
## Memory
Types:
```python
from llama_stack_client.types import (
QueryDocuments,
MemoryCreateResponse,
MemoryRetrieveResponse,
MemoryListResponse,
MemoryDropResponse,
)
```
Methods:
- <code title="post /memory/create">client.memory.<a href="./src/llama_stack_client/resources/memory/memory.py">create</a>(\*\*<a href="src/llama_stack_client/types/memory_create_params.py">params</a>) -> <a href="./src/llama_stack_client/types/memory_create_response.py">object</a></code>
- <code title="get /memory/get">client.memory.<a href="./src/llama_stack_client/resources/memory/memory.py">retrieve</a>(\*\*<a href="src/llama_stack_client/types/memory_retrieve_params.py">params</a>) -> <a href="./src/llama_stack_client/types/memory_retrieve_response.py">object</a></code>
- <code title="post /memory/update">client.memory.<a href="./src/llama_stack_client/resources/memory/memory.py">update</a>(\*\*<a href="src/llama_stack_client/types/memory_update_params.py">params</a>) -> None</code>
- <code title="get /memory/list">client.memory.<a href="./src/llama_stack_client/resources/memory/memory.py">list</a>() -> <a href="./src/llama_stack_client/types/memory_list_response.py">object</a></code>
- <code title="post /memory/drop">client.memory.<a href="./src/llama_stack_client/resources/memory/memory.py">drop</a>(\*\*<a href="src/llama_stack_client/types/memory_drop_params.py">params</a>) -> str</code>
- <code title="post /memory/insert">client.memory.<a href="./src/llama_stack_client/resources/memory/memory.py">insert</a>(\*\*<a href="src/llama_stack_client/types/memory_insert_params.py">params</a>) -> None</code>
- <code title="post /memory/query">client.memory.<a href="./src/llama_stack_client/resources/memory/memory.py">query</a>(\*\*<a href="src/llama_stack_client/types/memory_query_params.py">params</a>) -> <a href="./src/llama_stack_client/types/query_documents.py">QueryDocuments</a></code>
### Documents
Types:
```python
from llama_stack_client.types.memory import DocumentRetrieveResponse
```
Methods:
- <code title="post /memory/documents/get">client.memory.documents.<a href="./src/llama_stack_client/resources/memory/documents.py">retrieve</a>(\*\*<a href="src/llama_stack_client/types/memory/document_retrieve_params.py">params</a>) -> <a href="./src/llama_stack_client/types/memory/document_retrieve_response.py">DocumentRetrieveResponse</a></code>
- <code title="post /memory/documents/delete">client.memory.documents.<a href="./src/llama_stack_client/resources/memory/documents.py">delete</a>(\*\*<a href="src/llama_stack_client/types/memory/document_delete_params.py">params</a>) -> None</code>
## PostTraining
Types:
```python
from llama_stack_client.types import PostTrainingJob
```
Methods:
- <code title="post /post_training/preference_optimize">client.post_training.<a href="./src/llama_stack_client/resources/post_training/post_training.py">preference_optimize</a>(\*\*<a href="src/llama_stack_client/types/post_training_preference_optimize_params.py">params</a>) -> <a href="./src/llama_stack_client/types/post_training_job.py">PostTrainingJob</a></code>
- <code title="post /post_training/supervised_fine_tune">client.post_training.<a href="./src/llama_stack_client/resources/post_training/post_training.py">supervised_fine_tune</a>(\*\*<a href="src/llama_stack_client/types/post_training_supervised_fine_tune_params.py">params</a>) -> <a href="./src/llama_stack_client/types/post_training_job.py">PostTrainingJob</a></code>
### Jobs
Types:
```python
from llama_stack_client.types.post_training import (
PostTrainingJobArtifacts,
PostTrainingJobLogStream,
PostTrainingJobStatus,
)
```
Methods:
- <code title="get /post_training/jobs">client.post_training.jobs.<a href="./src/llama_stack_client/resources/post_training/jobs.py">list</a>() -> <a href="./src/llama_stack_client/types/post_training_job.py">PostTrainingJob</a></code>
- <code title="get /post_training/job/artifacts">client.post_training.jobs.<a href="./src/llama_stack_client/resources/post_training/jobs.py">artifacts</a>(\*\*<a href="src/llama_stack_client/types/post_training/job_artifacts_params.py">params</a>) -> <a href="./src/llama_stack_client/types/post_training/post_training_job_artifacts.py">PostTrainingJobArtifacts</a></code>
- <code title="post /post_training/job/cancel">client.post_training.jobs.<a href="./src/llama_stack_client/resources/post_training/jobs.py">cancel</a>(\*\*<a href="src/llama_stack_client/types/post_training/job_cancel_params.py">params</a>) -> None</code>
- <code title="get /post_training/job/logs">client.post_training.jobs.<a href="./src/llama_stack_client/resources/post_training/jobs.py">logs</a>(\*\*<a href="src/llama_stack_client/types/post_training/job_logs_params.py">params</a>) -> <a href="./src/llama_stack_client/types/post_training/post_training_job_log_stream.py">PostTrainingJobLogStream</a></code>
- <code title="get /post_training/job/status">client.post_training.jobs.<a href="./src/llama_stack_client/resources/post_training/jobs.py">status</a>(\*\*<a href="src/llama_stack_client/types/post_training/job_status_params.py">params</a>) -> <a href="./src/llama_stack_client/types/post_training/post_training_job_status.py">PostTrainingJobStatus</a></code>
## RewardScoring
Types:
```python
from llama_stack_client.types import RewardScoring, ScoredDialogGenerations
```
Methods:
- <code title="post /reward_scoring/score">client.reward_scoring.<a href="./src/llama_stack_client/resources/reward_scoring.py">score</a>(\*\*<a href="src/llama_stack_client/types/reward_scoring_score_params.py">params</a>) -> <a href="./src/llama_stack_client/types/reward_scoring.py">RewardScoring</a></code>
## SyntheticDataGeneration
Types:
```python
from llama_stack_client.types import SyntheticDataGeneration
```
Methods:
- <code title="post /synthetic_data_generation/generate">client.synthetic_data_generation.<a href="./src/llama_stack_client/resources/synthetic_data_generation.py">generate</a>(\*\*<a href="src/llama_stack_client/types/synthetic_data_generation_generate_params.py">params</a>) -> <a href="./src/llama_stack_client/types/synthetic_data_generation.py">SyntheticDataGeneration</a></code>
## BatchInference
Types:
```python
from llama_stack_client.types import BatchChatCompletion
```
Methods:
- <code title="post /batch_inference/chat_completion">client.batch_inference.<a href="./src/llama_stack_client/resources/batch_inference.py">chat_completion</a>(\*\*<a href="src/llama_stack_client/types/batch_inference_chat_completion_params.py">params</a>) -> <a href="./src/llama_stack_client/types/batch_chat_completion.py">BatchChatCompletion</a></code>
- <code title="post /batch_inference/completion">client.batch_inference.<a href="./src/llama_stack_client/resources/batch_inference.py">completion</a>(\*\*<a href="src/llama_stack_client/types/batch_inference_completion_params.py">params</a>) -> <a href="./src/llama_stack_client/types/shared/batch_completion.py">BatchCompletion</a></code>
## Models
Types:
```python
from llama_stack_client.types import ModelServingSpec
```
Methods:
- <code title="get /models/list">client.models.<a href="./src/llama_stack_client/resources/models.py">list</a>() -> <a href="./src/llama_stack_client/types/model_serving_spec.py">ModelServingSpec</a></code>
- <code title="get /models/get">client.models.<a href="./src/llama_stack_client/resources/models.py">get</a>(\*\*<a href="src/llama_stack_client/types/model_get_params.py">params</a>) -> <a href="./src/llama_stack_client/types/model_serving_spec.py">Optional</a></code>
## MemoryBanks
Types:
```python
from llama_stack_client.types import MemoryBankSpec
```
Methods:
- <code title="get /memory_banks/list">client.memory_banks.<a href="./src/llama_stack_client/resources/memory_banks.py">list</a>() -> <a href="./src/llama_stack_client/types/memory_bank_spec.py">MemoryBankSpec</a></code>
- <code title="get /memory_banks/get">client.memory_banks.<a href="./src/llama_stack_client/resources/memory_banks.py">get</a>(\*\*<a href="src/llama_stack_client/types/memory_bank_get_params.py">params</a>) -> <a href="./src/llama_stack_client/types/memory_bank_spec.py">Optional</a></code>
## Shields
Types:
```python
from llama_stack_client.types import ShieldSpec
```
Methods:
- <code title="get /shields/list">client.shields.<a href="./src/llama_stack_client/resources/shields.py">list</a>() -> <a href="./src/llama_stack_client/types/shield_spec.py">ShieldSpec</a></code>
- <code title="get /shields/get">client.shields.<a href="./src/llama_stack_client/resources/shields.py">get</a>(\*\*<a href="src/llama_stack_client/types/shield_get_params.py">params</a>) -> <a href="./src/llama_stack_client/types/shield_spec.py">Optional</a></code>