mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-07-19 03:10:03 +00:00
More Updates to Read the Docs (#856)
This commit is contained in:
parent
8a686270e9
commit
74e933cbfd
8 changed files with 405 additions and 730 deletions
|
@ -13,24 +13,94 @@ In order to build your own distribution, we recommend you clone the `llama-stack
|
|||
git clone git@github.com:meta-llama/llama-stack.git
|
||||
cd llama-stack
|
||||
pip install -e .
|
||||
|
||||
llama stack build -h
|
||||
```
|
||||
Use the CLI to build your distribution.
|
||||
The main points to consider are:
|
||||
1. **Image Type** - Do you want a Conda / venv environment or a Container (e.g. Docker)?
|
||||
2. **Template** - Do you want to use a template to build your distribution, or start from scratch?
|
||||
3. **Config** - Do you want to use a pre-existing config file to build your distribution?
|
||||
|
||||
We will start by building our distribution (in the form of a Conda environment or a Container image). In this step, we will specify:
|
||||
- `name`: the name for our distribution (e.g. `my-stack`)
|
||||
- `image_type`: our build image type (`conda | container`)
|
||||
- `distribution_spec`: our distribution specs for specifying API providers
|
||||
- `description`: a short description of the configurations for the distribution
|
||||
- `providers`: specifies the underlying implementation for serving each API endpoint
|
||||
- `image_type`: `conda` | `container`, to specify whether to build the distribution as a Conda environment or a Container image.
|
||||
```
llama stack build -h

usage: llama stack build [-h] [--config CONFIG] [--template TEMPLATE] [--list-templates | --no-list-templates] [--image-type {conda,container,venv}] [--image-name IMAGE_NAME]

Build a Llama stack container

options:
  -h, --help            show this help message and exit
  --config CONFIG       Path to a config file to use for the build. You can find example configs in llama_stack/distribution/**/build.yaml.
                        If this argument is not provided, you will be prompted to enter information interactively
  --template TEMPLATE   Name of the example template config to use for build. You may use `llama stack build --list-templates` to check out the available templates
  --list-templates, --no-list-templates
                        Show the available templates for building a Llama Stack distribution (default: False)
  --image-type {conda,container,venv}
                        Image Type to use for the build. This can be either conda or container or venv. If not specified, will use the image type from the template config.
  --image-name IMAGE_NAME
                        [for image-type=conda] Name of the conda environment to use for the build. If
                        not specified, currently active Conda environment will be used. If no Conda
                        environment is active, you must specify a name.
```
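To make the option set in the help text easier to reason about, here is a minimal `argparse` sketch that mirrors it. This is an illustration only, not the CLI's actual implementation: the function name `make_build_parser` is hypothetical and the help strings are abbreviated, while the flag names and `--image-type` choices are copied from the usage string.

```python
import argparse


def make_build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed by `llama stack build -h`."""
    parser = argparse.ArgumentParser(prog="llama stack build")
    parser.add_argument("--config", help="Path to a config file to use for the build")
    parser.add_argument("--template", help="Name of the example template config to use")
    # BooleanOptionalAction (Python 3.9+) generates the paired
    # --list-templates / --no-list-templates flags seen in the usage line.
    parser.add_argument(
        "--list-templates",
        action=argparse.BooleanOptionalAction,
        default=False,
        help="Show the available templates",
    )
    parser.add_argument(
        "--image-type",
        choices=["conda", "container", "venv"],
        help="Image type to use for the build",
    )
    parser.add_argument("--image-name", help="Name of the conda environment to use")
    return parser


# Example: the same arguments as `llama stack build --template tgi --image-type venv`
args = make_build_parser().parse_args(["--template", "tgi", "--image-type", "venv"])
print(args.template, args.image_type, args.list_templates)
```

Options that are not passed (here `--config` and `--image-name`) come back as `None` in this sketch; how the real CLI treats a missing option is described in the help text rather than modeled here.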
|
||||
|
||||
After this step is complete, a file named `<name>-build.yaml` and template file `<name>-run.yaml` will be generated and saved at the output file path specified at the end of the command.
|
||||
|
||||
::::{tab-set}
|
||||
:::{tab-item} Building from a template
|
||||
To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.
|
||||
|
||||
The following command will allow you to see the available templates and their corresponding providers.
|
||||
```
|
||||
llama stack build --list-templates
|
||||
```
|
||||
|
||||
```
+------------------------------+-----------------------------------------------------------------------------+
| Template Name                | Description                                                                 |
+------------------------------+-----------------------------------------------------------------------------+
| hf-serverless                | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
| together                     | Use Together.AI for running LLM inference                                   |
+------------------------------+-----------------------------------------------------------------------------+
| vllm-gpu                     | Use a built-in vLLM engine for running LLM inference                        |
+------------------------------+-----------------------------------------------------------------------------+
| experimental-post-training   | Experimental template for post training                                     |
+------------------------------+-----------------------------------------------------------------------------+
| remote-vllm                  | Use (an external) vLLM server for running LLM inference                     |
+------------------------------+-----------------------------------------------------------------------------+
| fireworks                    | Use Fireworks.AI for running LLM inference                                  |
+------------------------------+-----------------------------------------------------------------------------+
| tgi                          | Use (an external) TGI server for running LLM inference                      |
+------------------------------+-----------------------------------------------------------------------------+
| bedrock                      | Use AWS Bedrock for running LLM inference and safety                        |
+------------------------------+-----------------------------------------------------------------------------+
| meta-reference-gpu           | Use Meta Reference for running LLM inference                                |
+------------------------------+-----------------------------------------------------------------------------+
| nvidia                       | Use NVIDIA NIM for running LLM inference                                    |
+------------------------------+-----------------------------------------------------------------------------+
| meta-reference-quantized-gpu | Use Meta Reference with fp8, int4 quantization for running LLM inference    |
+------------------------------+-----------------------------------------------------------------------------+
| cerebras                     | Use Cerebras for running LLM inference                                      |
+------------------------------+-----------------------------------------------------------------------------+
| ollama                       | Use (an external) Ollama server for running LLM inference                   |
+------------------------------+-----------------------------------------------------------------------------+
| hf-endpoint                  | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
+------------------------------+-----------------------------------------------------------------------------+
```
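If you want to consume the `--list-templates` output from a script, the two-column table can be turned into a dictionary. The following is a rough sketch under the assumption that the output keeps the `| Template Name | Description |` layout shown above; `parse_template_table` is a hypothetical helper, not part of the CLI.

```python
from typing import Dict


def parse_template_table(text: str) -> Dict[str, str]:
    """Map template name -> description from the ASCII table (hypothetical helper)."""
    templates: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # skip border rows such as +-----+-----+ and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "Template Name":
            continue  # skip the header row and anything with an unexpected shape
        templates[cells[0]] = cells[1]
    return templates


# A shortened excerpt of the table above, used as sample input.
sample = """
+----------+-----------------------------------------------------------+
| Template Name | Description |
+----------+-----------------------------------------------------------+
| tgi | Use (an external) TGI server for running LLM inference |
+----------+-----------------------------------------------------------+
| ollama | Use (an external) Ollama server for running LLM inference |
+----------+-----------------------------------------------------------+
"""

templates = parse_template_table(sample)
print(sorted(templates))
```

The parser relies only on the `|` separators, so it tolerates changes in column width but would need adjusting if more columns are added.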
|
||||
|
||||
You may then pick a template to build your distribution with providers fitted to your liking.
|
||||
|
||||
For example, to build a distribution with TGI as the inference provider, you can run:
|
||||
```
|
||||
$ llama stack build --template tgi
|
||||
...
|
||||
You can now edit ~/.llama/distributions/llamastack-tgi/tgi-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml`
|
||||
```
|
||||
:::
|
||||
:::{tab-item} Building from Scratch
|
||||
|
||||
- For a new user, we could start off by running `llama stack build`, which launches an interactive wizard that prompts you to enter build configurations.
|
||||
If the provided templates do not fit your use case, you could start off by running `llama stack build`, which launches an interactive wizard that prompts you to enter build configurations.
|
||||
|
||||
It would be best to start with a template and understand the structure of the config file and the various concepts (APIs, providers, resources, etc.) before starting from scratch.
|
||||
```
|
||||
llama stack build
|
||||
|
||||
|
@ -57,272 +127,6 @@ You can now edit ~/.llama/distributions/llamastack-my-local-stack/my-local-stack
|
|||
```
|
||||
:::
|
||||
|
||||
:::{tab-item} Building from a template
|
||||
- To build from alternative API providers, we provide distribution templates for users to get started building a distribution backed by different providers.
|
||||
|
||||
The following command will allow you to see the available templates and their corresponding providers.
|
||||
```
|
||||
llama stack build --list-templates
|
||||
```
|
||||
|
||||
```
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| Template Name | Providers | Description |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| tgi | { | Use (an external) TGI server for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::tgi" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| remote-vllm | { | Use (an external) vLLM server for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::vllm" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| vllm-gpu | { | Use a built-in vLLM engine for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "inline::vllm" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| meta-reference-quantized-gpu | { | Use Meta Reference with fp8, int4 quantization for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "inline::meta-reference-quantized" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| meta-reference-gpu | { | Use Meta Reference for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| hf-serverless | { | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::hf::serverless" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| together | { | Use Together.AI for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::together" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| ollama | { | Use (an external) Ollama server for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::ollama" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| bedrock | { | Use AWS Bedrock for running LLM inference and safety |
|
||||
| | "inference": [ | |
|
||||
| | "remote::bedrock" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "remote::bedrock" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| hf-endpoint | { | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::hf::endpoint" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| fireworks | { | Use Fireworks.AI for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::fireworks" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::faiss", | |
|
||||
| | "remote::chromadb", | |
|
||||
| | "remote::pgvector" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
| cerebras | { | Use Cerebras for running LLM inference |
|
||||
| | "inference": [ | |
|
||||
| | "remote::cerebras" | |
|
||||
| | ], | |
|
||||
| | "safety": [ | |
|
||||
| | "inline::llama-guard" | |
|
||||
| | ], | |
|
||||
| | "memory": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "agents": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ], | |
|
||||
| | "telemetry": [ | |
|
||||
| | "inline::meta-reference" | |
|
||||
| | ] | |
|
||||
| | } | |
|
||||
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
|
||||
```
|
||||
|
||||
You may then pick a template to build your distribution with providers fitted to your liking.
|
||||
|
||||
For example, to build a distribution with TGI as the inference provider, you can run:
|
||||
```
|
||||
llama stack build --template tgi
|
||||
```
|
||||
|
||||
```
|
||||
$ llama stack build --template tgi
|
||||
...
|
||||
You can now edit ~/.llama/distributions/llamastack-tgi/tgi-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml`
|
||||
```
|
||||
:::
|
||||
|
||||
:::{tab-item} Building from a pre-existing build config file
|
||||
- In addition to templates, you may customize the build to your liking by editing config files and then building from them with the following command.
|
||||
|
||||
|
@ -377,6 +181,10 @@ After this step is successful, you should be able to find the built container im
|
|||
Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file which was written out at the end by the `llama stack build` step.
|
||||
|
||||
```
|
||||
# Start using template name
|
||||
llama stack run tgi
|
||||
|
||||
# Start using config file
|
||||
llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
|
||||
```
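The run config paths in these examples follow a common pattern, `~/.llama/distributions/llamastack-<name>/<name>-run.yaml`. As a small sketch, assuming that layout holds for your installation (it is inferred from the example output in this guide, not from a documented guarantee), a helper can compute the path for a given distribution name. `run_config_path` is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def run_config_path(name: str, root: Optional[Path] = None) -> Path:
    """Return the assumed location of `<name>-run.yaml` for a built distribution.

    The directory layout is an assumption taken from the example output in
    this guide; check the path printed by `llama stack build` if in doubt.
    """
    if root is None:
        root = Path.home() / ".llama" / "distributions"
    return root / f"llamastack-{name}" / f"{name}-run.yaml"


# Example: where the run config for the `my-local-stack` distribution would live.
config = run_config_path("my-local-stack")
print(config)
print("exists:", config.exists())
```

Checking `config.exists()` before calling `llama stack run` gives an early, readable failure if the build step has not been run yet.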
|
||||
|
||||
|
@ -412,4 +220,4 @@ INFO: 2401:db00:35c:2d2b:face:0:c9:0:54678 - "GET /models/list HTTP/1.1" 200
|
|||
|
||||
### Troubleshooting
|
||||
|
||||
If you encounter any issues, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file a new issue.
|
||||
If you encounter any issues, ask questions in our Discord, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file a new issue.
|
||||
|
|
|
@ -70,20 +70,27 @@ Next up is the most critical part: the set of providers that the stack will use
|
|||
```yaml
providers:
  inference:
    # provider_id is a string you can choose freely
    - provider_id: ollama
      # provider_type is a string that specifies the type of provider.
      # in this case, the provider for inference is ollama and it is run remotely (outside of the distribution)
      provider_type: remote::ollama
      # config is a dictionary that contains the configuration for the provider.
      # in this case, the configuration is the url of the ollama server
      config:
        url: ${env.OLLAMA_URL:http://localhost:11434}
```
|
||||
A few things to note:
|
||||
- A _provider instance_ is identified with an (identifier, type, configuration) tuple. The identifier is a string you can choose freely.
|
||||
- A _provider instance_ is identified with an (id, type, configuration) triplet.
|
||||
- The id is a string you can choose freely.
|
||||
- You can instantiate any number of provider instances of the same type.
|
||||
- The configuration dictionary is provider-specific. Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
|
||||
- The configuration dictionary is provider-specific.
|
||||
- Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
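To illustrate the `${env.NAME:default}` syntax, here is a minimal sketch of how such placeholders could be expanded. It is not Llama Stack's actual implementation; the function `expand_env` and the regular expression are assumptions made for this example, and only the placeholder syntax itself comes from the config above.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${env.NAME} and ${env.NAME:default}; the default may itself contain ':'.
_ENV_PATTERN = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::([^}]*))?\}")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${env.NAME:default} placeholders in `value` (hypothetical helper)."""
    if env is None:
        env = os.environ

    def substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]  # an explicitly set variable wins over the default
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _ENV_PATTERN.sub(substitute, value)


url_template = "${env.OLLAMA_URL:http://localhost:11434}"
default_url = expand_env(url_template, env={})
override_url = expand_env(url_template, env={"OLLAMA_URL": "http://my-server:11434"})
print(default_url)
print(override_url)
```

Passing `env` explicitly keeps the sketch easy to test; leaving it out falls back to the process environment, which is roughly what overriding with `--env OLLAMA_URL=...` is meant to influence.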
|
||||
|
||||
## Resources
|
||||
```
|
||||
|
||||
Finally, let's look at the `models` section:
|
||||
|
||||
```yaml
|
||||
models:
|
||||
- metadata: {}
|
||||
|
|
|
@ -1,11 +1,20 @@
|
|||
# Using Llama Stack as a Library
|
||||
|
||||
If you are planning to use an external service for Inference (even Ollama or TGI counts as external), it is often easier to use Llama Stack as a library. This avoids the overhead of setting up a server. For [example](https://github.com/meta-llama/llama-stack-client-python/blob/main/src/llama_stack_client/lib/direct/test.py):
|
||||
If you are planning to use an external service for Inference (even Ollama or TGI counts as external), it is often easier to use Llama Stack as a library. This avoids the overhead of setting up a server.
|
||||
```bash
|
||||
# setup
|
||||
pip install llama-stack
|
||||
llama stack build --template together --image-type venv
|
||||
```
|
||||
|
||||
```python
|
||||
from llama_stack_client.lib.direct.direct import LlamaStackDirectClient
|
||||
from llama_stack.distribution.library_client import LlamaStackAsLibraryClient
|
||||
|
||||
client = await LlamaStackDirectClient.from_template('ollama')
|
||||
client = LlamaStackAsLibraryClient(
    "ollama",
    # provider_data is optional, but if you need to pass in any provider specific data, you can do so here.
    provider_data = {"tavily_search_api_key": os.environ['TAVILY_SEARCH_API_KEY']}
)
|
||||
await client.initialize()
|
||||
```
|
||||
|
||||
|
@ -14,23 +23,12 @@ This will parse your config and set up any inline implementations and remote cli
|
|||
Then, you can access the APIs like `models` and `inference` on the client and call their methods directly:
|
||||
|
||||
```python
|
||||
response = await client.models.list()
|
||||
print(response)
|
||||
```
|
||||
|
||||
```python
|
||||
response = await client.inference.chat_completion(
    messages=[UserMessage(content="What is the capital of France?", role="user")],
    model_id="Llama3.1-8B-Instruct",
    stream=False,
)
|
||||
print("\nChat completion response:")
|
||||
print(response)
|
||||
response = client.models.list()
|
||||
```
|
||||
|
||||
If you've created a [custom distribution](https://llama-stack.readthedocs.io/en/latest/distributions/building_distro.html), you can also use the run.yaml configuration file directly:
|
||||
|
||||
```python
|
||||
client = await LlamaStackDirectClient.from_config(config_path)
|
||||
await client.initialize()
|
||||
client = LlamaStackAsLibraryClient(config_path)
|
||||
client.initialize()
|
||||
```
|
||||
|
|