Merge branch 'main' into feature/dpo-training

Commit b68b818539
265 changed files with 10254 additions and 7796 deletions
@@ -7,7 +7,16 @@ Llama Stack supports external providers that live outside of the main codebase.
 ## Configuration

-To enable external providers, you need to configure the `external_providers_dir` in your Llama Stack configuration. This directory should contain your external provider specifications:
+To enable external providers, you need to add a `module` entry to your build YAML, allowing Llama Stack to install the required package corresponding to the external provider.
+
+An example entry in your build.yaml should look like:
+
+```yaml
+- provider_type: remote::ramalama
+  module: ramalama_stack
+```
+
+Additionally, you can configure the `external_providers_dir` in your Llama Stack configuration. This method is being deprecated in favor of the `module` method. If you use it, the external provider directory should contain your external provider specifications:

 ```yaml
 external_providers_dir: ~/.llama/providers.d/
@@ -112,6 +121,31 @@ container_image: custom-vector-store:latest # optional
 ## Required Implementation

+## All Providers
+
+All providers must contain a `get_provider_spec` function in their `provider` module. This is a standardized entry point that Llama Stack expects and uses to obtain details such as the config class. The `get_provider_spec` function returns a structure identical to the `adapter` definition. An example function may look like:
+
+```python
+from llama_stack.providers.datatypes import (
+    ProviderSpec,
+    Api,
+    AdapterSpec,
+    remote_provider_spec,
+)
+
+
+def get_provider_spec() -> ProviderSpec:
+    return remote_provider_spec(
+        api=Api.inference,
+        adapter=AdapterSpec(
+            adapter_type="ramalama",
+            pip_packages=["ramalama>=0.8.5", "pymilvus"],
+            config_class="ramalama_stack.config.RamalamaImplConfig",
+            module="ramalama_stack",
+        ),
+    )
+```
+
 ### Remote Providers

 Remote providers must expose a `get_adapter_impl()` function in their module that takes two arguments:
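The argument list above is cut off by the hunk boundary; the two arguments are typically the provider's config instance and a dictionary of API dependencies. A minimal sketch of such a function, using hypothetical `MyProviderConfig` and `MyProviderAdapter` classes that are not part of llama-stack, might look like:

```python
from typing import Any, Dict

from llama_stack.providers.datatypes import Api


class MyProviderConfig:  # hypothetical config class, for illustration only
    url: str = "http://localhost:8080"


class MyProviderAdapter:  # hypothetical adapter; a real one implements the Inference protocol
    def __init__(self, config: MyProviderConfig) -> None:
        self.config = config

    async def initialize(self) -> None:
        # open clients / verify connectivity to the backing service here
        pass


async def get_adapter_impl(config: MyProviderConfig, deps: Dict[Api, Any]):
    # `config` is the provider's configuration object; `deps` maps Api values to
    # the implementations this provider depends on (often empty for inference).
    impl = MyProviderAdapter(config)
    await impl.initialize()
    return impl
```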
@@ -155,7 +189,7 @@ Version: 0.1.0
 Location: /path/to/venv/lib/python3.10/site-packages
 ```

-## Example: Custom Ollama Provider
+## Example using `external_providers_dir`: Custom Ollama Provider

 Here's a complete example of creating and using a custom Ollama provider:

@@ -206,6 +240,34 @@ external_providers_dir: ~/.llama/providers.d/
 The provider will now be available in Llama Stack with the type `remote::custom_ollama`.

+## Example using `module`: ramalama-stack
+
+[ramalama-stack](https://github.com/containers/ramalama-stack) is a recognized external provider that supports installation via `module`.
+
+To install Llama Stack with this external provider, a user can provide the following build.yaml:
+
+```yaml
+version: 2
+distribution_spec:
+  description: Use (an external) Ramalama server for running LLM inference
+  container_image: null
+  providers:
+    inference:
+    - provider_type: remote::ramalama
+      module: ramalama_stack==0.3.0a0
+image_type: venv
+image_name: null
+external_providers_dir: null
+additional_pip_packages:
+- aiosqlite
+- sqlalchemy[asyncio]
+```
+
+No other steps are required beyond `llama stack build` and `llama stack run`. The build process will use `module` to install all of the provider dependencies, retrieve the spec, etc.
+
+The provider will now be available in Llama Stack with the type `remote::ramalama`.
+
 ## Best Practices

 1. **Package Naming**: Use the prefix `llama-stack-provider-` for your provider packages to make them easily identifiable.
@@ -229,9 +291,10 @@ information. Execute the test for the Provider type you are developing.
 If your external provider isn't being loaded:

+1. Check that `module` points to a published pip package with a top level `provider` module including `get_provider_spec` (see the import check after this list).
 1. Check that the `external_providers_dir` path is correct and accessible.
 2. Verify that the YAML files are properly formatted.
 3. Ensure all required Python packages are installed.
 4. Check the Llama Stack server logs for any error messages - turn on debug logging to get more
    information using `LLAMA_STACK_LOGGING=all=debug`.
-5. Verify that the provider package is installed in your Python environment.
+5. Verify that the provider package is installed in your Python environment if using `external_providers_dir`.
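As a quick check for the `module` item above, you can confirm the package's `provider` module is importable and exposes `get_provider_spec` from the same Python environment that runs Llama Stack. This sketch assumes the `ramalama_stack` package used in the examples above; substitute your own package name:

```python
# Quick sanity check: import the provider module and print its spec.
import importlib

provider_module = importlib.import_module("ramalama_stack.provider")
print(provider_module.get_provider_spec())
```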
@@ -4,17 +4,13 @@ This section contains documentation for all available providers for the **infere
 - [inline::meta-reference](inline_meta-reference.md)
 - [inline::sentence-transformers](inline_sentence-transformers.md)
-- [inline::vllm](inline_vllm.md)
 - [remote::anthropic](remote_anthropic.md)
 - [remote::bedrock](remote_bedrock.md)
 - [remote::cerebras](remote_cerebras.md)
-- [remote::cerebras-openai-compat](remote_cerebras-openai-compat.md)
 - [remote::databricks](remote_databricks.md)
 - [remote::fireworks](remote_fireworks.md)
-- [remote::fireworks-openai-compat](remote_fireworks-openai-compat.md)
 - [remote::gemini](remote_gemini.md)
 - [remote::groq](remote_groq.md)
-- [remote::groq-openai-compat](remote_groq-openai-compat.md)
 - [remote::hf::endpoint](remote_hf_endpoint.md)
 - [remote::hf::serverless](remote_hf_serverless.md)
 - [remote::llama-openai-compat](remote_llama-openai-compat.md)
@@ -24,9 +20,7 @@ This section contains documentation for all available providers for the **infere
 - [remote::passthrough](remote_passthrough.md)
 - [remote::runpod](remote_runpod.md)
 - [remote::sambanova](remote_sambanova.md)
-- [remote::sambanova-openai-compat](remote_sambanova-openai-compat.md)
 - [remote::tgi](remote_tgi.md)
 - [remote::together](remote_together.md)
-- [remote::together-openai-compat](remote_together-openai-compat.md)
 - [remote::vllm](remote_vllm.md)
 - [remote::watsonx](remote_watsonx.md)
@@ -1,29 +0,0 @@
-# inline::vllm
-
-## Description
-
-vLLM inference provider for high-performance model serving with PagedAttention and continuous batching.
-
-## Configuration
-
-| Field | Type | Required | Default | Description |
-|-------|------|----------|---------|-------------|
-| `tensor_parallel_size` | `<class 'int'>` | No | 1 | Number of tensor parallel replicas (number of GPUs to use). |
-| `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
-| `max_model_len` | `<class 'int'>` | No | 4096 | Maximum context length to use during serving. |
-| `max_num_seqs` | `<class 'int'>` | No | 4 | Maximum parallel batch size for generation. |
-| `enforce_eager` | `<class 'bool'>` | No | False | Whether to use eager mode for inference (otherwise cuda graphs are used). |
-| `gpu_memory_utilization` | `<class 'float'>` | No | 0.3 | How much GPU memory will be allocated when this provider has finished loading, including memory that was already allocated before loading. |
-
-## Sample Configuration
-
-```yaml
-tensor_parallel_size: ${env.TENSOR_PARALLEL_SIZE:=1}
-max_tokens: ${env.MAX_TOKENS:=4096}
-max_model_len: ${env.MAX_MODEL_LEN:=4096}
-max_num_seqs: ${env.MAX_NUM_SEQS:=4}
-enforce_eager: ${env.ENFORCE_EAGER:=False}
-gpu_memory_utilization: ${env.GPU_MEMORY_UTILIZATION:=0.3}
-
-```
-
@@ -13,7 +13,7 @@ Anthropic inference provider for accessing Claude models and Anthropic's AI serv
 ## Sample Configuration

 ```yaml
-api_key: ${env.ANTHROPIC_API_KEY}
+api_key: ${env.ANTHROPIC_API_KEY:=}

 ```
@@ -15,7 +15,7 @@ Cerebras inference provider for running models on Cerebras Cloud platform.
 ```yaml
 base_url: https://api.cerebras.ai
-api_key: ${env.CEREBRAS_API_KEY}
+api_key: ${env.CEREBRAS_API_KEY:=}

 ```
@@ -14,8 +14,8 @@ Databricks inference provider for running models on Databricks' unified analytic
 ## Sample Configuration

 ```yaml
-url: ${env.DATABRICKS_URL}
-api_token: ${env.DATABRICKS_API_TOKEN}
+url: ${env.DATABRICKS_URL:=}
+api_token: ${env.DATABRICKS_API_TOKEN:=}

 ```
@@ -8,6 +8,7 @@ Fireworks AI inference provider for Llama models and other AI models on the Fire
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
+| `allowed_models` | `list[str] \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
 | `url` | `<class 'str'>` | No | https://api.fireworks.ai/inference/v1 | The URL for the Fireworks server |
 | `api_key` | `pydantic.types.SecretStr \| None` | No | | The Fireworks.ai API Key |
@@ -15,7 +16,7 @@ Fireworks AI inference provider for Llama models and other AI models on the Fire
 ```yaml
 url: https://api.fireworks.ai/inference/v1
-api_key: ${env.FIREWORKS_API_KEY}
+api_key: ${env.FIREWORKS_API_KEY:=}

 ```
@@ -13,7 +13,7 @@ Google Gemini inference provider for accessing Gemini models and Google's AI ser
 ## Sample Configuration

 ```yaml
-api_key: ${env.GEMINI_API_KEY}
+api_key: ${env.GEMINI_API_KEY:=}

 ```
@@ -15,7 +15,7 @@ Groq inference provider for ultra-fast inference using Groq's LPU technology.
 ```yaml
 url: https://api.groq.com
-api_key: ${env.GROQ_API_KEY}
+api_key: ${env.GROQ_API_KEY:=}

 ```
@@ -9,6 +9,7 @@ Ollama inference provider for running local models through the Ollama runtime.
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `url` | `<class 'str'>` | No | http://localhost:11434 | |
+| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |

 ## Sample Configuration

@@ -9,11 +9,13 @@ OpenAI inference provider for accessing GPT models and other OpenAI services.
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `api_key` | `str \| None` | No | | API key for OpenAI models |
+| `base_url` | `<class 'str'>` | No | https://api.openai.com/v1 | Base URL for OpenAI API |

 ## Sample Configuration

 ```yaml
-api_key: ${env.OPENAI_API_KEY}
+api_key: ${env.OPENAI_API_KEY:=}
+base_url: ${env.OPENAI_BASE_URL:=https://api.openai.com/v1}

 ```
@@ -15,7 +15,7 @@ SambaNova OpenAI-compatible provider for using SambaNova models with OpenAI API
 ```yaml
 openai_compat_api_base: https://api.sambanova.ai/v1
-api_key: ${env.SAMBANOVA_API_KEY}
+api_key: ${env.SAMBANOVA_API_KEY:=}

 ```
@@ -15,7 +15,7 @@ SambaNova inference provider for running models on SambaNova's dataflow architec
 ```yaml
 url: https://api.sambanova.ai/v1
-api_key: ${env.SAMBANOVA_API_KEY}
+api_key: ${env.SAMBANOVA_API_KEY:=}

 ```
@@ -13,7 +13,7 @@ Text Generation Inference (TGI) provider for HuggingFace model serving.
 ## Sample Configuration

 ```yaml
-url: ${env.TGI_URL}
+url: ${env.TGI_URL:=}

 ```
@@ -8,6 +8,7 @@ Together AI inference provider for open-source models and collaborative AI devel
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
+| `allowed_models` | `list[str] \| None` | No | | List of models that should be registered with the model registry. If None, all models are allowed. |
 | `url` | `<class 'str'>` | No | https://api.together.xyz/v1 | The URL for the Together AI server |
 | `api_key` | `pydantic.types.SecretStr \| None` | No | | The Together AI API Key |
@@ -15,7 +16,7 @@ Together AI inference provider for open-source models and collaborative AI devel
 ```yaml
 url: https://api.together.xyz/v1
-api_key: ${env.TOGETHER_API_KEY}
+api_key: ${env.TOGETHER_API_KEY:=}

 ```
@@ -12,11 +12,12 @@ Remote vLLM inference provider for connecting to vLLM servers.
 | `max_tokens` | `<class 'int'>` | No | 4096 | Maximum number of tokens to generate. |
 | `api_token` | `str \| None` | No | fake | The API token |
 | `tls_verify` | `bool \| str` | No | True | Whether to verify TLS certificates. Can be a boolean or a path to a CA certificate file. |
+| `refresh_models` | `<class 'bool'>` | No | False | Whether to refresh models periodically |

 ## Sample Configuration

 ```yaml
-url: ${env.VLLM_URL}
+url: ${env.VLLM_URL:=}
 max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
 api_token: ${env.VLLM_API_TOKEN:=fake}
 tls_verify: ${env.VLLM_TLS_VERIFY:=true}
@@ -15,7 +15,7 @@ SambaNova's safety provider for content moderation and safety filtering.
 ```yaml
 url: https://api.sambanova.ai/v1
-api_key: ${env.SAMBANOVA_API_KEY}
+api_key: ${env.SAMBANOVA_API_KEY:=}

 ```
@@ -42,11 +42,15 @@ See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introducti
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `db_path` | `<class 'str'>` | No | PydanticUndefined | |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |

 ## Sample Configuration

 ```yaml
 db_path: ${env.CHROMADB_PATH}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_inline_registry.db

 ```
@@ -41,11 +41,15 @@ See [Chroma's documentation](https://docs.trychroma.com/docs/overview/introducti
 | Field | Type | Required | Default | Description |
 |-------|------|----------|---------|-------------|
 | `url` | `str \| None` | No | PydanticUndefined | |
+| `kvstore` | `utils.kvstore.config.RedisKVStoreConfig \| utils.kvstore.config.SqliteKVStoreConfig \| utils.kvstore.config.PostgresKVStoreConfig \| utils.kvstore.config.MongoDBKVStoreConfig` | No | sqlite | Config for KV store backend |

 ## Sample Configuration

 ```yaml
 url: ${env.CHROMADB_URL}
+kvstore:
+  type: sqlite
+  db_path: ${env.SQLITE_STORE_DIR:=~/.llama/dummy}/chroma_remote_registry.db

 ```
@@ -17,7 +17,7 @@ That means you'll get fast and efficient vector retrieval.
 To use PGVector in your Llama Stack project, follow these steps:

 1. Install the necessary dependencies.
-2. Configure your Llama Stack project to use Faiss.
+2. Configure your Llama Stack project to use pgvector (e.g. `remote::pgvector`).
 3. Start storing and querying vectors.

 ## Installation