mirror of
https://github.com/meta-llama/llama-stack.git
synced 2025-07-07 06:20:45 +00:00
feat: consolidate most distros into "starter" (#2516)
# What does this PR do? * Removes a bunch of distros * Removed distros were added into the "starter" distribution * Doc for "starter" has been added * Partially reverts https://github.com/meta-llama/llama-stack/pull/2482 since inference providers are disabled by default and can be turned on manually via env variable. * Disables safety in starter distro Closes: https://github.com/meta-llama/llama-stack/issues/2502. ~Needs: https://github.com/meta-llama/llama-stack/pull/2482 for Ollama to work properly in the CI.~ TODO: - [ ] We can only update `install.sh` when we get a new release. - [x] Update providers documentation - [ ] Update notebooks to reference starter instead of ollama Signed-off-by: Sébastien Han <seb@redhat.com>
This commit is contained in:
parent
f77d4d91f5
commit
c4349f532b
132 changed files with 1009 additions and 10845 deletions
|
@ -141,9 +141,9 @@ You may then pick a template to build your distribution with providers fitted to
|
|||
|
||||
For example, to build a distribution with TGI as the inference provider, you can run:
|
||||
```
|
||||
$ llama stack build --template tgi
|
||||
$ llama stack build --template starter
|
||||
...
|
||||
You can now edit ~/.llama/distributions/llamastack-tgi/tgi-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml`
|
||||
You can now edit ~/.llama/distributions/llamastack-starter/starter-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-starter/starter-run.yaml`
|
||||
```
|
||||
:::
|
||||
:::{tab-item} Building from Scratch
|
||||
|
@ -183,26 +183,7 @@ You can now edit ~/.llama/distributions/llamastack-my-local-stack/my-local-stack
|
|||
- The config file will be of contents like the ones in `llama_stack/templates/*build.yaml`.
|
||||
|
||||
```
|
||||
$ cat llama_stack/templates/ollama/build.yaml
|
||||
|
||||
name: ollama
|
||||
distribution_spec:
|
||||
description: Like local, but use ollama for running LLM inference
|
||||
providers:
|
||||
inference: remote::ollama
|
||||
memory: inline::faiss
|
||||
safety: inline::llama-guard
|
||||
agents: inline::meta-reference
|
||||
telemetry: inline::meta-reference
|
||||
image_name: ollama
|
||||
image_type: conda
|
||||
|
||||
# If some providers are external, you can specify the path to the implementation
|
||||
external_providers_dir: ~/.llama/providers.d
|
||||
```
|
||||
|
||||
```
|
||||
llama stack build --config llama_stack/templates/ollama/build.yaml
|
||||
llama stack build --config llama_stack/templates/starter/build.yaml
|
||||
```
|
||||
:::
|
||||
|
||||
|
@ -268,11 +249,11 @@ Podman is supported as an alternative to Docker. Set `CONTAINER_BINARY` to `podm
|
|||
To build a container image, you may start off from a template and use the `--image-type container` flag to specify `container` as the build image type.
|
||||
|
||||
```
|
||||
llama stack build --template ollama --image-type container
|
||||
llama stack build --template starter --image-type container
|
||||
```
|
||||
|
||||
```
|
||||
$ llama stack build --template ollama --image-type container
|
||||
$ llama stack build --template starter --image-type container
|
||||
...
|
||||
Containerfile created successfully in /tmp/tmp.viA3a3Rdsg/ContainerfileFROM python:3.10-slim
|
||||
...
|
||||
|
|
|
@ -6,7 +6,7 @@ This avoids the overhead of setting up a server.
|
|||
```bash
|
||||
# setup
|
||||
uv pip install llama-stack
|
||||
llama stack build --template ollama --image-type venv
|
||||
llama stack build --template starter --image-type venv
|
||||
```
|
||||
|
||||
```python
|
||||
|
|
|
@ -1,51 +1,94 @@
|
|||
# Available List of Distributions
|
||||
# Available Distributions
|
||||
|
||||
Here are a list of distributions you can use to start a Llama Stack server that are provided out of the box.
|
||||
Llama Stack provides several pre-configured distributions to help you get started quickly. Choose the distribution that best fits your hardware and use case.
|
||||
|
||||
## Selection of a Distribution / Template
|
||||
## Quick Reference
|
||||
|
||||
Which templates / distributions to choose depends on the hardware you have for running LLM inference.
|
||||
| Distribution | Use Case | Hardware Requirements | Provider |
|
||||
|--------------|----------|----------------------|----------|
|
||||
| `distribution-starter` | General purpose, prototyping | Any (CPU/GPU) | Ollama, Remote APIs |
|
||||
| `distribution-meta-reference-gpu` | High-performance inference | GPU required | Local GPU inference |
|
||||
| Remote-hosted | Production, managed service | None | Partner providers |
|
||||
| iOS/Android SDK | Mobile applications | Mobile device | On-device inference |
|
||||
|
||||
- **Do you want a hosted Llama Stack endpoint?** If so, we suggest leveraging our partners who host Llama Stack endpoints. Namely, _fireworks.ai_ and _together.xyz_.
|
||||
- Read more about it here - [Remote-Hosted Endpoints](remote_hosted_distro/index).
|
||||
## Choose Your Distribution
|
||||
|
||||
### 🚀 Getting Started (Recommended for Beginners)
|
||||
|
||||
- **Do you have access to machines with GPUs?** If you wish to run Llama Stack locally or on a cloud instance and host your own Llama Stack endpoint, we suggest:
|
||||
- {dockerhub}`distribution-remote-vllm` ([Guide](self_hosted_distro/remote-vllm))
|
||||
- {dockerhub}`distribution-meta-reference-gpu` ([Guide](self_hosted_distro/meta-reference-gpu))
|
||||
- {dockerhub}`distribution-tgi` ([Guide](self_hosted_distro/tgi))
|
||||
- {dockerhub}`distribution-nvidia` ([Guide](self_hosted_distro/nvidia))
|
||||
**Use `distribution-starter` if you want to:**
|
||||
- Prototype quickly without GPU requirements
|
||||
- Use remote inference providers (Fireworks, Together, vLLM etc.)
|
||||
- Run locally with Ollama for development
|
||||
|
||||
- **Are you running on a "regular" desktop or laptop ?** We suggest using the ollama template for quick prototyping and get started without having to worry about needing GPUs.
|
||||
- {dockerhub}`distribution-ollama` ([Guide](self_hosted_distro/ollama))
|
||||
```bash
|
||||
docker pull llama-stack/distribution-starter
|
||||
```
|
||||
|
||||
- **Do you have an API key for a remote inference provider like Fireworks, Together, etc.?** If so, we suggest:
|
||||
- {dockerhub}`distribution-together` ([Guide](self_hosted_distro/together))
|
||||
- {dockerhub}`distribution-fireworks` ([Guide](self_hosted_distro/fireworks))
|
||||
**Guides:** [Starter Distribution Guide](self_hosted_distro/starter)
|
||||
|
||||
- **Do you want to run Llama Stack inference on your iOS / Android device?** Lastly, we also provide templates for running Llama Stack inference on your iOS / Android device:
|
||||
- [iOS SDK](ondevice_distro/ios_sdk)
|
||||
- [Android](ondevice_distro/android_sdk)
|
||||
### 🖥️ Self-Hosted with GPU
|
||||
|
||||
**Use `distribution-meta-reference-gpu` if you:**
|
||||
- Have access to GPU hardware
|
||||
- Want maximum performance and control
|
||||
- Need to run inference locally
|
||||
|
||||
- **If none of the above fit your needs, you can also build your own [custom distribution](building_distro.md).**
|
||||
```bash
|
||||
docker pull llama-stack/distribution-meta-reference-gpu
|
||||
```
|
||||
|
||||
### Distribution Details
|
||||
**Guides:** [Meta Reference GPU Guide](self_hosted_distro/meta-reference-gpu)
|
||||
|
||||
### ☁️ Managed Hosting
|
||||
|
||||
**Use remote-hosted endpoints if you:**
|
||||
- Don't want to manage infrastructure
|
||||
- Need production-ready reliability
|
||||
- Prefer managed services
|
||||
|
||||
**Partners:** [Fireworks.ai](https://fireworks.ai) and [Together.xyz](https://together.xyz)
|
||||
|
||||
**Guides:** [Remote-Hosted Endpoints](remote_hosted_distro/index)
|
||||
|
||||
### 📱 Mobile Development
|
||||
|
||||
**Use mobile SDKs if you:**
|
||||
- Are building iOS or Android applications
|
||||
- Need on-device inference capabilities
|
||||
- Want offline functionality
|
||||
|
||||
- [iOS SDK](ondevice_distro/ios_sdk)
|
||||
- [Android SDK](ondevice_distro/android_sdk)
|
||||
|
||||
### 🔧 Custom Solutions
|
||||
|
||||
**Build your own distribution if:**
|
||||
- None of the above fit your specific needs
|
||||
- You need custom configurations
|
||||
- You want to optimize for your specific use case
|
||||
|
||||
**Guides:** [Building Custom Distributions](building_distro.md)
|
||||
|
||||
## Detailed Documentation
|
||||
|
||||
### Self-Hosted Distributions
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
self_hosted_distro/starter
|
||||
self_hosted_distro/meta-reference-gpu
|
||||
```
|
||||
|
||||
### Remote-Hosted Solutions
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
||||
remote_hosted_distro/index
|
||||
self_hosted_distro/remote-vllm
|
||||
self_hosted_distro/meta-reference-gpu
|
||||
self_hosted_distro/tgi
|
||||
self_hosted_distro/nvidia
|
||||
self_hosted_distro/ollama
|
||||
self_hosted_distro/together
|
||||
self_hosted_distro/fireworks
|
||||
```
|
||||
|
||||
### On-Device Distributions
|
||||
### Mobile SDKs
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 1
|
||||
|
@ -53,3 +96,25 @@ self_hosted_distro/fireworks
|
|||
ondevice_distro/ios_sdk
|
||||
ondevice_distro/android_sdk
|
||||
```
|
||||
|
||||
## Decision Flow
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
A[What's your use case?] --> B{Need mobile app?}
|
||||
B -->|Yes| C[Use Mobile SDKs]
|
||||
B -->|No| D{Have GPU hardware?}
|
||||
D -->|Yes| E[Use Meta Reference GPU]
|
||||
D -->|No| F{Want managed hosting?}
|
||||
F -->|Yes| G[Use Remote-Hosted]
|
||||
F -->|No| H[Use Starter Distribution]
|
||||
```
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. **Choose your distribution** from the options above
|
||||
2. **Follow the setup guide** for your selected distribution
|
||||
3. **Configure your providers** with API keys or local models
|
||||
4. **Start building** with Llama Stack!
|
||||
|
||||
For help choosing or troubleshooting, check our [Getting Started Guide](../getting_started/index.md) or [Community Support](https://github.com/llama-stack/llama-stack/discussions).
|
||||
|
|
|
@ -1,79 +0,0 @@
|
|||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# Bedrock Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-bedrock` distribution consists of the following provider configurations:
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| inference | `remote::bedrock` |
|
||||
| safety | `remote::bedrock` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
|
||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `meta.llama3-1-8b-instruct-v1:0 (aliases: meta-llama/Llama-3.1-8B-Instruct)`
|
||||
- `meta.llama3-1-70b-instruct-v1:0 (aliases: meta-llama/Llama-3.1-70B-Instruct)`
|
||||
- `meta.llama3-1-405b-instruct-v1:0 (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
|
||||
|
||||
|
||||
### Prerequisite: API Keys
|
||||
|
||||
Make sure you have access to a AWS Bedrock API Key. You can get one by visiting [AWS Bedrock](https://aws.amazon.com/bedrock/).
|
||||
|
||||
|
||||
## Running Llama Stack with AWS Bedrock
|
||||
|
||||
You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
llamastack/distribution-bedrock \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
|
||||
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
|
||||
--env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
|
||||
--env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
```bash
|
||||
llama stack build --template bedrock --image-type conda
|
||||
llama stack run ./run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
|
||||
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
|
||||
--env AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
|
||||
--env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
|
||||
```
|
|
@ -1,67 +0,0 @@
|
|||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# Cerebras Distribution
|
||||
|
||||
The `llamastack/distribution-cerebras` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| inference | `remote::cerebras`, `inline::sentence-transformers` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime` |
|
||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
- `CEREBRAS_API_KEY`: Cerebras API Key (default: ``)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `llama3.1-8b (aliases: meta-llama/Llama-3.1-8B-Instruct)`
|
||||
- `llama-3.3-70b (aliases: meta-llama/Llama-3.3-70B-Instruct)`
|
||||
|
||||
|
||||
### Prerequisite: API Keys
|
||||
|
||||
Make sure you have access to a Cerebras API Key. You can get one by visiting [cloud.cerebras.ai](https://cloud.cerebras.ai/).
|
||||
|
||||
|
||||
## Running Llama Stack with Cerebras
|
||||
|
||||
You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ./run.yaml:/root/my-run.yaml \
|
||||
llamastack/distribution-cerebras \
|
||||
--config /root/my-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
```bash
|
||||
llama stack build --template cerebras --image-type conda
|
||||
llama stack run ./run.yaml \
|
||||
--port 8321 \
|
||||
--env CEREBRAS_API_KEY=$CEREBRAS_API_KEY
|
||||
```
|
|
@ -1,86 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# Fireworks Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-fireworks` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| files | `inline::localfs` |
|
||||
| inference | `remote::fireworks`, `inline::sentence-transformers` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `remote::wolfram-alpha`, `inline::rag-runtime`, `remote::model-context-protocol` |
|
||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
- `FIREWORKS_API_KEY`: Fireworks.AI API Key (default: ``)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `accounts/fireworks/models/llama-v3p1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
|
||||
- `accounts/fireworks/models/llama-v3p1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)`
|
||||
- `accounts/fireworks/models/llama-v3p1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
|
||||
- `accounts/fireworks/models/llama-v3p2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
|
||||
- `accounts/fireworks/models/llama-v3p2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
|
||||
- `accounts/fireworks/models/llama-v3p2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
|
||||
- `accounts/fireworks/models/llama-v3p3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
|
||||
- `accounts/fireworks/models/llama-guard-3-8b (aliases: meta-llama/Llama-Guard-3-8B)`
|
||||
- `accounts/fireworks/models/llama-guard-3-11b-vision (aliases: meta-llama/Llama-Guard-3-11B-Vision)`
|
||||
- `accounts/fireworks/models/llama4-scout-instruct-basic (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct)`
|
||||
- `accounts/fireworks/models/llama4-maverick-instruct-basic (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct)`
|
||||
- `nomic-ai/nomic-embed-text-v1.5 `
|
||||
|
||||
|
||||
### Prerequisite: API Keys
|
||||
|
||||
Make sure you have access to a Fireworks API Key. You can get one by visiting [fireworks.ai](https://fireworks.ai/).
|
||||
|
||||
|
||||
## Running Llama Stack with Fireworks
|
||||
|
||||
You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
llamastack/distribution-fireworks \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env FIREWORKS_API_KEY=$FIREWORKS_API_KEY
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
```bash
|
||||
llama stack build --template fireworks --image-type conda
|
||||
llama stack run ./run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env FIREWORKS_API_KEY=$FIREWORKS_API_KEY
|
||||
```
|
|
@ -1,82 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# Groq Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-groq` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| inference | `remote::groq` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime` |
|
||||
| vector_io | `inline::faiss` |
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
- `GROQ_API_KEY`: Groq API Key (default: ``)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `groq/llama3-8b-8192 (aliases: meta-llama/Llama-3.1-8B-Instruct)`
|
||||
- `groq/llama-3.1-8b-instant `
|
||||
- `groq/llama3-70b-8192 (aliases: meta-llama/Llama-3-70B-Instruct)`
|
||||
- `groq/llama-3.3-70b-versatile (aliases: meta-llama/Llama-3.3-70B-Instruct)`
|
||||
- `groq/llama-3.2-3b-preview (aliases: meta-llama/Llama-3.2-3B-Instruct)`
|
||||
- `groq/llama-4-scout-17b-16e-instruct (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct)`
|
||||
- `groq/meta-llama/llama-4-scout-17b-16e-instruct (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct)`
|
||||
- `groq/llama-4-maverick-17b-128e-instruct (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct)`
|
||||
- `groq/meta-llama/llama-4-maverick-17b-128e-instruct (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct)`
|
||||
|
||||
|
||||
### Prerequisite: API Keys
|
||||
|
||||
Make sure you have access to a Groq API Key. You can get one by visiting [Groq](https://api.groq.com/).
|
||||
|
||||
|
||||
## Running Llama Stack with Groq
|
||||
|
||||
You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
llamastack/distribution-groq \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env GROQ_API_KEY=$GROQ_API_KEY
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
```bash
|
||||
llama stack build --template groq --image-type conda
|
||||
llama stack run ./run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env GROQ_API_KEY=$GROQ_API_KEY
|
||||
```
|
|
@ -1,177 +0,0 @@
|
|||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# NVIDIA Distribution
|
||||
|
||||
The `llamastack/distribution-nvidia` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `inline::localfs`, `remote::nvidia` |
|
||||
| eval | `remote::nvidia` |
|
||||
| inference | `remote::nvidia` |
|
||||
| post_training | `remote::nvidia` |
|
||||
| safety | `remote::nvidia` |
|
||||
| scoring | `inline::basic` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `inline::rag-runtime` |
|
||||
| vector_io | `inline::faiss` |
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `NVIDIA_API_KEY`: NVIDIA API Key (default: ``)
|
||||
- `NVIDIA_APPEND_API_VERSION`: Whether to append the API version to the base_url (default: `True`)
|
||||
- `NVIDIA_DATASET_NAMESPACE`: NVIDIA Dataset Namespace (default: `default`)
|
||||
- `NVIDIA_PROJECT_ID`: NVIDIA Project ID (default: `test-project`)
|
||||
- `NVIDIA_CUSTOMIZER_URL`: NVIDIA Customizer URL (default: `https://customizer.api.nvidia.com`)
|
||||
- `NVIDIA_OUTPUT_MODEL_DIR`: NVIDIA Output Model Directory (default: `test-example-model@v1`)
|
||||
- `GUARDRAILS_SERVICE_URL`: URL for the NeMo Guardrails Service (default: `http://0.0.0.0:7331`)
|
||||
- `NVIDIA_GUARDRAILS_CONFIG_ID`: NVIDIA Guardrail Configuration ID (default: `self-check`)
|
||||
- `NVIDIA_EVALUATOR_URL`: URL for the NeMo Evaluator Service (default: `http://0.0.0.0:7331`)
|
||||
- `INFERENCE_MODEL`: Inference model (default: `Llama3.1-8B-Instruct`)
|
||||
- `SAFETY_MODEL`: Name of the model to use for safety (default: `meta/llama-3.1-8b-instruct`)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `meta/llama3-8b-instruct (aliases: meta-llama/Llama-3-8B-Instruct)`
|
||||
- `meta/llama3-70b-instruct (aliases: meta-llama/Llama-3-70B-Instruct)`
|
||||
- `meta/llama-3.1-8b-instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
|
||||
- `meta/llama-3.1-70b-instruct (aliases: meta-llama/Llama-3.1-70B-Instruct)`
|
||||
- `meta/llama-3.1-405b-instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
|
||||
- `meta/llama-3.2-1b-instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)`
|
||||
- `meta/llama-3.2-3b-instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
|
||||
- `meta/llama-3.2-11b-vision-instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
|
||||
- `meta/llama-3.2-90b-vision-instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
|
||||
- `meta/llama-3.3-70b-instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
|
||||
- `nvidia/llama-3.2-nv-embedqa-1b-v2 `
|
||||
- `nvidia/nv-embedqa-e5-v5 `
|
||||
- `nvidia/nv-embedqa-mistral-7b-v2 `
|
||||
- `snowflake/arctic-embed-l `
|
||||
|
||||
|
||||
## Prerequisites
|
||||
### NVIDIA API Keys
|
||||
|
||||
Make sure you have access to a NVIDIA API Key. You can get one by visiting [https://build.nvidia.com/](https://build.nvidia.com/). Use this key for the `NVIDIA_API_KEY` environment variable.
|
||||
|
||||
### Deploy NeMo Microservices Platform
|
||||
The NVIDIA NeMo microservices platform supports end-to-end microservice deployment of a complete AI flywheel on your Kubernetes cluster through the NeMo Microservices Helm Chart. Please reference the [NVIDIA NeMo Microservices documentation](https://docs.nvidia.com/nemo/microservices/latest/about/index.html) for platform prerequisites and instructions to install and deploy the platform.
|
||||
|
||||
## Supported Services
|
||||
Each Llama Stack API corresponds to a specific NeMo microservice. The core microservices (Customizer, Evaluator, Guardrails) are exposed by the same endpoint. The platform components (Data Store) are each exposed by separate endpoints.
|
||||
|
||||
### Inference: NVIDIA NIM
|
||||
NVIDIA NIM is used for running inference with registered models. There are two ways to access NVIDIA NIMs:
|
||||
1. Hosted (default): Preview APIs hosted at https://integrate.api.nvidia.com (Requires an API key)
|
||||
2. Self-hosted: NVIDIA NIMs that run on your own infrastructure.
|
||||
|
||||
The deployed platform includes the NIM Proxy microservice, which is the service that provides to access your NIMs (for example, to run inference on a model). Set the `NVIDIA_BASE_URL` environment variable to use your NVIDIA NIM Proxy deployment.
|
||||
|
||||
### Datasetio API: NeMo Data Store
|
||||
The NeMo Data Store microservice serves as the default file storage solution for the NeMo microservices platform. It exposts APIs compatible with the Hugging Face Hub client (`HfApi`), so you can use the client to interact with Data Store. The `NVIDIA_DATASETS_URL` environment variable should point to your NeMo Data Store endpoint.
|
||||
|
||||
See the {repopath}`NVIDIA Datasetio docs::llama_stack/providers/remote/datasetio/nvidia/README.md` for supported features and example usage.
|
||||
|
||||
### Eval API: NeMo Evaluator
|
||||
The NeMo Evaluator microservice supports evaluation of LLMs. Launching an Evaluation job with NeMo Evaluator requires an Evaluation Config (an object that contains metadata needed by the job). A Llama Stack Benchmark maps to an Evaluation Config, so registering a Benchmark creates an Evaluation Config in NeMo Evaluator. The `NVIDIA_EVALUATOR_URL` environment variable should point to your NeMo Microservices endpoint.
|
||||
|
||||
See the {repopath}`NVIDIA Eval docs::llama_stack/providers/remote/eval/nvidia/README.md` for supported features and example usage.
|
||||
|
||||
### Post-Training API: NeMo Customizer
|
||||
The NeMo Customizer microservice supports fine-tuning models. You can reference {repopath}`this list of supported models::llama_stack/providers/remote/post_training/nvidia/models.py` that can be fine-tuned using Llama Stack. The `NVIDIA_CUSTOMIZER_URL` environment variable should point to your NeMo Microservices endpoint.
|
||||
|
||||
See the {repopath}`NVIDIA Post-Training docs::llama_stack/providers/remote/post_training/nvidia/README.md` for supported features and example usage.
|
||||
|
||||
### Safety API: NeMo Guardrails
|
||||
The NeMo Guardrails microservice sits between your application and the LLM, and adds checks and content moderation to a model. The `GUARDRAILS_SERVICE_URL` environment variable should point to your NeMo Microservices endpoint.
|
||||
|
||||
See the {repopath}`NVIDIA Safety docs::llama_stack/providers/remote/safety/nvidia/README.md` for supported features and example usage.
|
||||
|
||||
## Deploying models
|
||||
In order to use a registered model with the Llama Stack APIs, ensure the corresponding NIM is deployed to your environment. For example, you can use the NIM Proxy microservice to deploy `meta/llama-3.2-1b-instruct`.
|
||||
|
||||
Note: For improved inference speeds, we need to use NIM with `fast_outlines` guided decoding system (specified in the request body). This is the default if you deployed the platform with the NeMo Microservices Helm Chart.
|
||||
```sh
|
||||
# URL to NeMo NIM Proxy service
|
||||
export NEMO_URL="http://nemo.test"
|
||||
|
||||
curl --location "$NEMO_URL/v1/deployment/model-deployments" \
|
||||
-H 'accept: application/json' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"name": "llama-3.2-1b-instruct",
|
||||
"namespace": "meta",
|
||||
"config": {
|
||||
"model": "meta/llama-3.2-1b-instruct",
|
||||
"nim_deployment": {
|
||||
"image_name": "nvcr.io/nim/meta/llama-3.2-1b-instruct",
|
||||
"image_tag": "1.8.3",
|
||||
"pvc_size": "25Gi",
|
||||
"gpu": 1,
|
||||
"additional_envs": {
|
||||
"NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
|
||||
}
|
||||
}
|
||||
}
|
||||
}'
|
||||
```
|
||||
This NIM deployment should take approximately 10 minutes to go live. [See the docs](https://docs.nvidia.com/nemo/microservices/latest/get-started/tutorials/deploy-nims.html) for more information on how to deploy a NIM and verify it's available for inference.
|
||||
|
||||
You can also remove a deployed NIM to free up GPU resources, if needed.
|
||||
```sh
|
||||
export NEMO_URL="http://nemo.test"
|
||||
|
||||
curl -X DELETE "$NEMO_URL/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"
|
||||
```
|
||||
|
||||
## Running Llama Stack with NVIDIA
|
||||
|
||||
You can do this via Conda or venv (build code), or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ./run.yaml:/root/my-run.yaml \
|
||||
llamastack/distribution-nvidia \
|
||||
--config /root/my-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env NVIDIA_API_KEY=$NVIDIA_API_KEY
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
```bash
|
||||
INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
|
||||
llama stack build --template nvidia --image-type conda
|
||||
llama stack run ./run.yaml \
|
||||
--port 8321 \
|
||||
--env NVIDIA_API_KEY=$NVIDIA_API_KEY \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL
|
||||
```
|
||||
|
||||
### Via venv
|
||||
|
||||
If you've set up your local development environment, you can also build the image using your local virtual environment.
|
||||
|
||||
```bash
|
||||
INFERENCE_MODEL=meta-llama/Llama-3.1-8b-Instruct
|
||||
llama stack build --template nvidia --image-type venv
|
||||
llama stack run ./run.yaml \
|
||||
--port 8321 \
|
||||
--env NVIDIA_API_KEY=$NVIDIA_API_KEY \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL
|
||||
```
|
||||
|
||||
## Example Notebooks
|
||||
For examples of how to use the NVIDIA Distribution to run inference, fine-tune, evaluate, and run safety checks on your LLMs, you can reference the example notebooks in {repopath}`docs/notebooks/nvidia`.
|
|
@ -1,165 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# Ollama Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-ollama` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| files | `inline::localfs` |
|
||||
| inference | `remote::ollama` |
|
||||
| post_training | `inline::huggingface` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
|
||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
You should use this distribution if you have a regular desktop machine without very powerful GPUs. Of course, if you have powerful GPUs, you can still continue using this distribution since Ollama supports GPU acceleration.
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
- `OLLAMA_URL`: URL of the Ollama server (default: `http://127.0.0.1:11434`)
|
||||
- `INFERENCE_MODEL`: Inference model loaded into the Ollama server (default: `meta-llama/Llama-3.2-3B-Instruct`)
|
||||
- `SAFETY_MODEL`: Safety model loaded into the Ollama server (default: `meta-llama/Llama-Guard-3-1B`)
|
||||
|
||||
|
||||
## Setting up Ollama server
|
||||
|
||||
Please check the [Ollama Documentation](https://github.com/ollama/ollama) on how to install and run Ollama. After installing Ollama, you need to run `ollama serve` to start the server.
|
||||
|
||||
In order to load models, you can run:
|
||||
|
||||
```bash
|
||||
export INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct"
|
||||
|
||||
# ollama names this model differently, and we must use the ollama name when loading the model
|
||||
export OLLAMA_INFERENCE_MODEL="llama3.2:3b-instruct-fp16"
|
||||
ollama run $OLLAMA_INFERENCE_MODEL --keepalive 60m
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, you will also need to pull and run the safety model.
|
||||
|
||||
```bash
|
||||
export SAFETY_MODEL="meta-llama/Llama-Guard-3-1B"
|
||||
|
||||
# ollama names this model differently, and we must use the ollama name when loading the model
|
||||
export OLLAMA_SAFETY_MODEL="llama-guard3:1b"
|
||||
ollama run $OLLAMA_SAFETY_MODEL --keepalive 60m
|
||||
```
|
||||
|
||||
## Running Llama Stack
|
||||
|
||||
Now you are ready to run Llama Stack with Ollama as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
export LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ~/.llama:/root/.llama \
|
||||
llamastack/distribution-ollama \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env OLLAMA_URL=http://host.docker.internal:11434
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, use:
|
||||
|
||||
```bash
|
||||
# You need a local checkout of llama-stack to run this, get it using
|
||||
# git clone https://github.com/meta-llama/llama-stack.git
|
||||
cd /path/to/llama-stack
|
||||
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ~/.llama:/root/.llama \
|
||||
-v ./llama_stack/templates/ollama/run-with-safety.yaml:/root/my-run.yaml \
|
||||
llamastack/distribution-ollama \
|
||||
--config /root/my-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env SAFETY_MODEL=$SAFETY_MODEL \
|
||||
--env OLLAMA_URL=http://host.docker.internal:11434
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
|
||||
|
||||
```bash
|
||||
export LLAMA_STACK_PORT=8321
|
||||
|
||||
llama stack build --template ollama --image-type conda
|
||||
llama stack run ./run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env OLLAMA_URL=http://localhost:11434
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, use:
|
||||
|
||||
```bash
|
||||
llama stack run ./run-with-safety.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env SAFETY_MODEL=$SAFETY_MODEL \
|
||||
--env OLLAMA_URL=http://localhost:11434
|
||||
```
|
||||
|
||||
|
||||
### (Optional) Update Model Serving Configuration
|
||||
|
||||
```{note}
|
||||
Please check the [model_entries](https://github.com/meta-llama/llama-stack/blob/main/llama_stack/providers/remote/inference/ollama/models.py) for the supported Ollama models.
|
||||
```
|
||||
|
||||
To serve a new model with `ollama`
|
||||
```bash
|
||||
ollama run <model_name>
|
||||
```
|
||||
|
||||
To make sure that the model is being served correctly, run `ollama ps` to get a list of models being served by ollama.
|
||||
```
|
||||
$ ollama ps
|
||||
NAME ID SIZE PROCESSOR UNTIL
|
||||
llama3.2:3b-instruct-fp16 195a8c01d91e 8.6 GB 100% GPU 9 minutes from now
|
||||
```
|
||||
|
||||
To verify that the model served by ollama is correctly connected to Llama Stack server
|
||||
```bash
|
||||
$ llama-stack-client models list
|
||||
|
||||
Available Models
|
||||
|
||||
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┓
|
||||
┃ model_type ┃ identifier ┃ provider_resource_id ┃ metadata ┃ provider_id ┃
|
||||
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━┩
|
||||
│ llm │ meta-llama/Llama-3.2-3B-Instruct │ llama3.2:3b-instruct-fp16 │ │ ollama │
|
||||
└──────────────┴──────────────────────────────────────┴──────────────────────────────┴───────────┴─────────────┘
|
||||
|
||||
Total models: 1
|
||||
```
|
|
@ -1,297 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# Remote vLLM Distribution
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-remote-vllm` distribution consists of the following provider configurations:
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| inference | `remote::vllm`, `inline::sentence-transformers` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
|
||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
You can use this distribution if you want to run an independent vLLM server for inference.
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
- `INFERENCE_MODEL`: Inference model loaded into the vLLM server (default: `meta-llama/Llama-3.2-3B-Instruct`)
|
||||
- `VLLM_URL`: URL of the vLLM server with the main inference model (default: `http://host.docker.internal:5100/v1`)
|
||||
- `MAX_TOKENS`: Maximum number of tokens for generation (default: `4096`)
|
||||
- `SAFETY_VLLM_URL`: URL of the vLLM server with the safety model (default: `http://host.docker.internal:5101/v1`)
|
||||
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
|
||||
|
||||
|
||||
## Setting up vLLM server
|
||||
|
||||
In the following sections, we'll use AMD, NVIDIA or Intel GPUs to serve as hardware accelerators for the vLLM
|
||||
server, which acts as both the LLM inference provider and the safety provider. Note that vLLM also
|
||||
[supports many other hardware accelerators](https://docs.vllm.ai/en/latest/getting_started/installation.html) and
|
||||
that we only use GPUs here for demonstration purposes. Note that if you run into issues, you can include the environment variable `--env VLLM_DEBUG_LOG_API_SERVER_RESPONSE=true` (available in vLLM v0.8.3 and above) in the `docker run` command to enable log response from API server for debugging.
|
||||
|
||||
### Setting up vLLM server on AMD GPU
|
||||
|
||||
AMD provides two main vLLM container options:
|
||||
- rocm/vllm: Production-ready container
|
||||
- rocm/vllm-dev: Development container with the latest vLLM features
|
||||
|
||||
Please check the [Blog about ROCm vLLM Usage](https://rocm.blogs.amd.com/software-tools-optimization/vllm-container/README.html) to get more details.
|
||||
|
||||
Here is a sample script to start a ROCm vLLM server locally via Docker:
|
||||
|
||||
```bash
|
||||
export INFERENCE_PORT=8000
|
||||
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
||||
export CUDA_VISIBLE_DEVICES=0
|
||||
export VLLM_DIMG="rocm/vllm-dev:main"
|
||||
|
||||
docker run \
|
||||
--pull always \
|
||||
--ipc=host \
|
||||
--privileged \
|
||||
--shm-size 16g \
|
||||
--device=/dev/kfd \
|
||||
--device=/dev/dri \
|
||||
--group-add video \
|
||||
--cap-add=SYS_PTRACE \
|
||||
--cap-add=CAP_SYS_ADMIN \
|
||||
--security-opt seccomp=unconfined \
|
||||
--security-opt apparmor=unconfined \
|
||||
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
||||
--env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
|
||||
-p $INFERENCE_PORT:$INFERENCE_PORT \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
$VLLM_DIMG \
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--model $INFERENCE_MODEL \
|
||||
--port $INFERENCE_PORT
|
||||
```
|
||||
|
||||
Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html).
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
|
||||
|
||||
```bash
|
||||
export SAFETY_PORT=8081
|
||||
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
||||
export CUDA_VISIBLE_DEVICES=1
|
||||
export VLLM_DIMG="rocm/vllm-dev:main"
|
||||
|
||||
docker run \
|
||||
--pull always \
|
||||
--ipc=host \
|
||||
--privileged \
|
||||
--shm-size 16g \
|
||||
--device=/dev/kfd \
|
||||
--device=/dev/dri \
|
||||
--group-add video \
|
||||
--cap-add=SYS_PTRACE \
|
||||
--cap-add=CAP_SYS_ADMIN \
|
||||
--security-opt seccomp=unconfined \
|
||||
--security-opt apparmor=unconfined \
|
||||
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
||||
--env "HIP_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" \
|
||||
-p $SAFETY_PORT:$SAFETY_PORT \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
$VLLM_DIMG \
|
||||
python -m vllm.entrypoints.openai.api_server \
|
||||
--model $SAFETY_MODEL \
|
||||
--port $SAFETY_PORT
|
||||
```
|
||||
|
||||
### Setting up vLLM server on NVIDIA GPU
|
||||
|
||||
Please check the [vLLM Documentation](https://docs.vllm.ai/en/v0.5.5/serving/deploying_with_docker.html) to get a vLLM endpoint. Here is a sample script to start a vLLM server locally via Docker:
|
||||
|
||||
```bash
|
||||
export INFERENCE_PORT=8000
|
||||
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
||||
export CUDA_VISIBLE_DEVICES=0
|
||||
|
||||
docker run \
|
||||
--pull always \
|
||||
--runtime nvidia \
|
||||
--gpus $CUDA_VISIBLE_DEVICES \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
||||
-p $INFERENCE_PORT:$INFERENCE_PORT \
|
||||
--ipc=host \
|
||||
vllm/vllm-openai:latest \
|
||||
--gpu-memory-utilization 0.7 \
|
||||
--model $INFERENCE_MODEL \
|
||||
--port $INFERENCE_PORT
|
||||
```
|
||||
|
||||
Note that you'll also need to set `--enable-auto-tool-choice` and `--tool-call-parser` to [enable tool calling in vLLM](https://docs.vllm.ai/en/latest/features/tool_calling.html).
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
|
||||
|
||||
```bash
|
||||
export SAFETY_PORT=8081
|
||||
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
||||
export CUDA_VISIBLE_DEVICES=1
|
||||
|
||||
docker run \
|
||||
--pull always \
|
||||
--runtime nvidia \
|
||||
--gpus $CUDA_VISIBLE_DEVICES \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
||||
-p $SAFETY_PORT:$SAFETY_PORT \
|
||||
--ipc=host \
|
||||
vllm/vllm-openai:latest \
|
||||
--gpu-memory-utilization 0.7 \
|
||||
--model $SAFETY_MODEL \
|
||||
--port $SAFETY_PORT
|
||||
```
|
||||
|
||||
### Setting up vLLM server on Intel GPU
|
||||
|
||||
Refer to [vLLM Documentation for XPU](https://docs.vllm.ai/en/v0.8.2/getting_started/installation/gpu.html?device=xpu) to get a vLLM endpoint. In addition to vLLM side setup which guides towards installing vLLM from sources orself-building vLLM Docker container, Intel provides prebuilt vLLM container to use on systems with Intel GPUs supported by PyTorch XPU backend:
|
||||
- [intel/vllm](https://hub.docker.com/r/intel/vllm)
|
||||
|
||||
Here is a sample script to start a vLLM server locally via Docker using Intel provided container:
|
||||
|
||||
```bash
|
||||
export INFERENCE_PORT=8000
|
||||
export INFERENCE_MODEL=meta-llama/Llama-3.2-1B-Instruct
|
||||
export ZE_AFFINITY_MASK=0
|
||||
|
||||
docker run \
|
||||
--pull always \
|
||||
--device /dev/dri \
|
||||
-v /dev/dri/by-path:/dev/dri/by-path \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
||||
--env ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK \
|
||||
-p $INFERENCE_PORT:$INFERENCE_PORT \
|
||||
--ipc=host \
|
||||
intel/vllm:xpu \
|
||||
--gpu-memory-utilization 0.7 \
|
||||
--model $INFERENCE_MODEL \
|
||||
--port $INFERENCE_PORT
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a vLLM with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
|
||||
|
||||
```bash
|
||||
export SAFETY_PORT=8081
|
||||
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
||||
export ZE_AFFINITY_MASK=1
|
||||
|
||||
docker run \
|
||||
--pull always \
|
||||
--device /dev/dri \
|
||||
-v /dev/dri/by-path:/dev/dri/by-path \
|
||||
-v ~/.cache/huggingface:/root/.cache/huggingface \
|
||||
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
|
||||
--env ZE_AFFINITY_MASK=$ZE_AFFINITY_MASK \
|
||||
-p $SAFETY_PORT:$SAFETY_PORT \
|
||||
--ipc=host \
|
||||
intel/vllm:xpu \
|
||||
--gpu-memory-utilization 0.7 \
|
||||
--model $SAFETY_MODEL \
|
||||
--port $SAFETY_PORT
|
||||
```
|
||||
|
||||
## Running Llama Stack
|
||||
|
||||
Now you are ready to run Llama Stack with vLLM as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
export INFERENCE_PORT=8000
|
||||
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
||||
export LLAMA_STACK_PORT=8321
|
||||
|
||||
# You need a local checkout of llama-stack to run this, get it using
|
||||
# git clone https://github.com/meta-llama/llama-stack.git
|
||||
cd /path/to/llama-stack
|
||||
|
||||
docker run \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ./llama_stack/templates/remote-vllm/run.yaml:/root/my-run.yaml \
|
||||
llamastack/distribution-remote-vllm \
|
||||
--config /root/my-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env VLLM_URL=http://host.docker.internal:$INFERENCE_PORT/v1
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, use:
|
||||
|
||||
```bash
|
||||
export SAFETY_PORT=8081
|
||||
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
||||
|
||||
# You need a local checkout of llama-stack to run this, get it using
|
||||
# git clone https://github.com/meta-llama/llama-stack.git
|
||||
cd /path/to/llama-stack
|
||||
|
||||
docker run \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ~/.llama:/root/.llama \
|
||||
-v ./llama_stack/templates/remote-vllm/run-with-safety.yaml:/root/my-run.yaml \
|
||||
llamastack/distribution-remote-vllm \
|
||||
--config /root/my-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env VLLM_URL=http://host.docker.internal:$INFERENCE_PORT/v1 \
|
||||
--env SAFETY_MODEL=$SAFETY_MODEL \
|
||||
--env SAFETY_VLLM_URL=http://host.docker.internal:$SAFETY_PORT/v1
|
||||
```
|
||||
|
||||
|
||||
### Via Conda
|
||||
|
||||
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
|
||||
|
||||
```bash
|
||||
export INFERENCE_PORT=8000
|
||||
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
||||
export LLAMA_STACK_PORT=8321
|
||||
|
||||
cd distributions/remote-vllm
|
||||
llama stack build --template remote-vllm --image-type conda
|
||||
|
||||
llama stack run ./run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env VLLM_URL=http://localhost:$INFERENCE_PORT/v1
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, use:
|
||||
|
||||
```bash
|
||||
export SAFETY_PORT=8081
|
||||
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
||||
|
||||
llama stack run ./run-with-safety.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env VLLM_URL=http://localhost:$INFERENCE_PORT/v1 \
|
||||
--env SAFETY_MODEL=$SAFETY_MODEL \
|
||||
--env SAFETY_VLLM_URL=http://localhost:$SAFETY_PORT/v1
|
||||
```
|
|
@ -1,91 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# SambaNova Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-sambanova` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| inference | `remote::sambanova`, `inline::sentence-transformers` |
|
||||
| safety | `remote::sambanova` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
|
||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMASTACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
- `SAMBANOVA_API_KEY`: SambaNova API Key (default: ``)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `sambanova/Meta-Llama-3.1-8B-Instruct (aliases: meta-llama/Llama-3.1-8B-Instruct)`
|
||||
- `sambanova/Meta-Llama-3.1-405B-Instruct (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
|
||||
- `sambanova/Meta-Llama-3.2-1B-Instruct (aliases: meta-llama/Llama-3.2-1B-Instruct)`
|
||||
- `sambanova/Meta-Llama-3.2-3B-Instruct (aliases: meta-llama/Llama-3.2-3B-Instruct)`
|
||||
- `sambanova/Meta-Llama-3.3-70B-Instruct (aliases: meta-llama/Llama-3.3-70B-Instruct)`
|
||||
- `sambanova/Llama-3.2-11B-Vision-Instruct (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
|
||||
- `sambanova/Llama-3.2-90B-Vision-Instruct (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
|
||||
- `sambanova/Llama-4-Scout-17B-16E-Instruct (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct)`
|
||||
- `sambanova/Llama-4-Maverick-17B-128E-Instruct (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct)`
|
||||
- `sambanova/Meta-Llama-Guard-3-8B (aliases: meta-llama/Llama-Guard-3-8B)`
|
||||
|
||||
|
||||
### Prerequisite: API Keys
|
||||
|
||||
Make sure you have access to a SambaNova API Key. You can get one by visiting [SambaNova.ai](http://cloud.sambanova.ai?utm_source=llamastack&utm_medium=external&utm_campaign=cloud_signup).
|
||||
|
||||
|
||||
## Running Llama Stack with SambaNova
|
||||
|
||||
You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
|
||||
### Via Docker
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
llama stack build --template sambanova --image-type container
|
||||
docker run \
|
||||
-it \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ~/.llama:/root/.llama \
|
||||
distribution-sambanova \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
|
||||
```
|
||||
|
||||
|
||||
### Via Venv
|
||||
|
||||
```bash
|
||||
llama stack build --template sambanova --image-type venv
|
||||
llama stack run --image-type venv ~/.llama/distributions/sambanova/sambanova-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
|
||||
```
|
||||
|
||||
|
||||
### Via Conda
|
||||
|
||||
```bash
|
||||
llama stack build --template sambanova --image-type conda
|
||||
llama stack run --image-type conda ~/.llama/distributions/sambanova/sambanova-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env SAMBANOVA_API_KEY=$SAMBANOVA_API_KEY
|
||||
```
|
259
docs/source/distributions/self_hosted_distro/starter.md
Normal file
259
docs/source/distributions/self_hosted_distro/starter.md
Normal file
|
@ -0,0 +1,259 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# Starter Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-starter` distribution is a comprehensive, multi-provider distribution that includes most of the available inference providers in Llama Stack. It's designed to be a one-stop solution for developers who want to experiment with different AI providers without having to configure each one individually.
|
||||
|
||||
## Provider Composition
|
||||
|
||||
The starter distribution consists of the following provider configurations:
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| files | `inline::localfs` |
|
||||
| inference | `remote::openai`, `remote::fireworks`, `remote::together`, `remote::ollama`, `remote::anthropic`, `remote::gemini`, `remote::groq`, `remote::sambanova`, `remote::vllm`, `remote::tgi`, `remote::cerebras`, `remote::llama-openai-compat`, `remote::nvidia`, `remote::hf::serverless`, `remote::hf::endpoint`, `inline::sentence-transformers` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
|
||||
| vector_io | `inline::faiss`, `inline::sqlite-vec`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
## Inference Providers
|
||||
|
||||
The starter distribution includes a comprehensive set of inference providers:
|
||||
|
||||
### Hosted Providers
|
||||
- **[OpenAI](https://openai.com/api/)**: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings -
|
||||
provider ID: `openai` - reference documentation: [openai](../../providers/inference/remote_openai.md)
|
||||
- **[Fireworks](https://fireworks.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and
|
||||
embeddings - provider ID: `fireworks` - reference documentation: [fireworks](../../providers/inference/remote_fireworks.md)
|
||||
- **[Together](https://together.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and
|
||||
embeddings - provider ID: `together` - reference documentation: [together](../../providers/inference/remote_together.md)
|
||||
- **[Anthropic](https://www.anthropic.com/)**: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings - provider ID: `anthropic` - reference documentation: [anthropic](../../providers/inference/remote_anthropic.md)
|
||||
- **[Gemini](https://gemini.google.com/)**: Gemini 1.5, 2.0, 2.5 models and text embeddings - provider ID: `gemini` - reference documentation: [gemini](../../providers/inference/remote_gemini.md)
|
||||
- **[Groq](https://groq.com/)**: Fast Llama models (3.1, 3.2, 3.3, 4 Scout, 4 Maverick) - provider ID: `groq` - reference documentation: [groq](../../providers/inference/remote_groq.md)
|
||||
- **[SambaNova](https://www.sambanova.ai/)**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models - provider ID: `sambanova` - reference documentation: [sambanova](../../providers/inference/remote_sambanova.md)
|
||||
- **[Cerebras](https://www.cerebras.ai/)**: Cerebras AI models - provider ID: `cerebras` - reference documentation: [cerebras](../../providers/inference/remote_cerebras.md)
|
||||
- **[NVIDIA](https://www.nvidia.com/)**: NVIDIA NIM - provider ID: `nvidia` - reference documentation: [nvidia](../../providers/inference/remote_nvidia.md)
|
||||
- **[HuggingFace](https://huggingface.co/)**: Serverless and endpoint models - provider ID: `hf::serverless` and `hf::endpoint` - reference documentation: [huggingface-serverless](../../providers/inference/remote_hf_serverless.md) and [huggingface-endpoint](../../providers/inference/remote_hf_endpoint.md)
|
||||
- **[Bedrock](https://aws.amazon.com/bedrock/)**: AWS Bedrock models - provider ID: `bedrock` - reference documentation: [bedrock](../../providers/inference/remote_bedrock.md)
|
||||
|
||||
### Local/Remote Providers
|
||||
- **[Ollama](https://ollama.ai/)**: Local Ollama models - provider ID: `ollama` - reference documentation: [ollama](../../providers/inference/remote_ollama.md)
|
||||
- **[vLLM](https://docs.vllm.ai/en/latest/)**: Local or remote vLLM server - provider ID: `vllm` - reference documentation: [vllm](../../providers/inference/remote_vllm.md)
|
||||
- **[TGI](https://github.com/huggingface/text-generation-inference)**: Text Generation Inference server - Dell Enterprise Hub's custom TGI container too (use `DEH_URL`) - provider ID: `tgi` - reference documentation: [tgi](../../providers/inference/remote_tgi.md)
|
||||
- **[Sentence Transformers](https://www.sbert.net/)**: Local embedding models - provider ID: `sentence-transformers` - reference documentation: [sentence-transformers](../../providers/inference/inline_sentence-transformers.md)
|
||||
|
||||
All providers are disabled by default. So you need to enable them by setting the environment variables.
|
||||
|
||||
## Vector IO
|
||||
|
||||
The starter distribution includes a comprehensive set of vector IO providers:
|
||||
|
||||
- **[FAISS](https://github.com/facebookresearch/faiss)**: Local FAISS vector store - enabled by
|
||||
default - provider ID: `faiss`
|
||||
- **[SQLite](https://www.sqlite.org/index.html)**: Local SQLite vector store - disabled by default - provider ID: `sqlite-vec`
|
||||
- **[ChromaDB](https://www.trychroma.com/)**: Remote ChromaDB vector store - disabled by default - provider ID: `chromadb`
|
||||
- **[PGVector](https://github.com/pgvector/pgvector)**: PostgreSQL vector store - disabled by default - provider ID: `pgvector`
|
||||
- **[Milvus](https://milvus.io/)**: Milvus vector store - disabled by default - provider ID: `milvus`
|
||||
|
||||
## Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
### Server Configuration
|
||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
|
||||
### API Keys for Hosted Providers
|
||||
- `OPENAI_API_KEY`: OpenAI API key
|
||||
- `FIREWORKS_API_KEY`: Fireworks API key
|
||||
- `TOGETHER_API_KEY`: Together API key
|
||||
- `ANTHROPIC_API_KEY`: Anthropic API key
|
||||
- `GEMINI_API_KEY`: Google Gemini API key
|
||||
- `GROQ_API_KEY`: Groq API key
|
||||
- `SAMBANOVA_API_KEY`: SambaNova API key
|
||||
- `CEREBRAS_API_KEY`: Cerebras API key
|
||||
- `LLAMA_API_KEY`: Llama API key
|
||||
- `NVIDIA_API_KEY`: NVIDIA API key
|
||||
- `HF_API_TOKEN`: HuggingFace API token
|
||||
|
||||
### Local Provider Configuration
|
||||
- `OLLAMA_URL`: Ollama server URL (default: `http://localhost:11434`)
|
||||
- `VLLM_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
|
||||
- `VLLM_MAX_TOKENS`: vLLM max tokens (default: `4096`)
|
||||
- `VLLM_API_TOKEN`: vLLM API token (default: `fake`)
|
||||
- `VLLM_TLS_VERIFY`: vLLM TLS verification (default: `true`)
|
||||
- `TGI_URL`: TGI server URL
|
||||
|
||||
### Model Configuration
|
||||
- `INFERENCE_MODEL`: HuggingFace model for serverless inference
|
||||
- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name
|
||||
- `OLLAMA_INFERENCE_MODEL`: Ollama model name
|
||||
- `OLLAMA_EMBEDDING_MODEL`: Ollama embedding model name
|
||||
- `OLLAMA_EMBEDDING_DIMENSION`: Ollama embedding dimension (default: `384`)
|
||||
- `VLLM_INFERENCE_MODEL`: vLLM model name
|
||||
|
||||
### Vector Database Configuration
|
||||
- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
|
||||
- `ENABLE_SQLITE_VEC`: Enable SQLite vector provider
|
||||
- `ENABLE_CHROMADB`: Enable ChromaDB provider
|
||||
- `ENABLE_PGVECTOR`: Enable PGVector provider
|
||||
- `CHROMADB_URL`: ChromaDB server URL
|
||||
- `PGVECTOR_HOST`: PGVector host (default: `localhost`)
|
||||
- `PGVECTOR_PORT`: PGVector port (default: `5432`)
|
||||
- `PGVECTOR_DB`: PGVector database name
|
||||
- `PGVECTOR_USER`: PGVector username
|
||||
- `PGVECTOR_PASSWORD`: PGVector password
|
||||
|
||||
### Tool Configuration
|
||||
- `BRAVE_SEARCH_API_KEY`: Brave Search API key
|
||||
- `TAVILY_SEARCH_API_KEY`: Tavily Search API key
|
||||
|
||||
### Telemetry Configuration
|
||||
- `OTEL_SERVICE_NAME`: OpenTelemetry service name
|
||||
- `TELEMETRY_SINKS`: Telemetry sinks (default: `console,sqlite`)
|
||||
|
||||
## Enabling Providers
|
||||
|
||||
You can enable specific providers by setting their provider ID to a valid value using environment variables. This is useful when you want to use certain providers or don't have the required API keys.
|
||||
|
||||
### Examples of Enabling Providers
|
||||
|
||||
#### Enable FAISS Vector Provider
|
||||
```bash
|
||||
export ENABLE_FAISS=faiss
|
||||
```
|
||||
|
||||
#### Enable Ollama Models
|
||||
```bash
|
||||
export ENABLE_OLLAMA=ollama
|
||||
```
|
||||
|
||||
#### Disable vLLM Models
|
||||
```bash
|
||||
export VLLM_INFERENCE_MODEL=__disabled__
|
||||
```
|
||||
|
||||
#### Disable Optional Vector Providers
|
||||
```bash
|
||||
export ENABLE_SQLITE_VEC=__disabled__
|
||||
export ENABLE_CHROMADB=__disabled__
|
||||
export ENABLE_PGVECTOR=__disabled__
|
||||
```
|
||||
|
||||
### Provider ID Patterns
|
||||
|
||||
The starter distribution uses several patterns for provider IDs:
|
||||
|
||||
1. **Direct provider IDs**: `faiss`, `ollama`, `vllm`
|
||||
2. **Environment-based provider IDs**: `${env.ENABLE_SQLITE_VEC+sqlite-vec}`
|
||||
3. **Model-based provider IDs**: `${env.OLLAMA_INFERENCE_MODEL:__disabled__}`
|
||||
|
||||
When using the `+` pattern (like `${env.ENABLE_SQLITE_VEC+sqlite-vec}`), the provider is enabled by default and can be disabled by setting the environment variable to `__disabled__`.
|
||||
|
||||
When using the `:` pattern (like `${env.OLLAMA_INFERENCE_MODEL:__disabled__}`), the provider is disabled by default and can be enabled by setting the environment variable to a valid value.
|
||||
|
||||
## Running the Distribution
|
||||
|
||||
You can run the starter distribution via Docker or Conda.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-e OPENAI_API_KEY=your_openai_key \
|
||||
-e FIREWORKS_API_KEY=your_fireworks_key \
|
||||
-e TOGETHER_API_KEY=your_together_key \
|
||||
llamastack/distribution-starter \
|
||||
--port $LLAMA_STACK_PORT
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
|
||||
|
||||
```bash
|
||||
llama stack build --template starter --image-type conda
|
||||
llama stack run distributions/starter/run.yaml \
|
||||
--port 8321 \
|
||||
--env OPENAI_API_KEY=your_openai_key \
|
||||
--env FIREWORKS_API_KEY=your_fireworks_key \
|
||||
--env TOGETHER_API_KEY=your_together_key
|
||||
```
|
||||
|
||||
## Example Usage
|
||||
|
||||
Once the distribution is running, you can use any of the available models. Here are some examples:
|
||||
|
||||
### Using OpenAI Models
|
||||
```bash
|
||||
llama-stack-client --endpoint http://localhost:8321 \
|
||||
inference chat-completion \
|
||||
--model-id openai/gpt-4o \
|
||||
--message "Hello, how are you?"
|
||||
```
|
||||
|
||||
### Using Fireworks Models
|
||||
```bash
|
||||
llama-stack-client --endpoint http://localhost:8321 \
|
||||
inference chat-completion \
|
||||
--model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
|
||||
--message "Write a short story about a robot."
|
||||
```
|
||||
|
||||
### Using Local Ollama Models
|
||||
```bash
|
||||
# First, make sure Ollama is running and you have a model
|
||||
ollama run llama3.2:3b
|
||||
|
||||
# Then use it through Llama Stack
|
||||
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
|
||||
llama-stack-client --endpoint http://localhost:8321 \
|
||||
inference chat-completion \
|
||||
--model-id ollama/llama3.2:3b \
|
||||
--message "Explain quantum computing in simple terms."
|
||||
```
|
||||
|
||||
## Storage
|
||||
|
||||
The starter distribution uses SQLite for local storage of various components:
|
||||
|
||||
- **Metadata store**: `~/.llama/distributions/starter/registry.db`
|
||||
- **Inference store**: `~/.llama/distributions/starter/inference_store.db`
|
||||
- **FAISS store**: `~/.llama/distributions/starter/faiss_store.db`
|
||||
- **SQLite vector store**: `~/.llama/distributions/starter/sqlite_vec.db`
|
||||
- **Files metadata**: `~/.llama/distributions/starter/files_metadata.db`
|
||||
- **Agents store**: `~/.llama/distributions/starter/agents_store.db`
|
||||
- **Responses store**: `~/.llama/distributions/starter/responses_store.db`
|
||||
- **Trace store**: `~/.llama/distributions/starter/trace_store.db`
|
||||
- **Evaluation store**: `~/.llama/distributions/starter/meta_reference_eval.db`
|
||||
- **Dataset I/O stores**: Various HuggingFace and local filesystem stores
|
||||
|
||||
## Benefits of the Starter Distribution
|
||||
|
||||
1. **Comprehensive Coverage**: Includes most popular AI providers in one distribution
|
||||
2. **Flexible Configuration**: Easy to enable/disable providers based on your needs
|
||||
3. **No Local GPU Required**: Most providers are cloud-based, making it accessible to developers without high-end hardware
|
||||
4. **Easy Migration**: Start with hosted providers and gradually move to local ones as needed
|
||||
5. **Production Ready**: Includes safety, evaluation, and telemetry components
|
||||
6. **Tool Integration**: Comes with web search, RAG, and model context protocol tools
|
||||
|
||||
The starter distribution is ideal for developers who want to experiment with different AI providers, build prototypes quickly, or create applications that can work with multiple AI backends.
|
|
@ -1,149 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
|
||||
# TGI Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-tgi` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| inference | `remote::tgi`, `inline::sentence-transformers` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
|
||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
You can use this distribution if you have GPUs and want to run an independent TGI server container for running inference.
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
- `INFERENCE_MODEL`: Inference model loaded into the TGI server (default: `meta-llama/Llama-3.2-3B-Instruct`)
|
||||
- `TGI_URL`: URL of the TGI server with the main inference model (default: `http://127.0.0.1:8080/v1`)
|
||||
- `TGI_SAFETY_URL`: URL of the TGI server with the safety model (default: `http://127.0.0.1:8081/v1`)
|
||||
- `SAFETY_MODEL`: Name of the safety (Llama-Guard) model to use (default: `meta-llama/Llama-Guard-3-1B`)
|
||||
|
||||
|
||||
## Setting up TGI server
|
||||
|
||||
Please check the [TGI Getting Started Guide](https://github.com/huggingface/text-generation-inference?tab=readme-ov-file#get-started) to get a TGI endpoint. Here is a sample script to start a TGI server locally via Docker:
|
||||
|
||||
```bash
|
||||
export INFERENCE_PORT=8080
|
||||
export INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
|
||||
export CUDA_VISIBLE_DEVICES=0
|
||||
|
||||
docker run --rm -it \
|
||||
--pull always \
|
||||
-v $HOME/.cache/huggingface:/data \
|
||||
-p $INFERENCE_PORT:$INFERENCE_PORT \
|
||||
--gpus $CUDA_VISIBLE_DEVICES \
|
||||
ghcr.io/huggingface/text-generation-inference:2.3.1 \
|
||||
--dtype bfloat16 \
|
||||
--usage-stats off \
|
||||
--sharded false \
|
||||
--cuda-memory-fraction 0.7 \
|
||||
--model-id $INFERENCE_MODEL \
|
||||
--port $INFERENCE_PORT
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, then you will need to also run another instance of a TGI with a corresponding safety model like `meta-llama/Llama-Guard-3-1B` using a script like:
|
||||
|
||||
```bash
|
||||
export SAFETY_PORT=8081
|
||||
export SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
|
||||
export CUDA_VISIBLE_DEVICES=1
|
||||
|
||||
docker run --rm -it \
|
||||
--pull always \
|
||||
-v $HOME/.cache/huggingface:/data \
|
||||
-p $SAFETY_PORT:$SAFETY_PORT \
|
||||
--gpus $CUDA_VISIBLE_DEVICES \
|
||||
ghcr.io/huggingface/text-generation-inference:2.3.1 \
|
||||
--dtype bfloat16 \
|
||||
--usage-stats off \
|
||||
--sharded false \
|
||||
--model-id $SAFETY_MODEL \
|
||||
--port $SAFETY_PORT
|
||||
```
|
||||
|
||||
## Running Llama Stack
|
||||
|
||||
Now you are ready to run Llama Stack with TGI as the inference provider. You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
llamastack/distribution-tgi \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env TGI_URL=http://host.docker.internal:$INFERENCE_PORT
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, use:
|
||||
|
||||
```bash
|
||||
# You need a local checkout of llama-stack to run this, get it using
|
||||
# git clone https://github.com/meta-llama/llama-stack.git
|
||||
cd /path/to/llama-stack
|
||||
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
-v ~/.llama:/root/.llama \
|
||||
-v ./llama_stack/templates/tgi/run-with-safety.yaml:/root/my-run.yaml \
|
||||
llamastack/distribution-tgi \
|
||||
--config /root/my-run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env TGI_URL=http://host.docker.internal:$INFERENCE_PORT \
|
||||
--env SAFETY_MODEL=$SAFETY_MODEL \
|
||||
--env TGI_SAFETY_URL=http://host.docker.internal:$SAFETY_PORT
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
Make sure you have done `uv pip install llama-stack` and have the Llama Stack CLI available.
|
||||
|
||||
```bash
|
||||
llama stack build --template tgi --image-type conda
|
||||
llama stack run ./run.yaml
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env TGI_URL=http://127.0.0.1:$INFERENCE_PORT
|
||||
```
|
||||
|
||||
If you are using Llama Stack Safety / Shield APIs, use:
|
||||
|
||||
```bash
|
||||
llama stack run ./run-with-safety.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env INFERENCE_MODEL=$INFERENCE_MODEL \
|
||||
--env TGI_URL=http://127.0.0.1:$INFERENCE_PORT \
|
||||
--env SAFETY_MODEL=$SAFETY_MODEL \
|
||||
--env TGI_SAFETY_URL=http://127.0.0.1:$SAFETY_PORT
|
||||
```
|
|
@ -1,86 +0,0 @@
|
|||
---
|
||||
orphan: true
|
||||
---
|
||||
<!-- This file was auto-generated by distro_codegen.py, please edit source -->
|
||||
# Together Distribution
|
||||
|
||||
```{toctree}
|
||||
:maxdepth: 2
|
||||
:hidden:
|
||||
|
||||
self
|
||||
```
|
||||
|
||||
The `llamastack/distribution-together` distribution consists of the following provider configurations.
|
||||
|
||||
| API | Provider(s) |
|
||||
|-----|-------------|
|
||||
| agents | `inline::meta-reference` |
|
||||
| datasetio | `remote::huggingface`, `inline::localfs` |
|
||||
| eval | `inline::meta-reference` |
|
||||
| inference | `remote::together`, `inline::sentence-transformers` |
|
||||
| safety | `inline::llama-guard` |
|
||||
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
|
||||
| telemetry | `inline::meta-reference` |
|
||||
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol`, `remote::wolfram-alpha` |
|
||||
| vector_io | `inline::faiss`, `remote::chromadb`, `remote::pgvector` |
|
||||
|
||||
|
||||
### Environment Variables
|
||||
|
||||
The following environment variables can be configured:
|
||||
|
||||
- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)
|
||||
- `TOGETHER_API_KEY`: Together.AI API Key (default: ``)
|
||||
|
||||
### Models
|
||||
|
||||
The following models are available by default:
|
||||
|
||||
- `meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo (aliases: meta-llama/Llama-3.1-8B-Instruct)`
|
||||
- `meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo (aliases: meta-llama/Llama-3.1-70B-Instruct)`
|
||||
- `meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo (aliases: meta-llama/Llama-3.1-405B-Instruct-FP8)`
|
||||
- `meta-llama/Llama-3.2-3B-Instruct-Turbo (aliases: meta-llama/Llama-3.2-3B-Instruct)`
|
||||
- `meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo (aliases: meta-llama/Llama-3.2-11B-Vision-Instruct)`
|
||||
- `meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo (aliases: meta-llama/Llama-3.2-90B-Vision-Instruct)`
|
||||
- `meta-llama/Llama-3.3-70B-Instruct-Turbo (aliases: meta-llama/Llama-3.3-70B-Instruct)`
|
||||
- `meta-llama/Meta-Llama-Guard-3-8B (aliases: meta-llama/Llama-Guard-3-8B)`
|
||||
- `meta-llama/Llama-Guard-3-11B-Vision-Turbo (aliases: meta-llama/Llama-Guard-3-11B-Vision)`
|
||||
- `togethercomputer/m2-bert-80M-8k-retrieval `
|
||||
- `togethercomputer/m2-bert-80M-32k-retrieval `
|
||||
- `meta-llama/Llama-4-Scout-17B-16E-Instruct (aliases: meta-llama/Llama-4-Scout-17B-16E-Instruct, together/meta-llama/Llama-4-Scout-17B-16E-Instruct)`
|
||||
- `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 (aliases: meta-llama/Llama-4-Maverick-17B-128E-Instruct, together/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8)`
|
||||
|
||||
|
||||
### Prerequisite: API Keys
|
||||
|
||||
Make sure you have access to a Together API Key. You can get one by visiting [together.xyz](https://together.xyz/).
|
||||
|
||||
|
||||
## Running Llama Stack with Together
|
||||
|
||||
You can do this via Conda (build code) or Docker which has a pre-built image.
|
||||
|
||||
### Via Docker
|
||||
|
||||
This method allows you to get started quickly without having to build the distribution code.
|
||||
|
||||
```bash
|
||||
LLAMA_STACK_PORT=8321
|
||||
docker run \
|
||||
-it \
|
||||
--pull always \
|
||||
-p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
|
||||
llamastack/distribution-together \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env TOGETHER_API_KEY=$TOGETHER_API_KEY
|
||||
```
|
||||
|
||||
### Via Conda
|
||||
|
||||
```bash
|
||||
llama stack build --template together --image-type conda
|
||||
llama stack run ./run.yaml \
|
||||
--port $LLAMA_STACK_PORT \
|
||||
--env TOGETHER_API_KEY=$TOGETHER_API_KEY
|
||||
```
|
|
@ -58,7 +58,7 @@ which defines the providers and their settings.
|
|||
Now let's build and run the Llama Stack config for Ollama.
|
||||
|
||||
```bash
|
||||
INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type venv --run
|
||||
INFERENCE_MODEL=llama3.2:3b llama stack build --template starter --image-type venv --run
|
||||
```
|
||||
:::
|
||||
:::{tab-item} Using `conda`
|
||||
|
@ -69,7 +69,7 @@ which defines the providers and their settings.
|
|||
Now let's build and run the Llama Stack config for Ollama.
|
||||
|
||||
```bash
|
||||
INFERENCE_MODEL=llama3.2:3b llama stack build --template ollama --image-type conda --image-name llama3-3b-conda --run
|
||||
INFERENCE_MODEL=llama3.2:3b llama stack build --template starter --image-type conda --image-name llama3-3b-conda --run
|
||||
```
|
||||
:::
|
||||
:::{tab-item} Using a Container
|
||||
|
|
|
@ -19,7 +19,7 @@ ollama run llama3.2:3b --keepalive 60m
|
|||
#### Step 2: Run the Llama Stack server
|
||||
We will use `uv` to run the Llama Stack server.
|
||||
```bash
|
||||
INFERENCE_MODEL=llama3.2:3b uv run --with llama-stack llama stack build --template ollama --image-type venv --run
|
||||
INFERENCE_MODEL=llama3.2:3b uv run --with llama-stack llama stack build --template starter --image-type venv --run
|
||||
```
|
||||
#### Step 3: Run the demo
|
||||
Now open up a new terminal and copy the following script into a file named `demo_script.py`.
|
||||
|
|
|
@ -9,13 +9,11 @@ Ollama inference provider for running local models through the Ollama runtime.
|
|||
| Field | Type | Required | Default | Description |
|
||||
|-------|------|----------|---------|-------------|
|
||||
| `url` | `<class 'str'>` | No | http://localhost:11434 | |
|
||||
| `raise_on_connect_error` | `<class 'bool'>` | No | True | |
|
||||
|
||||
## Sample Configuration
|
||||
|
||||
```yaml
|
||||
url: ${env.OLLAMA_URL:=http://localhost:11434}
|
||||
raise_on_connect_error: true
|
||||
|
||||
```
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ RunPod inference provider for running models on RunPod's cloud GPU platform.
|
|||
|
||||
```yaml
|
||||
url: ${env.RUNPOD_URL:=}
|
||||
api_token: ${env.RUNPOD_API_TOKEN:=}
|
||||
api_token: ${env.RUNPOD_API_TOKEN}
|
||||
|
||||
```
|
||||
|
||||
|
|
|
@ -15,7 +15,7 @@ Together AI inference provider for open-source models and collaborative AI devel
|
|||
|
||||
```yaml
|
||||
url: https://api.together.xyz/v1
|
||||
api_key: ${env.TOGETHER_API_KEY:=}
|
||||
api_key: ${env.TOGETHER_API_KEY}
|
||||
|
||||
```
|
||||
|
||||
|
|
|
@ -23,7 +23,7 @@ To use the HF SFTTrainer in your Llama Stack project, follow these steps:
|
|||
You can access the HuggingFace trainer via the `ollama` distribution:
|
||||
|
||||
```bash
|
||||
llama stack build --template ollama --image-type venv
|
||||
llama stack build --template starter --image-type venv
|
||||
llama stack run --image-type venv ~/.llama/distributions/ollama/ollama-run.yaml
|
||||
```
|
||||
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue