mirror of https://github.com/meta-llama/llama-stack.git
synced 2025-06-28 19:04:19 +00:00

wip

Signed-off-by: Sébastien Han <seb@redhat.com>

parent bedfea38c3, commit 6b616cc780
3 changed files with 34 additions and 140 deletions

The `llamastack/distribution-starter` distribution is a comprehensive, multi-provider Llama Stack distribution.

## Provider Composition

The starter distribution consists of the following provider configurations:

| API | Provider(s) |
|-----|-------------|
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| files | `inline::localfs` |
| inference | `remote::openai`, `remote::fireworks`, `remote::together`, `remote::ollama`, `remote::anthropic`, `remote::gemini`, `remote::groq`, `remote::sambanova`, `remote::vllm`, `remote::tgi`, `remote::cerebras`, `remote::llama-openai-compat`, `remote::nvidia`, `remote::hf::serverless`, `remote::hf::endpoint`, `inline::sentence-transformers`, `remote::passthrough` |
| safety | `inline::llama-guard` |
| post_training | `inline::huggingface` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
The starter distribution includes a comprehensive set of inference providers:

### Hosted Providers

- **OpenAI**: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings
- **Fireworks**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings
- **Together**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings
- **Anthropic**: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings
- **NVIDIA**: NVIDIA NIM models
- **HuggingFace**: Serverless and endpoint models
- **Bedrock**: AWS Bedrock models
- **Passthrough**: Passthrough provider for connecting to any inference provider that is not natively supported by Llama Stack

### Local/Remote Providers

- **Ollama**: Local Ollama models
- **vLLM**: Local or remote vLLM server
- **TGI**: Text Generation Inference server, including Dell Enterprise Hub's custom TGI container (use `DEH_URL`)
- **Sentence Transformers**: Local embedding models

All providers are **disabled** by default, so you need to enable them by setting environment variables. See [Enabling Providers](#enabling-providers) for more details.
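
For example, a minimal setup that pairs a hosted provider with a local Ollama server could look like the sketch below. The API key and model name are placeholders, and depending on the provider you may also need its own `ENABLE_*` switch as described under Enabling Providers below:

```bash
# Hosted provider: supply its API key (placeholder value).
export OPENAI_API_KEY=sk-your-key-here

# Local provider: enable Ollama and point it at a running server.
export ENABLE_OLLAMA=ollama
export OLLAMA_URL=http://localhost:11434
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
```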

## Vector Providers

The starter distribution includes a comprehensive set of vector providers:

- **FAISS**: Local FAISS vector store - enabled by default
- **SQLite**: Local SQLite vector store - disabled by default
- **ChromaDB**: Remote ChromaDB server - disabled by default
- **PGVector**: Remote PGVector server - disabled by default

## Environment Variables

The following environment variables can be configured:

### Server Configuration

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)

### API Keys for Hosted Providers

- `OPENAI_API_KEY`: OpenAI API key
- `FIREWORKS_API_KEY`: Fireworks API key
- `TOGETHER_API_KEY`: Together API key
- `ANTHROPIC_API_KEY`: Anthropic API key
- `GEMINI_API_KEY`: Google Gemini API key
- `GROQ_API_KEY`: Groq API key
- `SAMBANOVA_API_KEY`: SambaNova API key
- `CEREBRAS_API_KEY`: Cerebras API key
- `LLAMA_API_KEY`: Llama API key
- `NVIDIA_API_KEY`: NVIDIA API key
- `HF_API_TOKEN`: HuggingFace API token

### Local Provider Configuration

- `OLLAMA_URL`: Ollama server URL (default: `http://localhost:11434`)
- `VLLM_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
- `VLLM_MAX_TOKENS`: vLLM max tokens (default: `4096`)
- `VLLM_API_TOKEN`: vLLM API token (default: `fake`)
- `VLLM_TLS_VERIFY`: vLLM TLS verification (default: `true`)
- `TGI_URL`: TGI server URL

### Model Configuration

- `INFERENCE_MODEL`: HuggingFace model for serverless inference
- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name
- `OLLAMA_INFERENCE_MODEL`: Ollama model name
- `OLLAMA_EMBEDDING_MODEL`: Ollama embedding model name
- `OLLAMA_EMBEDDING_DIMENSION`: Ollama embedding dimension (default: `384`)
- `VLLM_INFERENCE_MODEL`: vLLM model name

### Vector Database Configuration

- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
- `ENABLE_SQLITE_VEC`: Enable SQLite vector provider
- `ENABLE_CHROMADB`: Enable ChromaDB provider
- `ENABLE_PGVECTOR`: Enable PGVector provider
- `CHROMADB_URL`: ChromaDB server URL
- `PGVECTOR_HOST`: PGVector host (default: `localhost`)
- `PGVECTOR_PORT`: PGVector port (default: `5432`)
- `PGVECTOR_DB`: PGVector database name
- `PGVECTOR_USER`: PGVector username
- `PGVECTOR_PASSWORD`: PGVector password

### Tool Configuration

- `BRAVE_SEARCH_API_KEY`: Brave Search API key
- `TAVILY_SEARCH_API_KEY`: Tavily Search API key

### Telemetry Configuration

- `OTEL_SERVICE_NAME`: OpenTelemetry service name
- `TELEMETRY_SINKS`: Telemetry sinks (default: `console,sqlite`)
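
As an illustration, a configuration that serves a model from a local vLLM endpoint with the default telemetry sinks might export the following (the URL, model name, and service name are placeholders; setting `VLLM_INFERENCE_MODEL` to a real model follows the model-based pattern described under Provider ID Patterns below):

```bash
# Server port
export LLAMA_STACK_PORT=8321

# Local vLLM provider (placeholder endpoint and model)
export VLLM_URL=http://localhost:8000/v1
export VLLM_INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
export VLLM_TLS_VERIFY=false

# Telemetry
export OTEL_SERVICE_NAME=llama-stack-starter
export TELEMETRY_SINKS=console,sqlite
```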
## Enabling Providers

You can enable specific providers by setting their provider ID to a valid value using environment variables. This is useful when you only want to use certain providers or don't have API keys for all of them. For instance, to enable the Ollama provider, set the `ENABLE_OLLAMA` environment variable to `ollama`.

### Examples of Enabling Providers

#### Enable FAISS Vector Provider

```bash
export ENABLE_FAISS=faiss
```

#### Enable Ollama Models

```bash
export ENABLE_OLLAMA=ollama
```

To disable a provider, set its environment variable to `__disabled__`, for example `ENABLE_OLLAMA=__disabled__`.

#### Disable vLLM Models

```bash
export VLLM_INFERENCE_MODEL=__disabled__
```

#### Disable Optional Vector Providers

```bash
export ENABLE_SQLITE_VEC=__disabled__
export ENABLE_CHROMADB=__disabled__
export ENABLE_PGVECTOR=__disabled__
```
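
The remote vector providers follow the same pattern. Here is a sketch for ChromaDB, where the `chromadb` provider ID value is assumed by analogy with `ENABLE_OLLAMA=ollama` and the URL is a placeholder:

```bash
# Enable the remote ChromaDB vector provider and point it at a server.
export ENABLE_CHROMADB=chromadb
export CHROMADB_URL=http://localhost:8000
```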
### Provider ID Patterns

The starter distribution uses several patterns for provider IDs:

1. **Direct provider IDs**: `faiss`, `ollama`, `vllm`
2. **Environment-based provider IDs**: `${env.ENABLE_SQLITE_VEC+sqlite-vec}`
3. **Model-based provider IDs**: `${env.OLLAMA_INFERENCE_MODEL:__disabled__}`

When using the `+` pattern (like `${env.ENABLE_SQLITE_VEC+sqlite-vec}`), the provider is enabled by default and can be disabled by setting the environment variable to `__disabled__`.

When using the `:` pattern (like `${env.OLLAMA_INFERENCE_MODEL:__disabled__}`), the provider is disabled by default and can be enabled by setting the environment variable to a valid value.
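
In practice, the two patterns map onto environment variables as in this sketch:

```bash
# `+` pattern (enabled by default): set the variable to __disabled__ to turn the provider off.
export ENABLE_SQLITE_VEC=__disabled__

# `:` pattern (disabled by default): supply a real value to turn the provider on.
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
```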

## Running the Distribution

You can run the starter distribution via Docker or directly with the Llama Stack CLI.

### Via Docker

```bash
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e ENABLE_OLLAMA=ollama \
  -e OLLAMA_INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  llamastack/distribution-starter \
  --port $LLAMA_STACK_PORT
```
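
Once the container is up, you can sanity-check the server from another terminal, for example by listing the registered models with the client CLI (assuming `llama-stack-client` is installed and your version provides a `models list` subcommand):

```bash
# List the models the running distribution has registered.
llama-stack-client --endpoint http://localhost:8321 models list
```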

You can also use the `llama stack run` command to run the distribution. Make sure you have run `uv pip install llama-stack` so that the Llama Stack CLI is available.

```bash
llama stack run distributions/starter/run.yaml \
  --port 8321 \
  --env ENABLE_OLLAMA=ollama \
  --env OLLAMA_INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
```
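
Any of the environment variables listed above can be passed the same way. For example, to also relocate the SQLite stores (the path is a placeholder):

```bash
llama stack run distributions/starter/run.yaml \
  --port 8321 \
  --env ENABLE_OLLAMA=ollama \
  --env OLLAMA_INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
  --env SQLITE_STORE_DIR=/tmp/llama-stack-starter
```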

## Example Usage

Once the distribution is running, you can use any of the available models. Here are some examples:

### Using OpenAI Models

```bash
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id openai/gpt-4o \
  --message "Hello, how are you?"
```

### Using Fireworks Models

```bash
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
  --message "Write a short story about a robot."
```

### Using Local Ollama Models

```bash
# First, make sure Ollama is running and you have a model
ollama run llama3.2:3b

# Then use it through Llama Stack
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id ollama/llama3.2:3b \
  --message "Explain quantum computing in simple terms."
```

## Storage

The relevant vector store and model registration entries in the distribution's `run.yaml` look like this:

```yaml
providers:
  # ... (FAISS vector provider entry truncated)
        type: sqlite
        namespace: null
        db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/starter}/faiss_store.db
  - provider_id: ${env.ENABLE_SQLITE_VEC:=__disabled__}
    provider_type: inline::sqlite-vec
    config:
      db_path: ${env.SQLITE_STORE_DIR:=~/.llama/distributions/starter}/sqlite_vec.db
```

The models list also registers an Ollama safety model alongside the existing Ollama embedding and Anthropic entries:

```yaml
models:
  # ... (Ollama embedding model entry truncated)
  provider_id: ${env.ENABLE_OLLAMA:=__disabled__}
  provider_model_id: ${env.OLLAMA_EMBEDDING_MODEL:=__disabled__}
  model_type: embedding
- metadata: {}
  model_id: ${env.ENABLE_OLLAMA:=__disabled__}/${env.OLLAMA_SAFETY_MODEL:=__disabled__}
  provider_id: ${env.ENABLE_OLLAMA:=__disabled__}
  provider_model_id: ${env.OLLAMA_SAFETY_MODEL:=__disabled__}
  model_type: llm
- metadata: {}
  model_id: ${env.ENABLE_ANTHROPIC:=__disabled__}/anthropic/claude-3-5-sonnet-latest
  provider_id: ${env.ENABLE_ANTHROPIC:=__disabled__}
```
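
The safety model entry above stays disabled until `OLLAMA_SAFETY_MODEL` is set. A sketch of activating it, where the model tag is an assumption:

```bash
# Enable Ollama and register a safety model for it (model tag is a placeholder).
export ENABLE_OLLAMA=ollama
export OLLAMA_SAFETY_MODEL=llama-guard3:1b
```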

The matching change in `get_inference_providers()` registers the Ollama safety model as an additional provider model entry:

```python
                "embedding_dimension": "${env.OLLAMA_EMBEDDING_DIMENSION:=384}",
            },
        ),
        ProviderModelEntry(
            provider_model_id="${env.OLLAMA_SAFETY_MODEL:=__disabled__}",
            model_type=ModelType.llm,
        ),
    ],
    OllamaImplConfig.sample_run_config(
        url="${env.OLLAMA_URL:=http://localhost:11434}", raise_on_connect_error=False
    ),
```