
Starter Distribution


The llamastack/distribution-starter distribution is a comprehensive, multi-provider configuration that bundles most of the inference providers available in Llama Stack. It's designed as a one-stop solution for developers who want to experiment with different AI providers without having to configure each one individually.

Provider Composition

The starter distribution consists of the following provider configurations:

| API | Provider(s) |
|-----|-------------|
| agents | inline::meta-reference |
| datasetio | remote::huggingface, inline::localfs |
| eval | inline::meta-reference |
| files | inline::localfs |
| inference | remote::openai, remote::fireworks, remote::together, remote::ollama, remote::anthropic, remote::gemini, remote::groq, remote::sambanova, remote::vllm, remote::tgi, remote::cerebras, remote::llama-openai-compat, remote::nvidia, remote::hf::serverless, remote::hf::endpoint, inline::sentence-transformers |
| safety | inline::llama-guard |
| scoring | inline::basic, inline::llm-as-judge, inline::braintrust |
| telemetry | inline::meta-reference |
| tool_runtime | remote::brave-search, remote::tavily-search, inline::rag-runtime, remote::model-context-protocol |
| vector_io | inline::faiss, inline::sqlite-vec, remote::chromadb, remote::pgvector |
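
Once the distribution is running (see Running the Distribution below), you can confirm which of these providers were actually loaded. A minimal check, assuming the llama-stack-client CLI is installed and that your client version ships the providers list subcommand:

# Inspect the providers registered with a running stack
llama-stack-client --endpoint http://localhost:8321 providers list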

Inference Providers

The starter distribution includes a comprehensive set of inference providers:

Hosted Providers

  • OpenAI: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings
  • Fireworks: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings
  • Together: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings
  • Anthropic: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings
  • Gemini: Gemini 1.5, 2.0, 2.5 models and text embeddings
  • Groq: Fast Llama models (3.1, 3.2, 3.3, 4 Scout, 4 Maverick)
  • SambaNova: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models
  • Cerebras: Cerebras AI models
  • NVIDIA: NVIDIA NIM models
  • HuggingFace: Serverless and endpoint models
  • Bedrock: AWS Bedrock models

Local/Remote Providers

  • Ollama: Local Ollama models
  • vLLM: Local or remote vLLM server
  • TGI: Text Generation Inference server, including Dell Enterprise Hub's custom TGI container (use DEH_URL)
  • Sentence Transformers: Local embedding models

All of these providers are disabled by default, so you need to enable the ones you want by setting the corresponding environment variables.

Environment Variables

The following environment variables can be configured:

Server Configuration

  • LLAMA_STACK_PORT: Port for the Llama Stack distribution server (default: 8321)

API Keys for Hosted Providers

  • OPENAI_API_KEY: OpenAI API key
  • FIREWORKS_API_KEY: Fireworks API key
  • TOGETHER_API_KEY: Together API key
  • ANTHROPIC_API_KEY: Anthropic API key
  • GEMINI_API_KEY: Google Gemini API key
  • GROQ_API_KEY: Groq API key
  • SAMBANOVA_API_KEY: SambaNova API key
  • CEREBRAS_API_KEY: Cerebras API key
  • LLAMA_API_KEY: Llama API key
  • NVIDIA_API_KEY: NVIDIA API key
  • HF_API_TOKEN: HuggingFace API token

Local Provider Configuration

  • OLLAMA_URL: Ollama server URL (default: http://localhost:11434)
  • VLLM_URL: vLLM server URL (default: http://localhost:8000/v1)
  • VLLM_MAX_TOKENS: vLLM max tokens (default: 4096)
  • VLLM_API_TOKEN: vLLM API token (default: fake)
  • VLLM_TLS_VERIFY: vLLM TLS verification (default: true)
  • TGI_URL: TGI server URL
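
For example, to point the stack at a vLLM server that is already running somewhere else, you would combine several of these variables. A sketch, with a placeholder address and token (adjust them to your own deployment):

# Use an existing remote vLLM server for inference
export VLLM_URL=https://my-vllm-host:8000/v1   # placeholder address
export VLLM_API_TOKEN=my-real-token            # replaces the default "fake" token
export VLLM_TLS_VERIFY=true
export VLLM_MAX_TOKENS=4096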

Model Configuration

  • INFERENCE_MODEL: HuggingFace model for serverless inference
  • INFERENCE_ENDPOINT_NAME: HuggingFace endpoint name
  • OLLAMA_INFERENCE_MODEL: Ollama model name
  • OLLAMA_EMBEDDING_MODEL: Ollama embedding model name
  • OLLAMA_EMBEDDING_DIMENSION: Ollama embedding dimension (default: 384)
  • VLLM_INFERENCE_MODEL: vLLM model name
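
As an illustration, a HuggingFace serverless setup combines the model variable above with the API token from the previous section. A sketch with a placeholder token and an example model name:

# Serverless inference via HuggingFace (model name is illustrative)
export HF_API_TOKEN=hf_your_token_here
export INFERENCE_MODEL=meta-llama/Llama-3.1-8B-Instruct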

Vector Database Configuration

  • SQLITE_STORE_DIR: SQLite store directory (default: ~/.llama/distributions/starter)
  • ENABLE_SQLITE_VEC: Enable SQLite vector provider
  • ENABLE_CHROMADB: Enable ChromaDB provider
  • ENABLE_PGVECTOR: Enable PGVector provider
  • CHROMADB_URL: ChromaDB server URL
  • PGVECTOR_HOST: PGVector host (default: localhost)
  • PGVECTOR_PORT: PGVector port (default: 5432)
  • PGVECTOR_DB: PGVector database name
  • PGVECTOR_USER: PGVector username
  • PGVECTOR_PASSWORD: PGVector password
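
For instance, switching vector storage to PGVector might look roughly like the following, assuming a Postgres instance with the pgvector extension is already reachable (connection details are placeholders):

# Enable PGVector and point it at an existing Postgres + pgvector instance
export ENABLE_PGVECTOR=pgvector        # ENABLE_<PROVIDER>=<provider id>, see Enabling Providers below
export PGVECTOR_HOST=localhost
export PGVECTOR_PORT=5432
export PGVECTOR_DB=llamastack          # placeholder database name
export PGVECTOR_USER=llamastack        # placeholder username
export PGVECTOR_PASSWORD=changeme      # placeholder password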

Tool Configuration

  • BRAVE_SEARCH_API_KEY: Brave Search API key
  • TAVILY_SEARCH_API_KEY: Tavily Search API key

Telemetry Configuration

  • OTEL_SERVICE_NAME: OpenTelemetry service name
  • TELEMETRY_SINKS: Telemetry sinks (default: console,sqlite)

Enabling Providers

You can enable specific providers by setting their provider ID to a valid value via environment variables. This is useful when you only want to run certain providers, or when you don't have API keys for the others.

Examples of Enabling Providers

Enable FAISS Vector Provider

export ENABLE_FAISS=faiss

Enable Ollama Models

export ENABLE_OLLAMA=ollama

Disable vLLM Models

export VLLM_INFERENCE_MODEL=__disabled__

Disable Optional Vector Providers

export ENABLE_SQLITE_VEC=__disabled__
export ENABLE_CHROMADB=__disabled__
export ENABLE_PGVECTOR=__disabled__

Provider ID Patterns

The starter distribution uses several patterns for provider IDs:

  1. Direct provider IDs: faiss, ollama, vllm
  2. Environment-based provider IDs: ${env.ENABLE_SQLITE_VEC+sqlite-vec}
  3. Model-based provider IDs: ${env.OLLAMA_INFERENCE_MODEL:__disabled__}

When using the + pattern (like ${env.ENABLE_SQLITE_VEC+sqlite-vec}), the provider is enabled by default and can be disabled by setting the environment variable to __disabled__.

When using the : pattern (like ${env.OLLAMA_INFERENCE_MODEL:__disabled__}), the provider is disabled by default and can be enabled by setting the environment variable to a valid value.
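
Putting the two patterns together, a typical local setup might look like this sketch (the model name is only an example):

# + pattern: sqlite-vec is enabled by default, so opt out explicitly
export ENABLE_SQLITE_VEC=__disabled__

# : pattern: the Ollama model is disabled by default, so opt in with a value
export OLLAMA_INFERENCE_MODEL=llama3.2:3b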

Running the Distribution

You can run the starter distribution via Docker or Conda.

Via Docker

This method allows you to get started quickly without having to build the distribution code.

LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e OPENAI_API_KEY=your_openai_key \
  -e FIREWORKS_API_KEY=your_fireworks_key \
  -e TOGETHER_API_KEY=your_together_key \
  llamastack/distribution-starter \
  --port $LLAMA_STACK_PORT
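
If you want the containerized stack to talk to a local Ollama server instead of (or in addition to) hosted providers, pass the Ollama variables the same way. A sketch that assumes Ollama runs on the Docker host and that host.docker.internal resolves from inside the container (it does on Docker Desktop; on Linux you may need --add-host=host.docker.internal:host-gateway or the host's IP):

docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e ENABLE_OLLAMA=ollama \
  -e OLLAMA_URL=http://host.docker.internal:11434 \
  -e OLLAMA_INFERENCE_MODEL=llama3.2:3b \
  llamastack/distribution-starter \
  --port $LLAMA_STACK_PORT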

Via Conda

Make sure you have installed llama-stack (for example with uv pip install llama-stack) and that the Llama Stack CLI is available.

llama stack build --template starter --image-type conda
llama stack run distributions/starter/run.yaml \
  --port 8321 \
  --env OPENAI_API_KEY=your_openai_key \
  --env FIREWORKS_API_KEY=your_fireworks_key \
  --env TOGETHER_API_KEY=your_together_key

Example Usage

Once the distribution is running, you can use any of the available models. Here are some examples:
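
Before picking a model ID, it can help to see what the running stack has actually registered. A quick check, assuming your llama-stack-client version provides the models list subcommand:

llama-stack-client --endpoint http://localhost:8321 models list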

Using OpenAI Models

llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id openai/gpt-4o \
  --message "Hello, how are you?"

Using Fireworks Models

llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
  --message "Write a short story about a robot."

Using Local Ollama Models

# First, make sure Ollama is running and you have a model
ollama run llama3.2:3b

# Then use it through Llama Stack
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id ollama/llama3.2:3b \
  --message "Explain quantum computing in simple terms."

Storage

The starter distribution uses SQLite for local storage of various components:

  • Metadata store: ~/.llama/distributions/starter/registry.db
  • Inference store: ~/.llama/distributions/starter/inference_store.db
  • FAISS store: ~/.llama/distributions/starter/faiss_store.db
  • SQLite vector store: ~/.llama/distributions/starter/sqlite_vec.db
  • Files metadata: ~/.llama/distributions/starter/files_metadata.db
  • Agents store: ~/.llama/distributions/starter/agents_store.db
  • Responses store: ~/.llama/distributions/starter/responses_store.db
  • Trace store: ~/.llama/distributions/starter/trace_store.db
  • Evaluation store: ~/.llama/distributions/starter/meta_reference_eval.db
  • Dataset I/O stores: Various HuggingFace and local filesystem stores
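
These are plain SQLite files, so you can inspect them with the standard sqlite3 CLI if you are curious about what the stack has persisted. For example (table names may vary between versions):

# Peek at the registry database
sqlite3 ~/.llama/distributions/starter/registry.db ".tables"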

Benefits of the Starter Distribution

  1. Comprehensive Coverage: Includes most popular AI providers in one distribution
  2. Flexible Configuration: Easy to enable/disable providers based on your needs
  3. No Local GPU Required: Most providers are cloud-based, making it accessible to developers without high-end hardware
  4. Easy Migration: Start with hosted providers and gradually move to local ones as needed
  5. Production Ready: Includes safety, evaluation, and telemetry components
  6. Tool Integration: Comes with web search, RAG, and model context protocol tools

The starter distribution is ideal for developers who want to experiment with different AI providers, build prototypes quickly, or create applications that can work with multiple AI backends.