---
orphan: true
---

# Starter Distribution

```{toctree}
:maxdepth: 2
:hidden:

self
```

The `llamastack/distribution-starter` distribution is a comprehensive, multi-provider distribution that includes most of the available inference providers in Llama Stack. It's designed to be a one-stop solution for developers who want to experiment with different AI providers without having to configure each one individually.

## Provider Composition

The starter distribution consists of the following provider configurations:

| API | Provider(s) |
|-----|-------------|
| agents | `inline::meta-reference` |
| datasetio | `remote::huggingface`, `inline::localfs` |
| eval | `inline::meta-reference` |
| files | `inline::localfs` |
| inference | `remote::openai`, `remote::fireworks`, `remote::together`, `remote::ollama`, `remote::anthropic`, `remote::gemini`, `remote::groq`, `remote::sambanova`, `remote::vllm`, `remote::tgi`, `remote::cerebras`, `remote::llama-openai-compat`, `remote::nvidia`, `remote::hf::serverless`, `remote::hf::endpoint`, `inline::sentence-transformers` |
| safety | `inline::llama-guard` |
| scoring | `inline::basic`, `inline::llm-as-judge`, `inline::braintrust` |
| telemetry | `inline::meta-reference` |
| tool_runtime | `remote::brave-search`, `remote::tavily-search`, `inline::rag-runtime`, `remote::model-context-protocol` |
| vector_io | `inline::faiss`, `inline::sqlite-vec`, `inline::milvus`, `remote::chromadb`, `remote::pgvector` |

## Inference Providers

The starter distribution includes a comprehensive set of inference providers:

### Hosted Providers

- **OpenAI**: GPT-4, GPT-3.5, O1, O3, O4 models and text embeddings - provider ID: `openai` - reference documentation: openai
- **Fireworks**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings - provider ID: `fireworks` - reference documentation: fireworks
- **Together**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models and embeddings - provider ID: `together` - reference documentation: together
- **Anthropic**: Claude 3.5 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Haiku, and Voyage embeddings - provider ID: `anthropic` - reference documentation: anthropic
- **Gemini**: Gemini 1.5, 2.0, 2.5 models and text embeddings - provider ID: `gemini` - reference documentation: gemini
- **Groq**: Fast Llama models (3.1, 3.2, 3.3, 4 Scout, 4 Maverick) - provider ID: `groq` - reference documentation: groq
- **SambaNova**: Llama 3.1, 3.2, 3.3, 4 Scout, 4 Maverick models - provider ID: `sambanova` - reference documentation: sambanova
- **Cerebras**: Cerebras AI models - provider ID: `cerebras` - reference documentation: cerebras
- **NVIDIA**: NVIDIA NIM - provider ID: `nvidia` - reference documentation: nvidia
- **HuggingFace**: Serverless and endpoint models - provider IDs: `hf::serverless` and `hf::endpoint` - reference documentation: huggingface-serverless and huggingface-endpoint
- **Bedrock**: AWS Bedrock models - provider ID: `bedrock` - reference documentation: bedrock

### Local/Remote Providers

- **Ollama**: Local Ollama models - provider ID: `ollama` - reference documentation: ollama
- **vLLM**: Local or remote vLLM server - provider ID: `vllm` - reference documentation: vllm
- **TGI**: Text Generation Inference server - also supports Dell Enterprise Hub's custom TGI container (set `DEH_URL`) - provider ID: `tgi` - reference documentation: tgi
- **Sentence Transformers**: Local embedding models - provider ID: `sentence-transformers` - reference documentation: sentence-transformers

All providers are disabled by default, so you need to enable them by setting the appropriate environment variables (see the Enabling Providers section below).

## Vector IO

The starter distribution includes a comprehensive set of vector IO providers:

- **FAISS**: Local FAISS vector store - enabled by default - provider ID: `faiss`
- **SQLite**: Local SQLite vector store - disabled by default - provider ID: `sqlite-vec`
- **ChromaDB**: Remote ChromaDB vector store - disabled by default - provider ID: `chromadb`
- **PGVector**: PostgreSQL vector store - disabled by default - provider ID: `pgvector`
- **Milvus**: Milvus vector store - disabled by default - provider ID: `milvus`

## Environment Variables

The following environment variables can be configured:

### Server Configuration

- `LLAMA_STACK_PORT`: Port for the Llama Stack distribution server (default: `8321`)

### API Keys for Hosted Providers

- `OPENAI_API_KEY`: OpenAI API key
- `FIREWORKS_API_KEY`: Fireworks API key
- `TOGETHER_API_KEY`: Together API key
- `ANTHROPIC_API_KEY`: Anthropic API key
- `GEMINI_API_KEY`: Google Gemini API key
- `GROQ_API_KEY`: Groq API key
- `SAMBANOVA_API_KEY`: SambaNova API key
- `CEREBRAS_API_KEY`: Cerebras API key
- `LLAMA_API_KEY`: Llama API key
- `NVIDIA_API_KEY`: NVIDIA API key
- `HF_API_TOKEN`: HuggingFace API token
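
For example, a minimal sketch of enabling a couple of hosted providers before starting the server; the key values are placeholders, and any key you leave unset keeps that provider disabled.

```bash
# Placeholders only -- substitute your real keys
export OPENAI_API_KEY=your_openai_key
export GROQ_API_KEY=your_groq_key
```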

### Local Provider Configuration

- `OLLAMA_URL`: Ollama server URL (default: `http://localhost:11434`)
- `VLLM_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
- `VLLM_MAX_TOKENS`: vLLM max tokens (default: `4096`)
- `VLLM_API_TOKEN`: vLLM API token (default: `fake`)
- `VLLM_TLS_VERIFY`: vLLM TLS verification (default: `true`)
- `TGI_URL`: TGI server URL
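
As an illustration, a sketch of pointing the vLLM provider at a remote server over TLS; the host name and token below are illustrative values, not defaults.

```bash
export VLLM_URL=https://my-vllm-host:8000/v1   # illustrative host name
export VLLM_API_TOKEN=your_vllm_token
export VLLM_MAX_TOKENS=4096
export VLLM_TLS_VERIFY=true
```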

### Model Configuration

- `INFERENCE_MODEL`: HuggingFace model for serverless inference
- `INFERENCE_ENDPOINT_NAME`: HuggingFace endpoint name

### Vector Database Configuration

- `SQLITE_STORE_DIR`: SQLite store directory (default: `~/.llama/distributions/starter`)
- `ENABLE_SQLITE_VEC`: Enable SQLite vector provider
- `ENABLE_CHROMADB`: Enable ChromaDB provider
- `ENABLE_PGVECTOR`: Enable PGVector provider
- `CHROMADB_URL`: ChromaDB server URL
- `PGVECTOR_HOST`: PGVector host (default: `localhost`)
- `PGVECTOR_PORT`: PGVector port (default: `5432`)
- `PGVECTOR_DB`: PGVector database name
- `PGVECTOR_USER`: PGVector username
- `PGVECTOR_PASSWORD`: PGVector password
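
As an example, a PGVector setup might look like the following; the database name and credentials are illustrative, so use your own.

```bash
export PGVECTOR_HOST=localhost
export PGVECTOR_PORT=5432
export PGVECTOR_DB=llama_stack_db
export PGVECTOR_USER=llama
export PGVECTOR_PASSWORD=your_password
```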

### Tool Configuration

- `BRAVE_SEARCH_API_KEY`: Brave Search API key
- `TAVILY_SEARCH_API_KEY`: Tavily Search API key
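
For example, to enable the hosted web-search tools (the key values are placeholders):

```bash
export BRAVE_SEARCH_API_KEY=your_brave_key
export TAVILY_SEARCH_API_KEY=your_tavily_key
```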

### Telemetry Configuration

- `OTEL_SERVICE_NAME`: OpenTelemetry service name
- `TELEMETRY_SINKS`: Telemetry sinks (default: `console,sqlite`)
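
For example, keeping the default sinks while naming the service for your OpenTelemetry backend; the service name below is illustrative.

```bash
export OTEL_SERVICE_NAME=llama-stack-starter
export TELEMETRY_SINKS=console,sqlite
```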

## Enabling Providers

You can enable specific providers by setting the appropriate environment variables. For example:

```bash
# self-hosted
export OLLAMA_URL=http://localhost:11434   # enables the Ollama inference provider
export VLLM_URL=http://localhost:8000/v1   # enables the vLLM inference provider
export TGI_URL=http://localhost:8000/v1    # enables the TGI inference provider

# cloud-hosted, requiring API key configuration on the server
export CEREBRAS_API_KEY=your_cerebras_api_key   # enables the Cerebras inference provider
export NVIDIA_API_KEY=your_nvidia_api_key       # enables the NVIDIA inference provider

# vector providers
export MILVUS_URL=http://localhost:19530      # enables the Milvus vector provider
export CHROMADB_URL=http://localhost:8000/v1  # enables the ChromaDB vector provider
export PGVECTOR_DB=llama_stack_db             # enables the PGVector vector provider
```

This distribution comes with a default `llama-guard` shield that can be enabled by setting the `SAFETY_MODEL` environment variable to point to an appropriate Llama Guard model ID. Use `llama-stack-client models list` to see the list of available models.
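
A minimal sketch; the model ID below is illustrative, so confirm the exact ID available from your enabled providers with `llama-stack-client models list`.

```bash
export SAFETY_MODEL=meta-llama/Llama-Guard-3-8B   # illustrative model ID
```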

## Running the Distribution

You can run the starter distribution via Docker or venv.

### Via Docker

This method allows you to get started quickly without having to build the distribution code.

```bash
LLAMA_STACK_PORT=8321
docker run \
  -it \
  --pull always \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e OPENAI_API_KEY=your_openai_key \
  -e FIREWORKS_API_KEY=your_fireworks_key \
  -e TOGETHER_API_KEY=your_together_key \
  llamastack/distribution-starter \
  --port $LLAMA_STACK_PORT
```
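
If you prefer not to pass keys inline, Docker's `--env-file` flag is an alternative; the `starter.env` file below is one you create yourself, and the key values are placeholders.

```bash
# Create an env file with your provider keys (values are placeholders)
cat > starter.env <<'EOF'
OPENAI_API_KEY=your_openai_key
FIREWORKS_API_KEY=your_fireworks_key
EOF

docker run \
  -it \
  --pull always \
  -p 8321:8321 \
  --env-file starter.env \
  llamastack/distribution-starter \
  --port 8321
```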

### Via venv

Ensure you have configured the starter distribution using the environment variables explained above.

```bash
uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```
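
For example, a sketch of enabling the local Ollama provider before launching; this assumes an Ollama server is already running on its default port.

```bash
export OLLAMA_URL=http://localhost:11434
uv run --with llama-stack llama stack build --distro starter --image-type venv --run
```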

## Example Usage

Once the distribution is running, you can use any of the available models. Here are some examples:

### Using OpenAI Models

```bash
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id openai/gpt-4o \
  --message "Hello, how are you?"
```

### Using Fireworks Models

```bash
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id fireworks/meta-llama/Llama-3.2-3B-Instruct \
  --message "Write a short story about a robot."
```

### Using Local Ollama Models

```bash
# First, make sure Ollama is running and you have a model
ollama run llama3.2:3b

# Then use it through Llama Stack
export OLLAMA_INFERENCE_MODEL=llama3.2:3b
llama-stack-client --endpoint http://localhost:8321 \
  inference chat-completion \
  --model-id ollama/llama3.2:3b \
  --message "Explain quantum computing in simple terms."
```
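
To see which model IDs are actually registered against your running server (the set depends on which providers you enabled), you can list them first:

```bash
llama-stack-client --endpoint http://localhost:8321 models list
```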

## Storage

The starter distribution uses SQLite for local storage of various components:

- **Metadata store**: `~/.llama/distributions/starter/registry.db`
- **Inference store**: `~/.llama/distributions/starter/inference_store.db`
- **FAISS store**: `~/.llama/distributions/starter/faiss_store.db`
- **SQLite vector store**: `~/.llama/distributions/starter/sqlite_vec.db`
- **Files metadata**: `~/.llama/distributions/starter/files_metadata.db`
- **Agents store**: `~/.llama/distributions/starter/agents_store.db`
- **Responses store**: `~/.llama/distributions/starter/responses_store.db`
- **Trace store**: `~/.llama/distributions/starter/trace_store.db`
- **Evaluation store**: `~/.llama/distributions/starter/meta_reference_eval.db`
- **Dataset I/O stores**: Various HuggingFace and local filesystem stores
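
All of these live under the same directory, so a quick way to see what has been created on your machine is:

```bash
ls ~/.llama/distributions/starter/
```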

## Benefits of the Starter Distribution

1. **Comprehensive Coverage**: Includes most popular AI providers in one distribution
2. **Flexible Configuration**: Easy to enable/disable providers based on your needs
3. **No Local GPU Required**: Most providers are cloud-based, making it accessible to developers without high-end hardware
4. **Easy Migration**: Start with hosted providers and gradually move to local ones as needed
5. **Production Ready**: Includes safety, evaluation, and telemetry components
6. **Tool Integration**: Comes with web search, RAG, and model context protocol tools

The starter distribution is ideal for developers who want to experiment with different AI providers, build prototypes quickly, or create applications that can work with multiple AI backends.