# What does this PR do?

This commit enhances the signal handling mechanism in the server by improving the `handle_signal` (previously `handle_sigint`) function. It now properly retrieves the signal name, ensuring clearer logging when a termination signal is received. Additionally, it cancels all running tasks and waits for their completion before stopping the event loop, allowing for a more graceful shutdown. Support for handling SIGTERM has also been added alongside SIGINT.

Before the changes, `handle_sigint` used `asyncio.run(run_shutdown())`. However, `asyncio.run()` is meant to start a new event loop, and calling it inside an existing one (like when running Uvicorn) raises an error. The fix replaces `asyncio.run(run_shutdown())` with an async function scheduled on the existing loop using `loop.create_task(shutdown())`. This ensures that the shutdown coroutine runs within the current event loop instead of trying to create a new one.

Furthermore, this commit updates the project dependencies. `fastapi` and `uvicorn` have been added to the development dependencies in `pyproject.toml` and `uv.lock`, ensuring that the necessary packages are available for development and execution.

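The shutdown flow described above (log the signal name, cancel running tasks, wait for them, then stop the loop) can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the `events` list, the `install_handlers` helper, and the `demo` function are assumptions made for the example, and the demo simulates signal delivery with `loop.call_later` instead of sending a real signal.

```python
import asyncio
import functools
import signal


async def shutdown(loop, events):
    """Cancel every other task, wait for them to finish, then stop the loop."""
    current = asyncio.current_task()
    tasks = [t for t in asyncio.all_tasks(loop) if t is not current]
    for task in tasks:
        task.cancel()
    # return_exceptions=True keeps CancelledError from propagating here.
    await asyncio.gather(*tasks, return_exceptions=True)
    events.append("shutdown complete")
    loop.stop()


def handle_signal(signum, loop, events):
    """Schedule shutdown() on the already-running loop (no asyncio.run())."""
    name = signal.Signals(signum).name
    events.append(f"Received signal {name} ({int(signum)}). Exiting gracefully...")
    loop.create_task(shutdown(loop, events))


def install_handlers(loop, events):
    """Hypothetical helper: register SIGINT and SIGTERM (Unix event loops only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        loop.add_signal_handler(
            sig, functools.partial(handle_signal, sig, loop, events)
        )


def demo():
    """Run a long-lived task and simulate SIGTERM arriving shortly after."""
    loop = asyncio.new_event_loop()
    events = []

    async def worker():
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            events.append("worker cancelled")
            raise

    task = loop.create_task(worker())
    # Simulated delivery; a real server would rely on install_handlers().
    loop.call_later(0.01, handle_signal, signal.SIGTERM, loop, events)
    try:
        loop.run_forever()
    finally:
        loop.close()
    return events, task.cancelled()
```

Because `handle_signal` only calls `loop.create_task`, it never tries to start a second event loop, which is the failure mode the description attributes to `asyncio.run()`.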
Closes: https://github.com/meta-llama/llama-stack/issues/1043

Signed-off-by: Sébastien Han <seb@redhat.com>

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Run a server and send SIGINT:

```
INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" python -m llama_stack.distribution.server.server --yaml-config ./llama_stack/templates/ollama/run.yaml
Using config file: llama_stack/templates/ollama/run.yaml
Run configuration:
apis:
- agents
- datasetio
- eval
- inference
- safety
- scoring
- telemetry
- tool_runtime
- vector_io
container_image: null
datasets: []
eval_tasks: []
image_name: ollama
metadata_store:
  db_path: /Users/leseb/.llama/distributions/ollama/registry.db
  namespace: null
  type: sqlite
models:
- metadata: {}
  model_id: meta-llama/Llama-3.2-3B-Instruct
  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
  - llm
  provider_id: ollama
  provider_model_id: null
- metadata:
    embedding_dimension: 384
  model_id: all-MiniLM-L6-v2
  model_type: !!python/object/apply:llama_stack.apis.models.models.ModelType
  - embedding
  provider_id: sentence-transformers
  provider_model_id: null
providers:
  agents:
  - config:
      persistence_store:
        db_path: /Users/leseb/.llama/distributions/ollama/agents_store.db
        namespace: null
        type: sqlite
    provider_id: meta-reference
    provider_type: inline::meta-reference
  datasetio:
  - config: {}
    provider_id: huggingface
    provider_type: remote::huggingface
  - config: {}
    provider_id: localfs
    provider_type: inline::localfs
  eval:
  - config: {}
    provider_id: meta-reference
    provider_type: inline::meta-reference
  inference:
  - config:
      url: http://localhost:11434
    provider_id: ollama
    provider_type: remote::ollama
  - config: {}
    provider_id: sentence-transformers
    provider_type: inline::sentence-transformers
  safety:
  - config: {}
    provider_id: llama-guard
    provider_type: inline::llama-guard
  scoring:
  - config: {}
    provider_id: basic
    provider_type: inline::basic
  - config: {}
    provider_id: llm-as-judge
    provider_type: inline::llm-as-judge
  - config:
      openai_api_key: '********'
    provider_id: braintrust
    provider_type: inline::braintrust
  telemetry:
  - config:
      service_name: llama-stack
      sinks: console,sqlite
      sqlite_db_path: /Users/leseb/.llama/distributions/ollama/trace_store.db
    provider_id: meta-reference
    provider_type: inline::meta-reference
  tool_runtime:
  - config:
      api_key: '********'
      max_results: 3
    provider_id: brave-search
    provider_type: remote::brave-search
  - config:
      api_key: '********'
      max_results: 3
    provider_id: tavily-search
    provider_type: remote::tavily-search
  - config: {}
    provider_id: code-interpreter
    provider_type: inline::code-interpreter
  - config: {}
    provider_id: rag-runtime
    provider_type: inline::rag-runtime
  vector_io:
  - config:
      kvstore:
        db_path: /Users/leseb/.llama/distributions/ollama/faiss_store.db
        namespace: null
        type: sqlite
    provider_id: faiss
    provider_type: inline::faiss
scoring_fns: []
server:
  port: 8321
  tls_certfile: null
  tls_keyfile: null
shields: []
tool_groups:
- args: null
  mcp_endpoint: null
  provider_id: tavily-search
  toolgroup_id: builtin::websearch
- args: null
  mcp_endpoint: null
  provider_id: rag-runtime
  toolgroup_id: builtin::rag
- args: null
  mcp_endpoint: null
  provider_id: code-interpreter
  toolgroup_id: builtin::code_interpreter
vector_dbs: []
version: '2'

INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:213: Resolved 31 providers
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-inference => ollama
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-inference => sentence-transformers
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: models => __routing_table__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inference => __autorouted__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-vector_io => faiss
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-safety => llama-guard
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: shields => __routing_table__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: safety => __autorouted__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: vector_dbs => __routing_table__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: vector_io => __autorouted__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => brave-search
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => tavily-search
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => code-interpreter
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-tool_runtime => rag-runtime
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: tool_groups => __routing_table__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: tool_runtime => __autorouted__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: agents => meta-reference
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-datasetio => huggingface
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-datasetio => localfs
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: datasets => __routing_table__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: datasetio => __autorouted__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: telemetry => meta-reference
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => basic
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => llm-as-judge
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-scoring => braintrust
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: scoring_functions => __routing_table__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: scoring => __autorouted__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inner-eval => meta-reference
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: eval_tasks => __routing_table__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: eval => __autorouted__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:215: inspect => __builtin__
INFO 2025-02-12 10:21:03,540 llama_stack.distribution.resolver:216:
INFO 2025-02-12 10:21:03,723 llama_stack.providers.remote.inference.ollama.ollama:148: checking connectivity to Ollama at `http://localhost:11434`...
INFO 2025-02-12 10:21:03,734 httpx:1740: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK"
INFO 2025-02-12 10:21:03,843 faiss.loader:148: Loading faiss.
INFO 2025-02-12 10:21:03,865 faiss.loader:150: Successfully loaded faiss.
INFO 2025-02-12 10:21:03,868 faiss:173: Failed to load GPU Faiss: name 'GpuIndexIVFFlat' is not defined. Will not load constructor refs for GPU indexes.
Warning: `bwrap` is not available. Code interpreter tool will not work correctly.
INFO 2025-02-12 10:21:04,315 datasets:54: PyTorch version 2.6.0 available.
INFO 2025-02-12 10:21:04,556 httpx:1740: HTTP Request: GET http://localhost:11434/api/ps "HTTP/1.1 200 OK"
INFO 2025-02-12 10:21:04,557 llama_stack.providers.utils.inference.embedding_mixin:42: Loading sentence transformer for all-MiniLM-L6-v2...
INFO 2025-02-12 10:21:07,202 sentence_transformers.SentenceTransformer:210: Use pytorch device_name: mps
INFO 2025-02-12 10:21:07,202 sentence_transformers.SentenceTransformer:218: Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO 2025-02-12 10:21:09,500 llama_stack.distribution.stack:102: Models: all-MiniLM-L6-v2 served by sentence-transformers
INFO 2025-02-12 10:21:09,500 llama_stack.distribution.stack:102: Models: meta-llama/Llama-3.2-3B-Instruct served by ollama
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::equality served by basic
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::regex_parser_multiple_choice_answer served by basic
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: basic::subset_of served by basic
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-correctness served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-relevancy served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::answer-similarity served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-entity-recall served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-precision served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-recall served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::context-relevancy served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::factuality served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: braintrust::faithfulness served by braintrust
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: llm-as-judge::405b-simpleqa served by llm-as-judge
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Scoring_fns: llm-as-judge::base served by llm-as-judge
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::code_interpreter served by code-interpreter
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::rag served by rag-runtime
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:102: Tool_groups: builtin::websearch served by tavily-search
INFO 2025-02-12 10:21:09,501 llama_stack.distribution.stack:106:
Serving API eval
 POST /v1/eval/tasks/{task_id}/evaluations
 DELETE /v1/eval/tasks/{task_id}/jobs/{job_id}
 GET /v1/eval/tasks/{task_id}/jobs/{job_id}/result
 GET /v1/eval/tasks/{task_id}/jobs/{job_id}
 POST /v1/eval/tasks/{task_id}/jobs
Serving API agents
 POST /v1/agents
 POST /v1/agents/{agent_id}/session
 POST /v1/agents/{agent_id}/session/{session_id}/turn
 DELETE /v1/agents/{agent_id}
 DELETE /v1/agents/{agent_id}/session/{session_id}
 GET /v1/agents/{agent_id}/session/{session_id}
 GET /v1/agents/{agent_id}/session/{session_id}/turn/{turn_id}/step/{step_id}
 GET /v1/agents/{agent_id}/session/{session_id}/turn/{turn_id}
Serving API scoring_functions
 GET /v1/scoring-functions/{scoring_fn_id}
 GET /v1/scoring-functions
 POST /v1/scoring-functions
Serving API safety
 POST /v1/safety/run-shield
Serving API inspect
 GET /v1/health
 GET /v1/inspect/providers
 GET /v1/inspect/routes
 GET /v1/version
Serving API tool_runtime
 POST /v1/tool-runtime/invoke
 GET /v1/tool-runtime/list-tools
 POST /v1/tool-runtime/rag-tool/insert
 POST /v1/tool-runtime/rag-tool/query
Serving API datasetio
 POST /v1/datasetio/rows
 GET /v1/datasetio/rows
Serving API shields
 GET /v1/shields/{identifier}
 GET /v1/shields
 POST /v1/shields
Serving API eval_tasks
 GET /v1/eval-tasks/{eval_task_id}
 GET /v1/eval-tasks
 POST /v1/eval-tasks
Serving API models
 GET /v1/models/{model_id}
 GET /v1/models
 POST /v1/models
 DELETE /v1/models/{model_id}
Serving API datasets
 GET /v1/datasets/{dataset_id}
 GET /v1/datasets
 POST /v1/datasets
 DELETE /v1/datasets/{dataset_id}
Serving API vector_io
 POST /v1/vector-io/insert
 POST /v1/vector-io/query
Serving API inference
 POST /v1/inference/chat-completion
 POST /v1/inference/completion
 POST /v1/inference/embeddings
Serving API tool_groups
 GET /v1/tools/{tool_name}
 GET /v1/toolgroups/{toolgroup_id}
 GET /v1/toolgroups
 GET /v1/tools
 POST /v1/toolgroups
 DELETE /v1/toolgroups/{toolgroup_id}
Serving API vector_dbs
 GET /v1/vector-dbs/{vector_db_id}
 GET /v1/vector-dbs
 POST /v1/vector-dbs
 DELETE /v1/vector-dbs/{vector_db_id}
Serving API scoring
 POST /v1/scoring/score
 POST /v1/scoring/score-batch
Serving API telemetry
 GET /v1/telemetry/traces/{trace_id}/spans/{span_id}
 GET /v1/telemetry/spans/{span_id}/tree
 GET /v1/telemetry/traces/{trace_id}
 POST /v1/telemetry/events
 GET /v1/telemetry/spans
 GET /v1/telemetry/traces
 POST /v1/telemetry/spans/export

Listening on ['::', '0.0.0.0']:5001
INFO: Started server process [65372]
INFO: Waiting for application startup.
INFO: ASGI 'lifespan' protocol appears unsupported.
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:5001 (Press CTRL+C to quit)
^CINFO: Shutting down
INFO: Finished server process [65372]
Received signal SIGINT (2). Exiting gracefully...
INFO 2025-02-12 10:21:11,215 __main__:151: Shutting down ModelsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down InferenceRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ShieldsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down SafetyRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down VectorDBsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down VectorIORouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ToolGroupsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ToolRuntimeRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down MetaReferenceAgentsImpl
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DatasetsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DatasetIORouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down TelemetryAdapter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ScoringFunctionsRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down ScoringRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down EvalTasksRoutingTable
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down EvalRouter
INFO 2025-02-12 10:21:11,216 __main__:151: Shutting down DistributionInspectImpl
```

[//]: # (## Documentation)
[//]: # (- [ ] Added a Changelog entry if the change is significant)

Signed-off-by: Sébastien Han <seb@redhat.com>

# Llama Stack
Quick Start | Documentation | Colab Notebook
Llama Stack standardizes the core building blocks that simplify AI application development. It codifies best practices across the Llama ecosystem. More specifically, it provides:
- Unified API layer for Inference, RAG, Agents, Tools, Safety, Evals, and Telemetry.
- Plugin architecture to support the rich ecosystem of different API implementations in various environments, including local development, on-premises, cloud, and mobile.
- Prepackaged verified distributions which offer a one-stop solution for developers to get started quickly and reliably in any environment.
- Multiple developer interfaces like CLI and SDKs for Python, Typescript, iOS, and Android.
- Standalone applications as examples for how to build production-grade AI applications with Llama Stack.
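To make the "unified API layer" and "plugin architecture" bullets more concrete, here is a small sketch of how one interface can sit in front of interchangeable providers. It is not Llama Stack's actual code: `InferenceProvider`, `FakeProvider`, and `InferenceRouter` are hypothetical names invented for this illustration, and the routing-table idea only loosely mirrors the `__routing_table__` entries visible in the server log.

```python
from typing import Dict, Protocol


class InferenceProvider(Protocol):
    """Hypothetical interface that every inference plugin would implement."""

    def completion(self, model_id: str, prompt: str) -> str: ...


class FakeProvider:
    """Stand-in provider; a real one would call Ollama, Fireworks, etc."""

    def __init__(self, provider_id: str) -> None:
        self.provider_id = provider_id

    def completion(self, model_id: str, prompt: str) -> str:
        return f"{self.provider_id}/{model_id}: {prompt}"


class InferenceRouter:
    """Dispatches each request to the provider registered for the model."""

    def __init__(self) -> None:
        self._routing_table: Dict[str, InferenceProvider] = {}

    def register_model(self, model_id: str, provider: InferenceProvider) -> None:
        self._routing_table[model_id] = provider

    def completion(self, model_id: str, prompt: str) -> str:
        try:
            provider = self._routing_table[model_id]
        except KeyError:
            raise ValueError(f"no provider registered for {model_id!r}") from None
        return provider.completion(model_id, prompt)


router = InferenceRouter()
router.register_model("meta-llama/Llama-3.2-3B-Instruct", FakeProvider("ollama"))
router.register_model("all-MiniLM-L6-v2", FakeProvider("sentence-transformers"))
```

Application code only ever talks to `router.completion(...)`, so swapping the provider behind a model id does not change the caller.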
## Llama Stack Benefits
- Flexible Options: Developers can choose their preferred infrastructure without changing APIs and enjoy flexible deployment choices.
- Consistent Experience: With its unified APIs, Llama Stack makes it easier to build, test, and deploy AI applications with consistent application behavior.
- Robust Ecosystem: Llama Stack is already integrated with distribution partners (cloud providers, hardware vendors, and AI-focused companies) that offer tailored infrastructure, software, and services for deploying Llama models.
By reducing friction and complexity, Llama Stack empowers developers to focus on what they do best: building transformative generative AI applications.
## API Providers
Here is a list of the various API providers and available distributions that can help developers get started easily with Llama Stack.
API Provider Builder | Environments | Agents | Inference | Memory | Safety | Telemetry |
---|---|---|---|---|---|---|
Meta Reference | Single Node | ✅ | ✅ | ✅ | ✅ | ✅ |
SambaNova | Hosted | ✅ | ||||
Cerebras | Hosted | ✅ | ||||
Fireworks | Hosted | ✅ | ✅ | ✅ | ||
AWS Bedrock | Hosted | ✅ | ✅ | |||
Together | Hosted | ✅ | ✅ | ✅ | ||
Groq | Hosted | ✅ | ||||
Ollama | Single Node | ✅ | ||||
TGI | Hosted and Single Node | ✅ | ||||
NVIDIA NIM | Hosted and Single Node | ✅ | ||||
Chroma | Single Node | ✅ | ||||
PG Vector | Single Node | ✅ | ||||
PyTorch ExecuTorch | On-device iOS | ✅ | ✅ | |||
vLLM | Hosted and Single Node | ✅ |
## Distributions
A Llama Stack Distribution (or "distro") is a pre-configured bundle of provider implementations for each API component. Distributions make it easy to get started with a specific deployment scenario: you can begin with a local development setup (e.g. Ollama) and seamlessly transition to production (e.g. Fireworks) without changing your application code. Here are some of the distributions we support:
Distribution | Llama Stack Docker | Start This Distribution |
---|---|---|
Meta Reference | llamastack/distribution-meta-reference-gpu | Guide |
Meta Reference Quantized | llamastack/distribution-meta-reference-quantized-gpu | Guide |
SambaNova | llamastack/distribution-sambanova | Guide |
Cerebras | llamastack/distribution-cerebras | Guide |
Ollama | llamastack/distribution-ollama | Guide |
TGI | llamastack/distribution-tgi | Guide |
Together | llamastack/distribution-together | Guide |
Fireworks | llamastack/distribution-fireworks | Guide |
vLLM | llamastack/distribution-remote-vllm | Guide |
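The idea that a distribution is a bundle mapping each API to a provider, and that application code stays the same when the bundle changes, can be sketched as below. The `Distribution` class and `describe` function are hypothetical. The local mapping reuses provider ids from the Ollama run configuration shown in the test plan, while the hosted mapping is purely illustrative and is not taken from any real distribution file.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Distribution:
    """Hypothetical model of a distro: a name plus an API -> provider mapping."""

    name: str
    providers: Mapping[str, str]

    def provider_for(self, api: str) -> str:
        return self.providers[api]


# Provider ids taken from the Ollama run configuration in the test plan.
LOCAL = Distribution(
    name="ollama",
    providers={"inference": "ollama", "safety": "llama-guard", "vector_io": "faiss"},
)

# Illustrative values only; not copied from a real hosted distribution.
HOSTED = Distribution(
    name="fireworks",
    providers={"inference": "fireworks", "safety": "llama-guard", "vector_io": "faiss"},
)


def describe(distro: Distribution, apis: List[str]) -> List[str]:
    """Application-side code: identical regardless of which distro is passed."""
    return [f"{api} -> {distro.provider_for(api)}" for api in apis]
```

Calling `describe(LOCAL, ["inference"])` and `describe(HOSTED, ["inference"])` exercises the same code path with different backing providers, which is the property the paragraph above describes.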
## Installation
You have two ways to install this repository:

- Install as a package: You can install the repository directly from PyPI by running the following command:

      pip install llama-stack

- Install from source: If you prefer to install from the source code, make sure you have conda installed. Then, run the following commands:

      mkdir -p ~/local
      cd ~/local
      git clone git@github.com:meta-llama/llama-stack.git
      conda create -n stack python=3.10
      conda activate stack
      cd llama-stack
      pip install -e .

## Documentation
Please check out our Documentation page for more details.
- CLI references
  - llama (server-side) CLI Reference: Guide for using the `llama` CLI to work with Llama models (download, study prompts), and building/starting a Llama Stack distribution.
  - llama (client-side) CLI Reference: Guide for using the `llama-stack-client` CLI, which allows you to query information about the distribution.
- Getting Started
  - Quick guide to start a Llama Stack server.
  - Jupyter notebook that walks through how to use the simple text and vision inference llama_stack_client APIs.
  - The complete Llama Stack lesson Colab notebook of the new Llama 3.2 course on Deeplearning.ai.
  - A Zero-to-Hero Guide that guides you through all the key components of Llama Stack with code samples.
- Contributing
  - Adding a new API Provider: a walk-through of how to add a new API provider.
## Llama Stack Client SDKs
Language | Client SDK | Package |
---|---|---|
Python | llama-stack-client-python | |
Swift | llama-stack-client-swift | |
Typescript | llama-stack-client-typescript | |
Kotlin | llama-stack-client-kotlin |
Check out our client SDKs for connecting to a Llama Stack server in your preferred language. You can choose from Python, TypeScript, Swift, and Kotlin to quickly build your applications.
You can find more example scripts with client SDKs to talk with the Llama Stack server in our llama-stack-apps repo.