Composable building blocks to build Llama Apps https://llama-stack.readthedocs.io
Find a file
Ihar Hrachyshka a2f054607d
fix: cancel scheduler tasks on shutdown (#2130)
# What does this PR do?

Scheduler: cancel tasks on shutdown.

Otherwise the currently running tasks will never exit (before they
actually complete), which means the process can't be properly shut down
(only with SIGKILL).

Ideally, we let tasks know that they are about to shutdown and give them
some time to do so; but in the lack of the mechanism, it's better to
cancel than linger forever.

[//]: # (If resolving an issue, uncomment and update the line below)
[//]: # (Closes #[issue-number])

## Test Plan

Start a long running task (e.g. torchtune or external kfp-provider
training).
Ctr-C the process in TTY. Confirm it exits in reasonable time.

```
^CINFO:     Shutting down
INFO:     Waiting for application shutdown.
13:32:26.187 - INFO - Shutting down
13:32:26.187 - INFO - Shutting down DatasetsRoutingTable
13:32:26.187 - INFO - Shutting down DatasetIORouter
13:32:26.187 - INFO - Shutting down TorchtuneKFPPostTrainingImpl
    Traceback (most recent call last):
      File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 118, in run
        return self._loop.run_until_complete(task)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
        return future.result()
               ^^^^^^^^^^^^^^^
    asyncio.exceptions.CancelledError

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/dsl/executor_main.py", line 109, in <module>
        executor_main()
      File "/Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/dsl/executor_main.py", line 101, in executor_main
        output_file = executor.execute()
                      ^^^^^^^^^^^^^^^^^^
      File "/Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/dsl/executor.py", line 361, in execute
        result = self.func(**func_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/var/folders/45/1q1rx6cn7jbcn2ty852w0g_r0000gn/T/tmp.RKpPrvTWDD/ephemeral_component.py", line 118, in component
        asyncio.run(recipe.setup())
      File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 194, in run
        return runner.run(main)
               ^^^^^^^^^^^^^^^^
      File "/opt/homebrew/Cellar/python@3.12/3.12.4/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/runners.py", line 123, in run
        raise KeyboardInterrupt()
    KeyboardInterrupt


13:32:31.219 - ERROR - Task 'component' finished with status FAILURE
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
INFO     2025-05-09 13:32:31,221 llama_stack.providers.utils.scheduler:221 scheduler: Job
         test-jobc3c2e1e4-859c-4852-a41d-ef29e55e3efa: Pipeline [1m[95m'test-jobc3c2e1e4-859c-4852-a41d-ef29e55e3efa'[1m[0m
         finished with status [1m[91mFAILURE[1m[0m. Inner task failed: [1m[96m'component'[1m[0m.
ERROR    2025-05-09 13:32:31,223 llama_stack_provider_kfp_trainer.scheduler:54 scheduler: Job
         test-jobc3c2e1e4-859c-4852-a41d-ef29e55e3efa failed.
         ╭───────────────────────────────────── Traceback (most recent call last) ─────────────────────────────────────╮
         │ /Users/ihrachys/src/llama-stack-provider-kfp-trainer/src/llama_stack_provider_kfp_trainer/scheduler.py:45   │
         │ in do                                                                                                       │
         │                                                                                                             │
         │    42 │   │   │                                                                                             │
         │    43 │   │   │   job.status = JobStatus.running                                                            │
         │    44 │   │   │   try:                                                                                      │
         │ ❱  45 │   │   │   │   artifacts = self._to_artifacts(job.handler().output)                                  │
         │    46 │   │   │   │   for artifact in artifacts:                                                            │
         │    47 │   │   │   │   │   on_artifact_collected_cb(artifact)                                                │
         │    48                                                                                                       │
         │                                                                                                             │
         │ /Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/dsl/base_compon │
         │ ent.py:101 in __call__                                                                                      │
         │                                                                                                             │
         │    98 │   │   │   │   f'{self.name}() missing {len(missing_arguments)} required '                           │
         │    99 │   │   │   │   f'{argument_or_arguments}: {arguments}.')                                             │
         │   100 │   │                                                                                                 │
         │ ❱ 101 │   │   return pipeline_task.PipelineTask(                                                            │
         │   102 │   │   │   component_spec=self.component_spec,                                                       │
         │   103 │   │   │   args=task_inputs,                                                                         │
         │   104 │   │   │   execute_locally=pipeline_context.Pipeline.get_default_pipeline() is                       │
         │                                                                                                             │
         │ /Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/dsl/pipeline_ta │
         │ sk.py:187 in __init__                                                                                       │
         │                                                                                                             │
         │   184 │   │   ])                                                                                            │
         │   185 │   │                                                                                                 │
         │   186 │   │   if execute_locally:                                                                           │
         │ ❱ 187 │   │   │   self._execute_locally(args=args)                                                          │
         │   188 │                                                                                                     │
         │   189 │   def _execute_locally(self, args: Dict[str, Any]) -> None:                                         │
         │   190 │   │   """Execute the pipeline task locally.                                                         │
         │                                                                                                             │
         │ /Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/dsl/pipeline_ta │
         │ sk.py:197 in _execute_locally                                                                               │
         │                                                                                                             │
         │   194 │   │   from kfp.local import task_dispatcher                                                         │
         │   195 │   │                                                                                                 │
         │   196 │   │   if self.pipeline_spec is not None:                                                            │
         │ ❱ 197 │   │   │   self._outputs = pipeline_orchestrator.run_local_pipeline(                                 │
         │   198 │   │   │   │   pipeline_spec=self.pipeline_spec,                                                     │
         │   199 │   │   │   │   arguments=args,                                                                       │
         │   200 │   │   │   )                                                                                         │
         │                                                                                                             │
         │ /Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/local/pipeline_ │
         │ orchestrator.py:43 in run_local_pipeline                                                                    │
         │                                                                                                             │
         │    40 │                                                                                                     │
         │    41 │   # validate and access all global state in this function, not downstream                           │
         │    42 │   config.LocalExecutionConfig.validate()                                                            │
         │ ❱  43 │   return _run_local_pipeline_implementation(                                                        │
         │    44 │   │   pipeline_spec=pipeline_spec,                                                                  │
         │    45 │   │   arguments=arguments,                                                                          │
         │    46 │   │   raise_on_error=config.LocalExecutionConfig.instance.raise_on_error,                           │
         │                                                                                                             │
         │ /Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/local/pipeline_ │
         │ orchestrator.py:108 in _run_local_pipeline_implementation                                                   │
         │                                                                                                             │
         │   105 │   │   │   )                                                                                         │
         │   106 │   │   return outputs                                                                                │
         │   107 │   elif dag_status == status.Status.FAILURE:                                                         │
         │ ❱ 108 │   │   log_and_maybe_raise_for_failure(                                                              │
         │   109 │   │   │   pipeline_name=pipeline_name,                                                              │
         │   110 │   │   │   fail_stack=fail_stack,                                                                    │
         │   111 │   │   │   raise_on_error=raise_on_error,                                                            │
         │                                                                                                             │
         │ /Users/ihrachys/src/llama-stack-provider-kfp-trainer/.venv/lib/python3.12/site-packages/kfp/local/pipeline_ │
         │ orchestrator.py:137 in log_and_maybe_raise_for_failure                                                      │
         │                                                                                                             │
         │   134 │   │   logging_utils.format_task_name(task_name) for task_name in fail_stack)                        │
         │   135 │   msg = f'Pipeline {pipeline_name_with_color} finished with status                                  │
         │       {status_with_color}. Inner task failed: {task_chain_with_color}.'                                     │
         │   136 │   if raise_on_error:                                                                                │
         │ ❱ 137 │   │   raise RuntimeError(msg)                                                                       │
         │   138 │   with logging_utils.local_logger_context():                                                        │
         │   139 │   │   logging.error(msg)                                                                            │
         │   140                                                                                                       │
         ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
         RuntimeError: Pipeline [1m[95m'test-jobc3c2e1e4-859c-4852-a41d-ef29e55e3efa'[1m[0m finished with status
         [1m[91mFAILURE[1m[0m. Inner task failed: [1m[96m'component'[1m[0m.
INFO     2025-05-09 13:32:31,266 llama_stack.distribution.server.server:136 server: Shutting down
         DistributionInspectImpl
INFO     2025-05-09 13:32:31,266 llama_stack.distribution.server.server:136 server: Shutting down ProviderImpl
INFO:     Application shutdown complete.
INFO:     Finished server process [26648]
```

[//]: # (## Documentation)

Signed-off-by: Ihar Hrachyshka <ihar.hrachyshka@gmail.com>
2025-06-19 17:01:33 +02:00
.github ci: add python package build test (#2457) 2025-06-19 18:57:32 +05:30
docs feat: support filters in file search (#2472) 2025-06-18 21:50:55 -07:00
llama_stack fix: cancel scheduler tasks on shutdown (#2130) 2025-06-19 17:01:33 +02:00
rfcs chore: remove straggler references to llama-models (#1345) 2025-03-01 14:26:03 -08:00
scripts ci: add python package build test (#2457) 2025-06-19 18:57:32 +05:30
tests fix: cancel scheduler tasks on shutdown (#2130) 2025-06-19 17:01:33 +02:00
.coveragerc chore: exclude test, provider, and template directories from coverage (#2028) 2025-04-25 12:16:57 -07:00
.gitignore feat(ui): add infinite scroll pagination to chat completions/responses logs table (#2466) 2025-06-18 15:28:39 -07:00
.pre-commit-config.yaml ci: add python package build test (#2457) 2025-06-19 18:57:32 +05:30
.readthedocs.yaml fix: build docs without requirements.txt (#2294) 2025-05-27 16:27:57 -07:00
CHANGELOG.md docs: Add recent releases (#2424) 2025-06-10 08:43:02 +05:30
CODE_OF_CONDUCT.md Initial commit 2024-07-23 08:32:33 -07:00
CONTRIBUTING.md docs: update contributing guidance around uv python versions (#2398) 2025-06-04 23:12:03 -07:00
install.sh fix: clarify bash requirement in install flow (#2450) 2025-06-17 13:03:28 +05:30
LICENSE Update LICENSE (#47) 2024-08-29 07:39:50 -07:00
MANIFEST.in chore: remove dependencies.json (#2281) 2025-05-27 10:26:57 -07:00
pyproject.toml feat: drop python 3.10 support (#2469) 2025-06-19 12:07:14 +05:30
README.md fix: clarify bash requirement in install flow (#2450) 2025-06-17 13:03:28 +05:30
requirements.txt build: Bump version to 0.2.11 2025-06-17 19:08:17 +00:00
SECURITY.md Create SECURITY.md 2024-10-08 13:30:40 -04:00
uv.lock feat: drop python 3.10 support (#2469) 2025-06-19 12:07:14 +05:30

Llama Stack

PyPI version PyPI - Downloads License Discord Unit Tests Integration Tests

Quick Start | Documentation | Colab Notebook | Discord

🎉 Llama 4 Support 🎉

We released Version 0.2.0 with support for the Llama 4 herd of models released by Meta.

👋 Click here to see how to run Llama 4 models on Llama Stack


Note you need 8xH100 GPU-host to run these models

pip install -U llama_stack

MODEL="Llama-4-Scout-17B-16E-Instruct"
# get meta url from llama.com
llama model download --source meta --model-id $MODEL --meta-url <META_URL>

# start a llama stack server
INFERENCE_MODEL=meta-llama/$MODEL llama stack build --run --template meta-reference-gpu

# install client to interact with the server
pip install llama-stack-client

CLI

# Run a chat completion
llama-stack-client --endpoint http://localhost:8321 \
inference chat-completion \
--model-id meta-llama/$MODEL \
--message "write a haiku for meta's llama 4 models"

ChatCompletionResponse(
    completion_message=CompletionMessage(content="Whispers in code born\nLlama's gentle, wise heartbeat\nFuture's soft unfold", role='assistant', stop_reason='end_of_turn', tool_calls=[]),
    logprobs=None,
    metrics=[Metric(metric='prompt_tokens', value=21.0, unit=None), Metric(metric='completion_tokens', value=28.0, unit=None), Metric(metric='total_tokens', value=49.0, unit=None)]
)

Python SDK

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url=f"http://localhost:8321")

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
prompt = "Write a haiku about coding"

print(f"User> {prompt}")
response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
)
print(f"Assistant> {response.completion_message.content}")

As more providers start supporting Llama 4, you can use them in Llama Stack as well. We are adding to the list. Stay tuned!

🚀 One-Line Installer 🚀

To try Llama Stack locally, run:

curl -LsSf https://github.com/meta-llama/llama-stack/raw/main/install.sh | bash

Overview

Llama Stack standardizes the core building blocks that simplify AI application development. It codifies best practices across the Llama ecosystem. More specifically, it provides

  • Unified API layer for Inference, RAG, Agents, Tools, Safety, Evals, and Telemetry.
  • Plugin architecture to support the rich ecosystem of different API implementations in various environments, including local development, on-premises, cloud, and mobile.
  • Prepackaged verified distributions which offer a one-stop solution for developers to get started quickly and reliably in any environment.
  • Multiple developer interfaces like CLI and SDKs for Python, Typescript, iOS, and Android.
  • Standalone applications as examples for how to build production-grade AI applications with Llama Stack.
Llama Stack

Llama Stack Benefits

  • Flexible Options: Developers can choose their preferred infrastructure without changing APIs and enjoy flexible deployment choices.
  • Consistent Experience: With its unified APIs, Llama Stack makes it easier to build, test, and deploy AI applications with consistent application behavior.
  • Robust Ecosystem: Llama Stack is already integrated with distribution partners (cloud providers, hardware vendors, and AI-focused companies) that offer tailored infrastructure, software, and services for deploying Llama models.

By reducing friction and complexity, Llama Stack empowers developers to focus on what they do best: building transformative generative AI applications.

API Providers

Here is a list of the various API providers and available distributions that can help developers get started easily with Llama Stack.

API Provider Builder Environments Agents Inference Memory Safety Telemetry Post Training
Meta Reference Single Node
SambaNova Hosted
Cerebras Hosted
Fireworks Hosted
AWS Bedrock Hosted
Together Hosted
Groq Hosted
Ollama Single Node
TGI Hosted and Single Node
NVIDIA NIM Hosted and Single Node
Chroma Single Node
PG Vector Single Node
PyTorch ExecuTorch On-device iOS
vLLM Hosted and Single Node
OpenAI Hosted
Anthropic Hosted
Gemini Hosted
watsonx Hosted
HuggingFace Single Node
TorchTune Single Node
NVIDIA NEMO Hosted

Distributions

A Llama Stack Distribution (or "distro") is a pre-configured bundle of provider implementations for each API component. Distributions make it easy to get started with a specific deployment scenario - you can begin with a local development setup (eg. ollama) and seamlessly transition to production (eg. Fireworks) without changing your application code. Here are some of the distributions we support:

Distribution Llama Stack Docker Start This Distribution
Meta Reference llamastack/distribution-meta-reference-gpu Guide
SambaNova llamastack/distribution-sambanova Guide
Cerebras llamastack/distribution-cerebras Guide
Ollama llamastack/distribution-ollama Guide
TGI llamastack/distribution-tgi Guide
Together llamastack/distribution-together Guide
Fireworks llamastack/distribution-fireworks Guide
vLLM llamastack/distribution-remote-vllm Guide

Documentation

Please checkout our Documentation page for more details.

Llama Stack Client SDKs

Language Client SDK Package
Python llama-stack-client-python PyPI version
Swift llama-stack-client-swift Swift Package Index
Typescript llama-stack-client-typescript NPM version
Kotlin llama-stack-client-kotlin Maven version

Check out our client SDKs for connecting to a Llama Stack server in your preferred language, you can choose from python, typescript, swift, and kotlin programming languages to quickly build your applications.

You can find more example scripts with client SDKs to talk with the Llama Stack server in our llama-stack-apps repo.