
# Build your own Distribution

This guide walks you through building a Llama Stack distribution from scratch with your choice of API providers.

## Llama Stack Build

To build your own distribution, we recommend cloning the `llama-stack` repository and installing it in editable mode:

```bash
git clone git@github.com:meta-llama/llama-stack.git
cd llama-stack
pip install -e .
```

You can then inspect the available build options with:

```bash
llama stack build -h
```

We will start building our distribution (in the form of a Conda environment or Docker image). In this step, we will specify:

- `name`: the name for our distribution (e.g. `my-stack`)
- `image_type`: our build image type (`conda | docker`)
- `distribution_spec`: our distribution specs for specifying API providers
  - `description`: a short description of the configurations for the distribution
  - `providers`: specifies the underlying implementation for serving each API endpoint
  - `image_type`: `conda` | `docker` to specify whether to build the distribution as a Docker image or a Conda environment

After this step is complete, a file named `<name>-build.yaml` and a template file `<name>-run.yaml` will be generated and saved at the output file path specified at the end of the command.
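As a quick sanity check after any build, you can list the generated files. A sketch, assuming a distribution named `my-stack`; the `llamastack-<name>` directory convention follows the paths printed later in this guide:

```bash
# List the build and run configs written out by `llama stack build`.
# The directory name here is an assumption based on the chosen <name>.
ls ~/.llama/distributions/llamastack-my-stack/
# my-stack-build.yaml  my-stack-run.yaml
```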

::::{tab-set}

:::{tab-item} Building from Scratch

- For a new user, you can start by running `llama stack build`, which launches an interactive wizard that prompts you for the build configurations.

```
llama stack build

> Enter a name for your Llama Stack (e.g. my-local-stack): my-stack
> Enter the image type you want your Llama Stack to be built as (docker or conda): conda

Llama Stack is composed of several APIs working together. Let's select
the provider types (implementations) you want to use for these APIs.

Tip: use <TAB> to see options for the providers.

> Enter provider for API inference: inline::meta-reference
> Enter provider for API safety: inline::llama-guard
> Enter provider for API agents: inline::meta-reference
> Enter provider for API memory: inline::faiss
> Enter provider for API datasetio: inline::meta-reference
> Enter provider for API scoring: inline::meta-reference
> Enter provider for API eval: inline::meta-reference
> Enter provider for API telemetry: inline::meta-reference

> (Optional) Enter a short description for your Llama Stack:

You can now edit ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml`
```
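Before launching the stack, you can inspect (and edit) the generated run config at the path printed above:

```bash
# View the run config produced by the wizard; adjust providers or ports as needed.
cat ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml
```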

:::

:::{tab-item} Building from a template

- We provide distribution templates so you can get started with a distribution backed by alternative API providers.

The following command will allow you to see the available templates and their corresponding providers.

```
llama stack build --list-templates
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| Template Name                | Providers                              | Description                                                                 |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| tgi                          | {                                      | Use (an external) TGI server for running LLM inference                      |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::tgi"                      |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| remote-vllm                  | {                                      | Use (an external) vLLM server for running LLM inference                     |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::vllm"                     |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| vllm-gpu                     | {                                      | Use a built-in vLLM engine for running LLM inference                        |
|                              |   "inference": [                       |                                                                             |
|                              |     "inline::vllm"                     |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| meta-reference-quantized-gpu | {                                      | Use Meta Reference with fp8, int4 quantization for running LLM inference    |
|                              |   "inference": [                       |                                                                             |
|                              |     "inline::meta-reference-quantized" |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| meta-reference-gpu           | {                                      | Use Meta Reference for running LLM inference                                |
|                              |   "inference": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| hf-serverless                | {                                      | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::hf::serverless"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| together                     | {                                      | Use Together.AI for running LLM inference                                   |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::together"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| ollama                       | {                                      | Use (an external) Ollama server for running LLM inference                   |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::ollama"                   |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| bedrock                      | {                                      | Use AWS Bedrock for running LLM inference and safety                        |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::bedrock"                  |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "remote::bedrock"                  |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| hf-endpoint                  | {                                      | Use (an external) Hugging Face Inference Endpoint for running LLM inference |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::hf::endpoint"             |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| fireworks                    | {                                      | Use Fireworks.AI for running LLM inference                                  |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::fireworks"                |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::faiss",                   |                                                                             |
|                              |     "remote::chromadb",                |                                                                             |
|                              |     "remote::pgvector"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
| cerebras                     | {                                      | Use Cerebras for running LLM inference                                      |
|                              |   "inference": [                       |                                                                             |
|                              |     "remote::cerebras"                 |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "safety": [                          |                                                                             |
|                              |     "inline::llama-guard"              |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "memory": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "agents": [                          |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ],                                   |                                                                             |
|                              |   "telemetry": [                       |                                                                             |
|                              |     "inline::meta-reference"           |                                                                             |
|                              |   ]                                    |                                                                             |
|                              | }                                      |                                                                             |
+------------------------------+----------------------------------------+-----------------------------------------------------------------------------+
```

You may then pick a template to build your distribution with providers fitted to your liking.

For example, to build a distribution with TGI as the inference provider, you can run:

```
$ llama stack build --template tgi
...
You can now edit ~/.llama/distributions/llamastack-tgi/tgi-run.yaml and run `llama stack run ~/.llama/distributions/llamastack-tgi/tgi-run.yaml`
```

:::

:::{tab-item} Building from a pre-existing build config file

- In addition to templates, you may customize the build to your liking by editing a build config file and building from it with the following command.

- The config file will have contents like the ones in `llama_stack/templates/*build.yaml`.

```
$ cat llama_stack/templates/ollama/build.yaml
name: ollama
distribution_spec:
  description: Like local, but use ollama for running LLM inference
  providers:
    inference: remote::ollama
    memory: inline::faiss
    safety: inline::llama-guard
    agents: meta-reference
    telemetry: meta-reference
image_type: conda
```

```
llama stack build --config llama_stack/templates/ollama/build.yaml
```
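You can also copy a template config and tweak it before building. A small sketch, using provider names taken from the template list above (GNU `sed` shown; the distribution name is a hypothetical choice):

```bash
# Start from the ollama template, rename the distribution, and switch the
# memory provider from inline::faiss to remote::chromadb, then build.
cp llama_stack/templates/ollama/build.yaml my-build.yaml
sed -i 's/^name: ollama/name: my-ollama-chroma/' my-build.yaml
sed -i 's/memory: inline::faiss/memory: remote::chromadb/' my-build.yaml
llama stack build --config my-build.yaml
```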

:::

:::{tab-item} Building Docker

```{tip}
Podman is supported as an alternative to Docker. Set `DOCKER_BINARY` to `podman` in your environment to use Podman.
```
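For example, before running the build command below:

```bash
# Route the image build through Podman instead of Docker (per the tip above).
export DOCKER_BINARY=podman
```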

To build a Docker image, you may start off from a template and use the `--image-type docker` flag to specify `docker` as the build image type.

```
$ llama stack build --template ollama --image-type docker
...
Dockerfile created successfully in /tmp/tmp.viA3a3Rdsg/Dockerfile
FROM python:3.10-slim
...
You can now edit ~/meta-llama/llama-stack/tmp/configs/ollama-run.yaml and run `llama stack run ~/meta-llama/llama-stack/tmp/configs/ollama-run.yaml`
```
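You can confirm the image was produced before testing it (the `llamastack` name filter is an assumption; adjust it to whatever `docker images` lists on your machine):

```bash
# Look for the freshly built distribution image.
docker images | grep llamastack
```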

After this step is successful, you should be able to find the built Docker image and test it with `llama stack run <path/to/run.yaml>`.

:::

::::

## Running your Stack server

Now, let's start the Llama Stack Distribution Server. You will need the YAML configuration file that was written out at the end of the `llama stack build` step.

```
$ llama stack run ~/.llama/distributions/llamastack-my-local-stack/my-local-stack-run.yaml

Serving API inspect
 GET /health
 GET /providers/list
 GET /routes/list
Serving API inference
 POST /inference/chat_completion
 POST /inference/completion
 POST /inference/embeddings
...
Serving API agents
 POST /agents/create
 POST /agents/session/create
 POST /agents/turn/create
 POST /agents/delete
 POST /agents/session/delete
 POST /agents/session/get
 POST /agents/step/get
 POST /agents/turn/get

Listening on ['::', '0.0.0.0']:5000
INFO:     Started server process [2935911]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:5000 (Press CTRL+C to quit)
INFO:     2401:db00:35c:2d2b:face:0:c9:0:54678 - "GET /models/list HTTP/1.1" 200 OK
```
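With the server up, you can exercise any of the routes printed above. A minimal sketch against the chat completion endpoint; the `model_id` value is an assumption, so substitute a model your inference provider actually serves:

```bash
# Send a simple chat completion request to the stack listening on port 5000.
curl -s http://localhost:5000/inference/chat_completion \
  -H 'Content-Type: application/json' \
  -d '{
    "model_id": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
```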

## Troubleshooting

If you encounter any issues, search through our [GitHub Issues](https://github.com/meta-llama/llama-stack/issues), or file a new issue.