
# Quickstart

This guide will walk you through the steps to set up an end-to-end workflow with Llama Stack. It focuses on building a Llama Stack distribution and starting up a Llama Stack server. See our documentation for more on Llama Stack's capabilities, or visit llama-stack-apps for example apps.

## 0. Prerequisite

Feel free to skip this step if you already have the prerequisite installed.

1. conda (steps to install)

## 1. Installation

The `llama` CLI tool helps you manage the Llama toolchain and agentic systems. After installing the `llama-stack` package, the `llama` command should be available in your path.

Install directly from PyPI with:

```
pip install llama-stack
```

## 2. Download Llama models

```
llama download --model-id Llama3.1-8B-Instruct
```

You will have to follow the instructions in the CLI to complete the download. Get an instant license here: URL to license.

## 3. Build -> Configure -> Run via Conda

For development, build a Llama Stack distribution from scratch.

**`llama stack build`**

Enter build information interactively:

```
llama stack build
```

**`llama stack configure`**

Run `llama stack configure <name>` using the name from the build step:

```
llama stack configure my-local-stack
```

**`llama stack run`**

Start the server with:

```
llama stack run my-local-stack
```

## 4. Testing with Client

After setup, test the server with a POST request:

```
curl http://localhost:5000/inference/chat_completion \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"}
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
```
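The same request can be issued from Python using only the standard library. The sketch below mirrors the curl command above; the `build_payload` and `chat_completion` helper names are illustrative (not part of any SDK), and the server address assumes the default local setup from step 3.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # default local server address


def build_payload(prompt: str) -> dict:
    """Build the chat_completion request body shown in the curl example."""
    return {
        "model": "Llama3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
    }


def chat_completion(prompt: str) -> str:
    """POST the request to the local server and return the raw response body."""
    req = urllib.request.Request(
        f"{BASE_URL}/inference/chat_completion",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8")
```

With the server from step 3 running, `print(chat_completion("Write me a 2-sentence poem about the moon"))` sends the same request as the curl command; the exact response shape depends on your server version.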


Check out our client SDKs for various languages: Python, Node, Swift, and Kotlin.

## Advanced Guides

For more on custom Llama Stack distributions, refer to our Building a Llama Stack Distribution guide.

## Next Steps

Check out