# Llama Stack Quickstart Guide
This guide will walk you through setting up an end-to-end workflow with Llama Stack, enabling you to perform text generation using the `Llama3.2-11B-Vision-Instruct` model. Follow these steps to get started quickly.
## Table of Contents

- Prerequisite
- Installation
- Download Llama Models
- Build, Configure, and Run Llama Stack
- Testing with `curl`
- Testing with Python
- Next Steps
## Prerequisite
Ensure you have the following installed on your system:
- Conda: A package, dependency, and environment management tool.
## Installation
The `llama` CLI tool helps you manage the Llama Stack toolchain and agent systems.
Install via PyPI:

```bash
pip install llama-stack
```
After installation, the `llama` command should be available in your `PATH`.
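
If you want a quick sanity check that the CLI is visible from Python tooling as well, the short sketch below looks it up with the standard library; the script name `llama` is taken from the install step above.

```python
import shutil

# Check that the `llama` CLI installed by llama-stack is discoverable on PATH.
llama_path = shutil.which("llama")

if llama_path:
    print(f"llama CLI found at: {llama_path}")
else:
    print("llama CLI not found; make sure the environment you installed into is active.")
```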
## Download Llama Models
Download the necessary Llama model checkpoints using the `llama` CLI:

```bash
llama download --model-id Llama3.2-11B-Vision-Instruct
```
Follow the CLI prompts to complete the download. You may need to accept a license agreement. Obtain an instant license here.
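
To confirm the checkpoint files actually landed on disk, a minimal sketch such as the one below can help. The `~/.llama/checkpoints/<model-id>` location is an assumption about the default download directory and may differ on your system.

```python
from pathlib import Path

# Assumption: `llama download` stores checkpoints under ~/.llama/checkpoints/<model-id>.
# Adjust the path if your installation uses a different location.
model_dir = Path.home() / ".llama" / "checkpoints" / "Llama3.2-11B-Vision-Instruct"

if model_dir.is_dir():
    files = sorted(p.name for p in model_dir.iterdir())
    print(f"Found {len(files)} files in {model_dir}")
else:
    print(f"No checkpoint directory at {model_dir}; re-run the download or check the path.")
```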
## Build, Configure, and Run Llama Stack
### 1. Build the Llama Stack Distribution
By default we will build the `meta-reference-gpu` distribution; you can read more about the other distributions here.

```bash
llama stack build --template meta-reference-gpu --image-type conda
```
### 2. Run the Llama Stack Distribution
Launching a distribution initializes and configures the necessary APIs and Providers, enabling seamless interaction with the underlying model.
Start the server with the configured stack:
```bash
cd llama-stack/distributions/meta-reference-gpu
llama stack run ./run.yaml
```
The server will start and listen on `http://localhost:5000` by default.
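
Before testing the API, you can optionally confirm that something is listening on the default port. The sketch below only checks TCP reachability of `localhost:5000`; it does not exercise any Llama Stack endpoint.

```python
import socket

# Minimal reachability check for the Llama Stack server (default http://localhost:5000).
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(2)
    result = sock.connect_ex(("localhost", 5000))

print("Server port is open." if result == 0 else f"Could not connect (error code {result}).")
```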
## Testing with `curl`
After setting up the server, verify it's working by sending a `POST` request using `curl`:
```bash
curl http://localhost:5000/inference/chat_completion \
-H "Content-Type: application/json" \
-d '{
    "model": "Llama3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"}
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512}
}'
```
Expected Output:
```json
{
    "completion_message": {
        "role": "assistant",
        "content": "The moon glows softly in the midnight sky,\nA beacon of wonder, as it catches the eye.",
        "stop_reason": "out_of_tokens",
        "tool_calls": []
    },
    "logprobs": null
}
```
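
If you prefer to stay in Python, the same request can be sent with the `requests` library. This is a sketch of the exact call shown in the `curl` example above, assuming `requests` is installed (`pip install requests`).

```python
import requests

# Same chat_completion request as the curl example above.
payload = {
    "model": "Llama3.1-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write me a 2-sentence poem about the moon"},
    ],
    "sampling_params": {"temperature": 0.7, "seed": 42, "max_tokens": 512},
}

response = requests.post(
    "http://localhost:5000/inference/chat_completion",
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json()["completion_message"]["content"])
```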
## Testing with Python
You can also interact with the Llama Stack server using a simple Python script. Below is an example:
### 1. Install Required Python Packages
The `llama-stack-client` library provides robust and efficient Python methods for interacting with the Llama Stack server.

```bash
pip install llama-stack-client
```
### 2. Create a Python Script (`test_llama_stack.py`)
```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import SystemMessage, UserMessage

# Initialize the client
client = LlamaStackClient(base_url="http://localhost:5000")

# Create a chat completion request
response = client.inference.chat_completion(
    messages=[
        SystemMessage(content="You are a helpful assistant.", role="system"),
        UserMessage(content="Write me a 2-sentence poem about the moon", role="user"),
    ],
    model="Llama3.1-8B-Instruct",
)

# Print the response
print(response.completion_message.content)
```
### 3. Run the Python Script
```bash
python test_llama_stack.py
```
Expected Output:
```
The moon glows softly in the midnight sky,
A beacon of wonder, as it catches the eye.
```
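
The same client can also drive a simple conversation loop. The sketch below reuses only the `chat_completion` call from the script above and keeps a running message list; whether the returned `completion_message` can be appended back into the history as-is is an assumption, so treat this as a starting point rather than a finished chat application.

```python
from llama_stack_client import LlamaStackClient
from llama_stack_client.types import SystemMessage, UserMessage

client = LlamaStackClient(base_url="http://localhost:5000")

# Seed the conversation with a system prompt, then loop on user input.
messages = [SystemMessage(content="You are a helpful assistant.", role="system")]

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break
    messages.append(UserMessage(content=user_input, role="user"))
    response = client.inference.chat_completion(
        messages=messages,
        model="Llama3.1-8B-Instruct",
    )
    print("Assistant:", response.completion_message.content)
    # Assumption: the returned completion_message is accepted as a message on the next turn.
    messages.append(response.completion_message)
```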
With these steps, you should have a functional Llama Stack setup capable of generating text using the specified model. For more detailed information and advanced configurations, refer to some of our documentation below.
## Next Steps
- Explore Other Guides: Dive deeper into specific topics by following these guides:
  - Understanding Distributions
  - Configure your Distro
  - Doing Inference API Call and Fetching a Response from Endpoints
  - Creating a Conversation Loop
  - Sending Image to the Model
  - Tool Calling: How to and Details
  - Memory API: Show Simple In-Memory Retrieval
  - Agents API: Explain Components
  - Using Safety API in Conversation
  - Prompt Engineering Guide
- Explore Client SDKs: Utilize our client SDKs for various languages to integrate Llama Stack into your applications.
- Advanced Configuration: Learn how to customize your Llama Stack distribution by referring to the Building a Llama Stack Distribution guide.
- Explore Example Apps: Check out llama-stack-apps for example applications built using Llama Stack.