# Quick Start

In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple RAG agent.

A Llama Stack agent is a simple integrated system that can perform tasks by combining a Llama model for reasoning with tools (e.g., RAG, web search, code execution) for taking actions.

In Llama Stack, we provide a server exposing multiple APIs. These APIs are backed by implementations from different providers. For this guide, we will use Ollama as the inference provider.

## 1. Start Ollama

```bash
ollama run llama3.2:3b-instruct-fp16 --keepalive 60m
```

By default, Ollama keeps the model loaded in memory for only 5 minutes, which can be too short. We set the `--keepalive` flag to 60 minutes so the model stays loaded while you work through this guide.
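If you run the Ollama server yourself instead of letting `ollama run` manage it, you can get the same effect with Ollama's `OLLAMA_KEEP_ALIVE` environment variable (optional; shown here as an alternative to the flag above):

```bash
# Optional alternative: set the default keepalive for the whole Ollama server
export OLLAMA_KEEP_ALIVE=60m
ollama serve
```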

:::{admonition} Note
:class: tip

If you do not have ollama, you can install it from [here](https://ollama.com/download).
:::

## 2. Use uv to install and run Llama Stack

Install uv to set up your virtual environment:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Set up and activate the virtual environment:

```bash
uv venv --python 3.10
source .venv/bin/activate
```

Install Llama Stack:

```bash
uv pip install llama-stack
```

Build Llama Stack with the Ollama template:

```bash
llama stack build --template ollama --image-type venv
```

Run Llama Stack:

```bash
# Use the model from Ollama. Run `ollama ps` to check that it's still loaded.
INFERENCE_MODEL=llama3.2:3b-instruct-fp16 \
    llama stack run ollama --image-type venv
```

You will see output like the following:

```
...
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
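Before moving on, you can quickly verify the server is reachable from another terminal. This assumes the default port (8321) and the `/v1/health` endpoint; adjust if your configuration differs:

```bash
curl http://localhost:8321/v1/health
# Expected: a small JSON payload such as {"status": "OK"}
```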

Now you can use the Llama Stack client to run inference and build agents!

:::{dropdown} Installing the Llama Stack client CLI and SDK

Open a new terminal and navigate to the same directory you started the server from.

Activate the virtual environment (the `llama-stack` package already includes the client SDK):

```bash
source .venv/bin/activate
```
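The server started in step 2 listens on port 8321 by default. The commands below reference it through a `LLAMA_STACK_PORT` variable, so export it to match:

```bash
export LLAMA_STACK_PORT=8321
```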

Use the `llama-stack-client` CLI to check connectivity to the server:

```bash
llama-stack-client configure --endpoint http://localhost:$LLAMA_STACK_PORT --api-key none
```

You will see:

```
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
```

List the models:

```bash
llama-stack-client models list
```

```
Available Models

┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ model_type      ┃ identifier                          ┃ provider_resource_id                ┃ metadata                                  ┃ provider_id     ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ embedding       │ all-MiniLM-L6-v2                    │ all-minilm:latest                   │ {'embedding_dimension': 384.0}            │ ollama          │
├─────────────────┼─────────────────────────────────────┼─────────────────────────────────────┼───────────────────────────────────────────┼─────────────────┤
│ llm             │ llama3.2:3b-instruct-fp16           │ llama3.2:3b-instruct-fp16           │                                           │ ollama          │
└─────────────────┴─────────────────────────────────────┴─────────────────────────────────────┴───────────────────────────────────────────┴─────────────────┘

Total models: 2
```
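You can also list the registered providers; the RAG example later in this guide looks up the `vector_io` provider programmatically in the same way (the exact CLI output format may vary by version):

```bash
llama-stack-client providers list
```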

You can test basic Llama inference completion using the CLI too:

```bash
llama-stack-client inference chat-completion --message "tell me a joke"
```

```
ChatCompletionResponse(
    completion_message=CompletionMessage(
        content="Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta!",
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=14.0, unit=None),
        Metric(metric='completion_tokens', value=27.0, unit=None),
        Metric(metric='total_tokens', value=41.0, unit=None)
    ]
)
```

:::

 

## 3. Run inference with the Python SDK

Here is a simple example that performs a chat completion using the SDK. Save it as `lstest.py`:

```python
# lstest.py
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List available models
models = client.models.list()

# Find the first LLM
llm = next(m for m in models if m.model_type == 'llm')
model_id = llm.identifier

print("Model:", model_id)

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},
    ],
)
print(response.completion_message.content)
```

Run the script:

```bash
python lstest.py
```

Sample output:

```
Model: llama3.2:3b-instruct-fp16
Here is a haiku about coding:

Lines of code unfold
Logic flows through digital night
Beauty in the bits
```
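`chat_completion` also accepts `stream=True` if you want tokens as they are generated. A minimal sketch, assuming the same server and model as above; the exact chunk shape (`event.event_type`, `event.delta.text`) can vary between client versions, so treat those field accesses as assumptions:

```python
# stream_test.py (hypothetical companion to lstest.py)
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
llm = next(m for m in client.models.list() if m.model_type == "llm")

stream = client.inference.chat_completion(
    model_id=llm.identifier,
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    # Assumption: progress chunks carry a text delta at event.delta.text
    if chunk.event.event_type == "progress":
        print(chunk.event.delta.text, end="", flush=True)
print()
```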

## 4. Your first agent

Save the following as `lsagent.py`:

```python
# lsagent.py
from llama_stack_client import LlamaStackClient, Agent, AgentEventLogger
import uuid

client = LlamaStackClient(base_url="http://localhost:8321")

models = client.models.list()
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier

agent = Agent(
    client,
    model=model_id,
    instructions="You are a helpful assistant that can answer questions about the Torchtune project.",
)

s_id = agent.create_session(session_name=f"s{uuid.uuid4()}")

# Non-streaming example
print("Non-streaming ...")
response = agent.create_turn(
    messages=[{"role": "user", "content": "Who are you?"}],
    session_id=s_id,
    stream=False,
)
print("agent>", response.output_message.content)

# Streaming with print helper
print("Streaming with print helper...")
stream = agent.create_turn(
    messages=[{"role": "user", "content": "Who are you?"}],
    session_id=s_id,
    stream=True,
)
for event in AgentEventLogger().log(stream):
    event.print()

# Streaming example: print the raw event objects
print("Streaming ...")
stream = agent.create_turn(
    messages=[{"role": "user", "content": "Who are you?"}],
    session_id=s_id,
    stream=True,
)
for event in stream:
    print(event)
```

Run the agent:

```bash
python lsagent.py
```

Sample output:

```
Non-streaming ...
agent> I'm an AI assistant, and I'll be happy to help with any questions or information you have about the Torchtune project.

For those who may not know, Torchtune is a popular open-source music composition tool that allows users to create and share musical compositions using a unique visual interface. It's designed to make music creation more accessible and fun for everyone, regardless of their musical background or experience level.

What would you like to know about Torchtune? Are you looking for information on how to use the software, tutorials, or perhaps something else?
Streaming with print helper...
inference> I am an AI assistant specifically designed to provide information and support related to the Torchtune project. I don't have a personal identity in the classical sense, but I'm here to help answer your questions, provide guidance, and offer assistance with any topics related to Torchtune.

I've been trained on a vast amount of text data, including documentation, tutorials, and community discussions about Torchtune, which enables me to provide accurate and up-to-date information. My goal is to be helpful and informative, so feel free to ask me anything you'd like to know about Torchtune!
Streaming ...
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepStartPayload(event_type='step_start', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', metadata={})))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text='I', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text=' am', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
...
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text='!', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepCompletePayload(event_type='step_complete', step_details=InferenceStep(api_model_response=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 976952, tzinfo=TzInfo(UTC)), started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840716, tzinfo=TzInfo(UTC))), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseTurnCompletePayload(event_type='turn_complete', turn=Turn(input_messages=[UserMessage(content='Who are you?', role='user', context=None)], output_message=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), session_id='a705b5a1-b9a6-4cf5-a99a-7917cc093755', started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840680, tzinfo=TzInfo(UTC)), steps=[InferenceStep(api_model_response=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 976952, tzinfo=TzInfo(UTC)), started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840716, tzinfo=TzInfo(UTC)))], turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 987353, tzinfo=TzInfo(UTC)), output_attachments=[]))))
```
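If you only want the generated text out of the raw stream, rather than the full event objects shown above, filter for `step_progress` payloads. This is the same pattern the RAG example in the next section uses (note a stream can only be consumed once, so this replaces the final `for event in stream` loop):

```python
# Print only the text deltas from the raw agent event stream
for chunk in stream:
    payload = chunk.event.payload
    if payload.event_type == "step_progress":
        print(payload.delta.text, end="", flush=True)
```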
## 5. RAG agent

This example builds on the agent above: it registers a vector database, ingests Torchtune documentation into it, and gives the agent the builtin RAG tool for retrieval. Save the following as `rag_agent.py`:

```python
# rag_agent.py
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from llama_stack_client.types import Document
import uuid

client = LlamaStackClient(base_url="http://localhost:8321")

# Create a vector database instance
embedlm = next(m for m in client.models.list() if m.model_type == 'embedding')
embedding_model = embedlm.identifier
vdb = next(p for p in client.providers.list() if p.api == "vector_io")
vector_db_id = f"v{uuid.uuid4()}"
client.vector_dbs.register(
    provider_id=vdb.provider_id,
    vector_db_id=vector_db_id,
    embedding_model=embedding_model,
)

# Create Documents
urls = [
    "memory_optimizations.rst",
    "chat.rst",
    "llama3.rst",
    "datasets.rst",
    "qat_finetune.rst",
    "lora_finetune.rst",
]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

# Insert documents
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

# Get the model being served
llm = next(m for m in client.models.list() if m.model_type == 'llm')
model = llm.identifier

# Create RAG agent
ragagent = Agent(client,
    model=model,
    instructions="You are a helpful assistant that can answer questions about the Torchtune project. Use the RAG tool to answer questions as needed.",
    tools=[{
        "name": "builtin::rag",
        "args": {"vector_db_ids": [vector_db_id]},
    }],
)

s_id = ragagent.create_session(
    session_name=f"s{uuid.uuid4()}"
)

turns = [
    "what is torchtune",
    "tell me about dora"
]

for t in turns:
    print("user>", t)
    stream = ragagent.create_turn(
        messages=[{
            "role": "user",
            "content": t
        }],
        session_id=s_id,
        stream=True
    )
    for chunk in stream:
        event_type = chunk.event.payload.event_type
        if event_type == 'step_progress':
            print(chunk.event.payload.delta.text, end='', flush=True)
```

Run the RAG agent:

```bash
python rag_agent.py
```

Sample output:

```
user> what is torchtune
inference> [knowledge_search(query='TorchTune')]
tool_execution> Tool:knowledge_search Args:{'query': 'TorchTune'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text='Result 1:\nDocument_id:num-1\nContent:  conversational data, :func:`~torchtune.datasets.chat_dataset` seems to be a good fit. ..., type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Here is a high-level overview of the text:

**LoRA Finetuning with PyTorch Tune**

PyTorch Tune provides a recipe for LoRA (Low-Rank Adaptation) finetuning, which is a technique to adapt pre-trained models to new tasks. The recipe uses the `lora_finetune_distributed` command.
...
Overall, DORA is a powerful reinforcement learning algorithm that can learn complex tasks from human demonstrations. However, it requires careful consideration of the challenges and limitations to achieve optimal results.
```

## Next Steps