# Quick Start

In this guide, we'll walk through how you can use the Llama Stack (server and client SDK) to test a simple RAG agent.

A Llama Stack agent is an integrated system that performs tasks by combining a Llama model for reasoning with tools (e.g., RAG, web search, code execution) for taking actions.

In Llama Stack, we provide a server exposing multiple APIs. These APIs are backed by implementations from different providers. For this guide, we will use [Ollama](https://ollama.com/) as the inference provider.

### 1. Start Ollama

```bash
ollama run llama3.2:3b-instruct-fp16 --keepalive 60m
```

By default, Ollama keeps the model loaded in memory for only 5 minutes, which can be too short. We set the `--keepalive` flag to 60 minutes so the model stays loaded while you work through this guide.

```{admonition} Note
:class: tip
If you do not have Ollama, you can install it from [here](https://ollama.com/download).
```

### 2. Use `uv` to install and run Llama Stack

Install [uv](https://docs.astral.sh/uv/) to set up your virtual environment:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Set up the venv:

```bash
uv venv --python 3.10
source .venv/bin/activate
```

Install Llama Stack:

```bash
uv pip install llama-stack
```

Build Llama Stack for Ollama:

```bash
llama stack build --template ollama --image-type venv
```

Run Llama Stack:

```bash
# Use the model from ollama. Run `ollama ps` to check that it's still running.
INFERENCE_MODEL=llama3.2:3b-instruct-fp16 \
  llama stack run ollama --image-type venv
```

You will see output like the following:

```
...
INFO: Application startup complete.
INFO: Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
```
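With the server running, you can optionally verify that it is reachable from Python before moving on. Here is a minimal sketch (assuming the server is on the default port 8321; the `check_server.py` file name is just for illustration):

```python
## check_server.py (illustrative name, not part of the repo)
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List the providers backing each API on this server. With the ollama
# template you should see an "inference" entry backed by ollama.
for p in client.providers.list():
    print(p.api, "->", p.provider_id)
```

If this prints a list of providers rather than raising a connection error, the server is up.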
Now you can use the llama stack client to run inference and build agents!

:::{dropdown} Installing the Llama Stack client CLI and SDK

Open a new terminal and navigate to the same directory you started the server from.

Activate the venv (the `llama-stack` package already includes the client):

```bash
source .venv/bin/activate
```

Let's use the `llama-stack-client` CLI to check connectivity to the server:

```bash
llama-stack-client configure --endpoint http://localhost:8321 --api-key none
```

You will see:

```
Done! You can now use the Llama Stack Client CLI with endpoint http://localhost:8321
```

List the models:

```bash
llama-stack-client models list
```

```
Available Models

┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ model_type ┃ identifier                ┃ provider_resource_id      ┃ metadata                       ┃ provider_id ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ embedding  │ all-MiniLM-L6-v2          │ all-minilm:latest         │ {'embedding_dimension': 384.0} │ ollama      │
├────────────┼───────────────────────────┼───────────────────────────┼────────────────────────────────┼─────────────┤
│ llm        │ llama3.2:3b-instruct-fp16 │ llama3.2:3b-instruct-fp16 │                                │ ollama      │
└────────────┴───────────────────────────┴───────────────────────────┴────────────────────────────────┴─────────────┘

Total models: 2
```

You can test basic Llama inference completion using the CLI too.

```bash
llama-stack-client inference chat-completion --message "tell me a joke"
```

```
ChatCompletionResponse(
    completion_message=CompletionMessage(
        content="Here's one:\n\nWhat do you call a fake noodle?\n\nAn impasta!",
        role='assistant',
        stop_reason='end_of_turn',
        tool_calls=[]
    ),
    logprobs=None,
    metrics=[
        Metric(metric='prompt_tokens', value=14.0, unit=None),
        Metric(metric='completion_tokens', value=27.0, unit=None),
        Metric(metric='total_tokens', value=41.0, unit=None)
    ]
)
```

:::

### 3. Run inference with Python SDK

Here is a simple example that performs a chat completion using the SDK:

```python
## lstest.py
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# List available models
models = client.models.list()

# Find the first LLM
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier
print("Model:", model_id)

response = client.inference.chat_completion(
    model_id=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},
    ],
)
print(response.completion_message.content)
```

```bash
python lstest.py
```

```
Model: llama3.2:3b-instruct-fp16
Here is a haiku about coding:

Lines of code unfold
Logic flows through digital night
Beauty in the bits
```
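The same call can also stream tokens as they are generated by passing `stream=True`. Here is a minimal sketch; note that the chunk attributes used below (`chunk.event.delta.text`) are an assumption based on the streaming agent events shown in the next step, so verify them against your installed client version:

```python
## lstest_stream.py (illustrative name)
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

model_id = next(m for m in client.models.list() if m.model_type == "llm").identifier

# With stream=True the call returns an iterator of chunks instead of a
# single response object.
stream = client.inference.chat_completion(
    model_id=model_id,
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    # Assumed chunk shape: text deltas carried in chunk.event.delta.text
    delta = chunk.event.delta
    if getattr(delta, "text", None):
        print(delta.text, end="", flush=True)
print()
```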
### 4. Your first agent

```python
## lsagent.py
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
import uuid

client = LlamaStackClient(base_url="http://localhost:8321")

models = client.models.list()
llm = next(m for m in models if m.model_type == "llm")
model_id = llm.identifier

agent = Agent(
    client,
    model=model_id,
    instructions="You are a helpful assistant that can answer questions about the Torchtune project.",
)
s_id = agent.create_session(session_name=f"s{uuid.uuid4()}")

# Non-streaming example
print("Non-streaming ...")
response = agent.create_turn(
    messages=[{"role": "user", "content": "Who are you?"}],
    session_id=s_id,
    stream=False,
)
print("agent>", response.output_message.content)

# Streaming with print helper
print("Streaming with print helper...")
stream = agent.create_turn(
    messages=[{"role": "user", "content": "Who are you?"}],
    session_id=s_id,
    stream=True,
)
for event in AgentEventLogger().log(stream):
    event.print()

# Streaming example (raw events)
print("Streaming ...")
stream = agent.create_turn(
    messages=[{"role": "user", "content": "Who are you?"}],
    session_id=s_id,
    stream=True,
)
for event in stream:
    print(event)
```

**Run the agent**

```bash
python lsagent.py
```

Sample output:

```
Non-streaming ...
agent> I'm an AI assistant, and I'll be happy to help with any questions or information you have about the Torchtune project.

For those who may not know, Torchtune is a popular open-source music composition tool that allows users to create and share musical compositions using a unique visual interface. It's designed to make music creation more accessible and fun for everyone, regardless of their musical background or experience level.

What would you like to know about Torchtune? Are you looking for information on how to use the software, tutorials, or perhaps something else?
Streaming with print helper...
inference> I am an AI assistant specifically designed to provide information and support related to the Torchtune project. I don't have a personal identity in the classical sense, but I'm here to help answer your questions, provide guidance, and offer assistance with any topics related to Torchtune.

I've been trained on a vast amount of text data, including documentation, tutorials, and community discussions about Torchtune, which enables me to provide accurate and up-to-date information. My goal is to be helpful and informative, so feel free to ask me anything you'd like to know about Torchtune!
Streaming ...
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepStartPayload(event_type='step_start', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', metadata={})))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text='I', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text=' am', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
...
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepProgressPayload(delta=TextDelta(text='!', type='text'), event_type='step_progress', step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseStepCompletePayload(event_type='step_complete', step_details=InferenceStep(api_model_response=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 976952, tzinfo=TzInfo(UTC)), started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840716, tzinfo=TzInfo(UTC))), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference')))
AgentTurnResponseStreamChunk(event=TurnResponseEvent(payload=AgentTurnResponseTurnCompletePayload(event_type='turn_complete', turn=Turn(input_messages=[UserMessage(content='Who are you?', role='user', context=None)], output_message=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), session_id='a705b5a1-b9a6-4cf5-a99a-7917cc093755', started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840680, tzinfo=TzInfo(UTC)), steps=[InferenceStep(api_model_response=CompletionMessage(content="I am an artificial intelligence language model designed to assist with a wide range of topics, including the Torchtune project. I'm a computer program created through a process called deep learning, which allows me to understand and generate human-like text.\n\nMy primary function is to provide information, answer questions, and engage in conversation to the best of my abilities based on my training data. I don't have personal experiences, emotions, or consciousness like humans do, but I'm designed to be helpful and informative.\n\nIn the context of Torchtune, I can help with topics such as:\n\n* Providing tutorials and guides\n* Answering questions about the software's features and functionality\n* Offering tips and tricks for using Torchtune effectively\n* Discussing music theory and composition concepts related to Torchtune\n\nFeel free to ask me anything about Torchtune or any other topic, and I'll do my best to help!", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='7d40b848-3ba9-419b-86d9-942fd65698e2', step_type='inference', turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 976952, tzinfo=TzInfo(UTC)), started_at=datetime.datetime(2025, 3, 29, 18, 32, 4, 840716, tzinfo=TzInfo(UTC)))], turn_id='2f0921b0-ece7-4d63-bfde-87f0b08a206a', completed_at=datetime.datetime(2025, 3, 29, 18, 32, 12, 987353, tzinfo=TzInfo(UTC)), output_attachments=[]))))
```
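Because all three turns above share the same `session_id`, the agent retains conversation history within that session. As a quick illustration, here is a follow-up turn you could append to `lsagent.py` (the exact wording of the reply will of course vary):

```python
# Append to lsagent.py: a follow-up turn in the same session. The agent
# can resolve "what you just told me" because s_id carries the history
# of the previous turns.
response = agent.create_turn(
    messages=[{"role": "user", "content": "Summarize what you just told me in one sentence."}],
    session_id=s_id,
    stream=False,
)
print("agent>", response.output_message.content)
```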
### 5. RAG agent

```python
## rag_agent.py
from llama_stack_client import LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger
from llama_stack_client.types import Document
import uuid

client = LlamaStackClient(base_url="http://localhost:8321")

# Create a vector database instance
embedlm = next(m for m in client.models.list() if m.model_type == "embedding")
embedding_model = embedlm.identifier
vdb = next(p for p in client.providers.list() if p.api == "vector_io")
vector_db_id = f"v{uuid.uuid4()}"
client.vector_dbs.register(
    provider_id=vdb.provider_id,
    vector_db_id=vector_db_id,
    embedding_model=embedding_model,
)

# Create Documents
urls = [
    "memory_optimizations.rst",
    "chat.rst",
    "llama3.rst",
    "datasets.rst",
    "qat_finetune.rst",
    "lora_finetune.rst",
]
documents = [
    Document(
        document_id=f"num-{i}",
        content=f"https://raw.githubusercontent.com/pytorch/torchtune/main/docs/source/tutorials/{url}",
        mime_type="text/plain",
        metadata={},
    )
    for i, url in enumerate(urls)
]

# Insert documents
client.tool_runtime.rag_tool.insert(
    documents=documents,
    vector_db_id=vector_db_id,
    chunk_size_in_tokens=512,
)

# Get the model being served
llm = next(m for m in client.models.list() if m.model_type == "llm")
model = llm.identifier

# Create the RAG agent
ragagent = Agent(
    client,
    model=model,
    instructions="You are a helpful assistant that can answer questions about the Torchtune project. Use the RAG tool to answer questions as needed.",
    tools=[
        {
            "name": "builtin::rag",
            "args": {"vector_db_ids": [vector_db_id]},
        }
    ],
)

s_id = ragagent.create_session(session_name=f"s{uuid.uuid4()}")

turns = ["what is torchtune", "tell me about dora"]

for t in turns:
    print("user>", t)
    stream = ragagent.create_turn(
        messages=[{"role": "user", "content": t}],
        session_id=s_id,
        stream=True,
    )
    for chunk in stream:
        if chunk.event.payload.event_type == "step_progress":
            print(chunk.event.payload.delta.text, end="", flush=True)
```

```bash
python rag_agent.py
```

Sample output:

```
user> what is torchtune
inference> [knowledge_search(query='TorchTune')]
tool_execution> Tool:knowledge_search Args:{'query': 'TorchTune'}
tool_execution> Tool:knowledge_search Response:[TextContentItem(text='knowledge_search tool found 5 chunks:\nBEGIN of knowledge_search tool results.\n', type='text'), TextContentItem(text='Result 1:\nDocument_id:num-1\nContent: conversational data, :func:`~torchtune.datasets.chat_dataset` seems to be a good fit. ..., type='text'), TextContentItem(text='END of knowledge_search tool results.\n', type='text')]
inference> Here is a high-level overview of the text:

**LoRA Finetuning with PyTorch Tune**

PyTorch Tune provides a recipe for LoRA (Low-Rank Adaptation) finetuning, which is a technique to adapt pre-trained models to new tasks. The recipe uses the `lora_finetune_distributed` command.
...
Overall, DORA is a powerful reinforcement learning algorithm that can learn complex tasks from human demonstrations. However, it requires careful consideration of the challenges and limitations to achieve optimal results.
```
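You can also query the vector database directly, without going through the agent, to inspect what the `builtin::rag` tool would retrieve. Here is a minimal sketch, assuming the `rag_tool` exposes a `query` counterpart to the `insert` call used above (verify the method name and result shape against your installed client version):

```python
# Direct retrieval against the vector DB populated by rag_agent.py.
# Assumption: rag_tool.query() mirrors rag_tool.insert(); check your
# client version for the exact signature and result attributes.
result = client.tool_runtime.rag_tool.query(
    content="what is torchtune",
    vector_db_ids=[vector_db_id],
)
for item in result.content:
    print(item.text)
```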
## Next Steps

- Learn more about Llama Stack [Concepts](../concepts/index.md)
- Learn how to [Build Llama Stacks](../distributions/index.md)
- See [References](../references/index.md) for more details about the llama CLI and Python SDK
- For example applications and more detailed tutorials, visit our [llama-stack-apps](https://github.com/meta-llama/llama-stack-apps/tree/main/examples) repository.