# Building AI Applications

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1F2ksmkoGQPa4pzRjMOE6BXWeOxWFIW6n?usp=sharing)

Llama Stack provides all the building blocks needed to create sophisticated AI applications. This guide will walk you through how to use these components effectively.

Check out the Colab notebook linked above to follow along with working examples of how to build LLM-powered agentic applications using Llama Stack.

## Basic Inference

The foundation of any AI application is the ability to interact with LLMs. Llama Stack provides a simple interface for both completion and chat-based inference:

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:5001")

# List available models
models = client.models.list()

# Simple chat completion
response = client.inference.chat_completion(
    model_id="Llama3.2-3B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about coding"},
    ],
)
print(response.completion_message.content)
```

## Adding Memory & RAG

Memory enables your applications to reference and recall information from previous interactions or external documents. Llama Stack's memory system is built around the concept of Memory Banks:

1. **Vector Memory Banks**: For semantic search and retrieval
2. **Key-Value Memory Banks**: For structured data storage
3. **Keyword Memory Banks**: For basic text search
4. **Graph Memory Banks**: For relationship-based retrieval

Here's how to set up a vector memory bank for RAG:

```python
# Register a memory bank
bank_id = "my_documents"
response = client.memory_banks.register(
    memory_bank_id=bank_id,
    params={
        "memory_bank_type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size_in_tokens": 512,
    },
)

# Insert documents
documents = [
    {
        "document_id": "doc1",
        "content": "Your document text here",
        "mime_type": "text/plain",
    }
]
client.memory.insert(bank_id, documents)

# Query documents
results = client.memory.query(
    bank_id=bank_id,
    query="What do you know about...",
)
```
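The query results can then be folded into a regular chat completion to answer a question over your documents. Here is a minimal sketch of that retrieve-then-generate step, assuming the query response exposes a `chunks` list whose items carry a `content` field (the exact attribute names may vary between client versions):

```python
# Build a context block from the retrieved chunks.
# NOTE: `results.chunks` / `chunk.content` are assumptions about the response
# shape; adjust the attribute names to match your llama-stack-client version.
context = "\n\n".join(chunk.content for chunk in results.chunks)

rag_response = client.inference.chat_completion(
    model_id="Llama3.2-3B-Instruct",
    messages=[
        {
            "role": "system",
            "content": f"Answer using only the following context:\n\n{context}",
        },
        {"role": "user", "content": "What do you know about..."},
    ],
)
print(rag_response.completion_message.content)
```

The agent framework described below performs this retrieve-and-augment step automatically when a memory tool is configured.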
## Implementing Safety Guardrails

Safety is a critical component of any AI application. Llama Stack provides a Shield system that can be applied at multiple touchpoints:

```python
# Register a safety shield
shield_id = "content_safety"
client.shields.register(
    shield_id=shield_id,
    provider_shield_id="llama-guard-basic",
)

# Run content through shield
response = client.safety.run_shield(
    shield_id=shield_id,
    messages=[{"role": "user", "content": "User message here"}],
)

if response.violation:
    print(f"Safety violation detected: {response.violation.user_message}")
```

## Building Agents

Agents are the heart of complex AI applications. They combine inference, memory, safety, and tool usage into coherent workflows. At its core, an agent follows a sophisticated execution loop that enables multi-step reasoning, tool usage, and safety checks.

### The Agent Execution Loop

Each agent turn follows these key steps:

1. **Initial Safety Check**: The user's input is first screened through configured safety shields
2. **Context Retrieval**:
   - If RAG is enabled, the agent queries relevant documents from memory banks
   - New documents are first inserted into the memory bank
   - The user's prompt is augmented with the retrieved context
3. **Inference Loop**: The agent enters its main execution loop:
   - The LLM receives the augmented prompt (with context and/or previous tool outputs)
   - The LLM generates a response, potentially with tool calls
   - If tool calls are present:
     - Tool inputs are safety-checked
     - Tools are executed (e.g., web search, code execution)
     - Tool responses are fed back to the LLM for synthesis
   - The loop continues until:
     - The LLM provides a final response without tool calls
     - Maximum iterations are reached
     - Token limit is exceeded
4. **Final Safety Check**: The agent's final response is screened through safety shields

```{mermaid}
sequenceDiagram
    participant U as User
    participant E as Executor
    participant M as Memory Bank
    participant L as LLM
    participant T as Tools
    participant S as Safety Shield

    Note over U,S: Agent Turn Start
    U->>S: 1. Submit Prompt
    activate S
    S->>E: Input Safety Check
    deactivate S

    E->>M: 2.1 Query Context
    M-->>E: 2.2 Retrieved Documents

    loop Inference Loop
        E->>L: 3.1 Augment with Context
        L-->>E: 3.2 Response (with/without tool calls)

        alt Has Tool Calls
            E->>S: Check Tool Input
            S->>T: 4.1 Execute Tool
            T-->>E: 4.2 Tool Response
            E->>L: 5.1 Tool Response
            L-->>E: 5.2 Synthesized Response
        end

        opt Stop Conditions
            Note over E: Break if:
            Note over E: - No tool calls
            Note over E: - Max iterations reached
            Note over E: - Token limit exceeded
        end
    end

    E->>S: Output Safety Check
    S->>U: 6. Final Response
```

Each step in this process can be monitored and controlled through configurations. Here's an example that demonstrates monitoring the agent's execution:

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.lib.agents.event_logger import EventLogger
from llama_stack_client.types.agent_create_params import AgentConfig

agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    # Enable both RAG and tool usage
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [{"type": "vector", "bank_id": "my_docs"}],
            "max_tokens_in_context": 4096,
        },
        {
            "type": "code_interpreter",
            "enable_inline_code_execution": True,
        },
    ],
    # Configure safety
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    # Control the inference loop
    max_infer_iters=5,
    sampling_params={
        "strategy": {"type": "top_p", "temperature": 0.7, "top_p": 0.95},
        "max_tokens": 2048,
    },
)

agent = Agent(client, agent_config)
session_id = agent.create_session("monitored_session")

# Stream the agent's execution steps
response = agent.create_turn(
    messages=[{"role": "user", "content": "Analyze this code and run it"}],
    attachments=[
        {
            "content": "https://raw.githubusercontent.com/example/code.py",
            "mime_type": "text/plain",
        }
    ],
    session_id=session_id,
)

# Monitor each step of execution
for log in EventLogger().log(response):
    if log.event.step_type == "memory_retrieval":
        print("Retrieved context:", log.event.retrieved_context)
    elif log.event.step_type == "inference":
        print("LLM output:", log.event.model_response)
    elif log.event.step_type == "tool_execution":
        print("Tool call:", log.event.tool_call)
        print("Tool response:", log.event.tool_response)
    elif log.event.step_type == "shield_call":
        if log.event.violation:
            print("Safety violation:", log.event.violation)
```

This example shows how an agent can:

- Retrieve relevant context from a memory bank before answering
- Run inference and decide when to call tools
- Execute tools such as the code interpreter and feed their results back to the LLM
- Apply safety shields to both its inputs and outputs
- Surface every step of the turn as events that can be monitored

Llama Stack provides a high-level agent framework:

```python
from llama_stack_client.lib.agents.agent import Agent
from llama_stack_client.types.agent_create_params import AgentConfig

# Configure an agent
agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [],
            "query_generator_config": {"type": "default", "sep": " "},
        }
    ],
    input_shields=["content_safety"],
    output_shields=["content_safety"],
    enable_session_persistence=True,
)

# Create an agent
agent = Agent(client, agent_config)
session_id = agent.create_session("my_session")

# Run agent turns
response = agent.create_turn(
    messages=[{"role": "user", "content": "Your question here"}],
    session_id=session_id,
)
```
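Turns that share a `session_id` are part of the same conversation, so you can keep calling `create_turn` to continue it (and with `enable_session_persistence` set, the session is also stored durably). A brief sketch, with an illustrative follow-up prompt:

```python
# Follow-up turn in the same session: the agent still has the context of the
# previous turn available when answering.
followup = agent.create_turn(
    messages=[{"role": "user", "content": "Can you elaborate on your last answer?"}],
    session_id=session_id,
)
```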
"default", "sep": " " } } ], input_shields=["content_safety"], output_shields=["content_safety"], enable_session_persistence=True ) # Create an agent agent = Agent(client, agent_config) session_id = agent.create_session("my_session") # Run agent turns response = agent.create_turn( messages=[{"role": "user", "content": "Your question here"}], session_id=session_id ) ``` ### Adding Tools to Agents ```{toctree} :hidden: :maxdepth: 3 tools ``` Agents can be enhanced with various tools. For detailed information about available tools, their configuration, and providers, see the [Tools](tools.md) documentation. Tools are configured through the `toolgroups` parameter in the agent configuration. Each tool group can be specified either as a string or with additional arguments: ```python from llama_stack_client.lib.agents.agent import Agent from llama_stack_client.types.agent_create_params import AgentConfig agent_config = AgentConfig( model="Llama3.2-3B-Instruct", instructions="You are a helpful assistant", # Configure tool groups toolgroups=[ # Simple string format "builtin::code_interpreter", # With arguments format { "name": "builtin::websearch", "args": { "max_results": 5 } } ], tool_choice="auto", tool_prompt_format="json", # Optional safety configuration input_shields=["content_safety"], output_shields=["content_safety"], # Control the inference loop max_infer_iters=10, sampling_params={ "strategy": { "type": "top_p", "temperature": 0.7, "top_p": 0.95 }, "max_tokens": 2048 } ) agent = Agent(client, agent_config) ``` For details on available tool groups, providers, and their configuration options, refer to the [Tools](tools.md) documentation. ## Building RAG-Enhanced Agents One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example: ```python from llama_stack_client.types import Attachment # Create attachments from documents attachments = [ Attachment( content="https://raw.githubusercontent.com/example/doc.rst", mime_type="text/plain" ) ] # Configure agent with memory agent_config = AgentConfig( model="Llama3.2-3B-Instruct", instructions="You are a helpful assistant", tools=[{ "type": "memory", "memory_bank_configs": [], "query_generator_config": {"type": "default", "sep": " "}, "max_tokens_in_context": 4096, "max_chunks": 10 }], enable_session_persistence=True ) agent = Agent(client, agent_config) session_id = agent.create_session("rag_session") # Initial document ingestion response = agent.create_turn( messages=[{ "role": "user", "content": "I am providing some documents for reference." }], attachments=attachments, session_id=session_id ) # Query with RAG response = agent.create_turn( messages=[{ "role": "user", "content": "What are the key topics in the documents?" }], session_id=session_id ) ``` ## Testing & Evaluation Llama Stack provides built-in tools for evaluating your applications: 1. **Benchmarking**: Test against standard datasets 2. **Application Evaluation**: Score your application's outputs 3. 
For details on available tool groups, providers, and their configuration options, refer to the [Tools](tools.md) documentation.

## Building RAG-Enhanced Agents

One of the most powerful patterns is combining agents with RAG capabilities. Here's a complete example:

```python
from llama_stack_client.types import Attachment

# Create attachments from documents
attachments = [
    Attachment(
        content="https://raw.githubusercontent.com/example/doc.rst",
        mime_type="text/plain",
    )
]

# Configure agent with memory
agent_config = AgentConfig(
    model="Llama3.2-3B-Instruct",
    instructions="You are a helpful assistant",
    tools=[
        {
            "type": "memory",
            "memory_bank_configs": [],
            "query_generator_config": {"type": "default", "sep": " "},
            "max_tokens_in_context": 4096,
            "max_chunks": 10,
        }
    ],
    enable_session_persistence=True,
)

agent = Agent(client, agent_config)
session_id = agent.create_session("rag_session")

# Initial document ingestion
response = agent.create_turn(
    messages=[
        {"role": "user", "content": "I am providing some documents for reference."}
    ],
    attachments=attachments,
    session_id=session_id,
)

# Query with RAG
response = agent.create_turn(
    messages=[
        {"role": "user", "content": "What are the key topics in the documents?"}
    ],
    session_id=session_id,
)
```

## Testing & Evaluation

Llama Stack provides built-in tools for evaluating your applications:

1. **Benchmarking**: Test against standard datasets
2. **Application Evaluation**: Score your application's outputs
3. **Custom Metrics**: Define your own evaluation criteria

Here's how to set up basic evaluation:

```python
# Create an evaluation task
response = client.eval_tasks.register(
    eval_task_id="my_eval",
    dataset_id="my_dataset",
    scoring_functions=["accuracy", "relevance"],
)

# Run evaluation
job = client.eval.run_eval(
    task_id="my_eval",
    task_config={
        "type": "app",
        "eval_candidate": {"type": "agent", "config": agent_config},
    },
)

# Get results
result = client.eval.job_result(
    task_id="my_eval",
    job_id=job.job_id,
)
```

## Debugging & Monitoring

Llama Stack includes comprehensive telemetry for debugging and monitoring your applications:

1. **Tracing**: Track request flows across components
2. **Metrics**: Measure performance and usage
3. **Logging**: Debug issues and track behavior

The telemetry system supports multiple output formats:

- OpenTelemetry for visualization in tools like Jaeger
- SQLite for local storage and querying
- Console output for development

Example of querying traces:

```python
# Query traces for a session
traces = client.telemetry.query_traces(
    attribute_filters=[
        {"key": "session_id", "op": "eq", "value": session_id}
    ]
)

# Get spans within the root span, indexed by ID.
# Use parent_span_id to build a tree out of it.
spans_by_id = client.telemetry.get_span_tree(
    span_id=traces[0].root_span_id
)
```

For details on how to use the telemetry system to debug your applications, export traces to a dataset, and run evaluations, see the [Telemetry](telemetry) section.

```{toctree}
:hidden:
:maxdepth: 3

telemetry
```
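As a final illustration of the `parent_span_id` relationship used in the trace query above, here is a small, hypothetical sketch that turns the flat `spans_by_id` mapping into a printable tree. It assumes each span exposes `span_id`, `parent_span_id`, and `name` attributes, which may differ by client version:

```python
from collections import defaultdict

# Group spans under their parent so the trace can be walked as a tree.
# NOTE: the span attribute names below are assumptions; adjust as needed.
children = defaultdict(list)
for span in spans_by_id.values():
    children[span.parent_span_id].append(span)


def print_tree(span_id: str, depth: int = 0) -> None:
    """Recursively print the children of a span, indented by depth."""
    for child in children.get(span_id, []):
        print("  " * depth + child.name)
        print_tree(child.span_id, depth + 1)


# Walk the tree starting from the root span of the first trace
print_tree(traces[0].root_span_id)
```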